Closed kwgsmith closed 7 years ago
Thanks for reporting this, I'm on leave for a couple of weeks so won't be able to get to this quickly, apologies. I can't think of any obvious reason why insertions would be different, pysamstats is working off a pysam pileup column, details of how the pileup is constructed are within pysam.
If you could provide a small slice of a bam file where this problem manifests I'd be happy to do some digging when I'm back.
On Thursday, 24 March 2016, Kenneth Smith notifications@github.com wrote:
So I am working on getting data out of a large collection of bam files. Running the command pysamstats - f ref.fa --type variation -D depth file.bam > output.txt gives me a beautiful easy to parse output file.
However, double checking the numbers against IVG, your output gives me something completely different.
IVG Reads from base 1327: [image: readsfromivg] https://cloud.githubusercontent.com/assets/8976581/13992073/8fb196bc-f0e8-11e5-9368-5eb57c6f4000.png
Output from pysamstats for base 1327: [image: incorrectinsertions] https://cloud.githubusercontent.com/assets/8976581/13993741/9bce1fe0-f0ef-11e5-8b97-2b86083f50af.PNG
I'm unsure why the number of insertions would be incorrect, especially since all of the other stats are correct compared to IVG except for the insertions.
If you need more information I will provide it if I can.
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/alimanfoo/pysamstats/issues/52
Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721
I will ask my boss but as this is current research I'm not sure I can share the bam files. I can try to recreate the problem on a freely available bam file for you though.
Also, is there a way to have it get the data without filtering pcr duplicates? Because that seems to be what it is doing by default and I need it to do both filtered and unfiltered pcr duplicates but I don't see a flag to change that functionality. Does pysam do that already?
Thanks a ton for your quick responses, I really appreciate it!
No worries. If you could provide a slice of a bam where you know what the answer should be at a particular position that would save me a bit of time, a small slice would be fine, no worries if not, I'm sure I can find a site with this problem in the test.bam file already in the repo.
I can't remember what the behaviour is re duplicates, again it'll be inside the pysam pileup function, which probably delegates to samtools, so the answer will be upstream somewhere. Would you mind raising that as a separate issue? Thanks.
On Fri, Mar 25, 2016 at 3:25 AM, Kenneth Smith notifications@github.com wrote:
I will ask my boss but as this is current research I'm not sure I can share the bam files. I can try to recreate the problem on a freely available bam file for you though.
Also, is there a way to have it get the data without filtering pcr duplicates? Because that seems to be what it is doing by default and I need it to do both filtered and unfiltered pcr duplicates but I don't see a flag to change that functionality. Does pysam do that already?
Thanks a ton for your quick responses, I really appreciate it!
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/alimanfoo/pysamstats/issues/52#issuecomment-200982856
Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721
Hey @kwgsmith , I don't know if this is a clue, but it looks like position 1326 in your pysamstats screenshot is reporting 11 insertions... the same number of insertions reported at position 1327 in your IVG screenshot. Is that just a coincidence ?
Well spotted! Yes maybe indel counts are off by one, although other stats match up, not sure why that might be. Would be good to confirm this is not just a coincidence.
On Thursday, 2 June 2016, astewart-twist notifications@github.com wrote:
Hey @kwgsmith https://github.com/kwgsmith , I don't know if this is a clue, but it looks like position 1326 in your pysamstats screenshot is reporting 11 insertions... the same number of insertions reported at position 1327 in your IVG screenshot. Is that just a coincidence ?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/alimanfoo/pysamstats/issues/52#issuecomment-223164906, or mute the thread https://github.com/notifications/unsubscribe/AAq8Qmx66yunoPJmkSlPv3lpILKpAcQKks5qHiSYgaJpZM4H3R3K .
Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721
FWIW there are modifications in #67 adding a "no_dup" option to enable filtering of PCR duplicates (which is disabled by default.) Also it looks like insertions are reported one-off relative to IGV, happy to revisit if that behaviour is wrong for any reason. Closing for now but please re-open if need to revisit.
So I am working on getting data out of a large collection of bam files. Running the command pysamstats - f ref.fa --type variation -D depth file.bam > output.txt gives me a beautiful easy to parse output file.
However, double checking the numbers against IVG, your output gives me something completely different.
IVG Reads from base 1327:
Output from pysamstats for base 1327:
I'm unsure why the number of insertions would be incorrect, especially since all of the other stats are correct compared to IVG except for the insertions.
If you need more information I will provide it if I can.