alimanfoo / pysamstats

A fast Python and command-line utility for extracting simple statistics against genome positions based on sequence alignments from a SAM or BAM file.
189 stars 43 forks source link

Disparity between IGV output and pysamstats #52

Closed kwgsmith closed 7 years ago

kwgsmith commented 8 years ago

So I am working on getting data out of a large collection of bam files. Running the command pysamstats - f ref.fa --type variation -D depth file.bam > output.txt gives me a beautiful easy to parse output file.

However, double checking the numbers against IVG, your output gives me something completely different.

IVG Reads from base 1327: readsfromivg

Output from pysamstats for base 1327: incorrectinsertions

I'm unsure why the number of insertions would be incorrect, especially since all of the other stats are correct compared to IVG except for the insertions.

If you need more information I will provide it if I can.

alimanfoo commented 8 years ago

Thanks for reporting this, I'm on leave for a couple of weeks so won't be able to get to this quickly, apologies. I can't think of any obvious reason why insertions would be different, pysamstats is working off a pysam pileup column, details of how the pileup is constructed are within pysam.

If you could provide a small slice of a bam file where this problem manifests I'd be happy to do some digging when I'm back.

On Thursday, 24 March 2016, Kenneth Smith notifications@github.com wrote:

So I am working on getting data out of a large collection of bam files. Running the command pysamstats - f ref.fa --type variation -D depth file.bam > output.txt gives me a beautiful easy to parse output file.

However, double checking the numbers against IVG, your output gives me something completely different.

IVG Reads from base 1327: [image: readsfromivg] https://cloud.githubusercontent.com/assets/8976581/13992073/8fb196bc-f0e8-11e5-9368-5eb57c6f4000.png

Output from pysamstats for base 1327: [image: incorrectinsertions] https://cloud.githubusercontent.com/assets/8976581/13993741/9bce1fe0-f0ef-11e5-8b97-2b86083f50af.PNG

I'm unsure why the number of insertions would be incorrect, especially since all of the other stats are correct compared to IVG except for the insertions.

If you need more information I will provide it if I can.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/alimanfoo/pysamstats/issues/52

Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721

kwgsmith commented 8 years ago

I will ask my boss but as this is current research I'm not sure I can share the bam files. I can try to recreate the problem on a freely available bam file for you though.

Also, is there a way to have it get the data without filtering pcr duplicates? Because that seems to be what it is doing by default and I need it to do both filtered and unfiltered pcr duplicates but I don't see a flag to change that functionality. Does pysam do that already?

Thanks a ton for your quick responses, I really appreciate it!

alimanfoo commented 8 years ago

No worries. If you could provide a slice of a bam where you know what the answer should be at a particular position that would save me a bit of time, a small slice would be fine, no worries if not, I'm sure I can find a site with this problem in the test.bam file already in the repo.

I can't remember what the behaviour is re duplicates, again it'll be inside the pysam pileup function, which probably delegates to samtools, so the answer will be upstream somewhere. Would you mind raising that as a separate issue? Thanks.

On Fri, Mar 25, 2016 at 3:25 AM, Kenneth Smith notifications@github.com wrote:

I will ask my boss but as this is current research I'm not sure I can share the bam files. I can try to recreate the problem on a freely available bam file for you though.

Also, is there a way to have it get the data without filtering pcr duplicates? Because that seems to be what it is doing by default and I need it to do both filtered and unfiltered pcr duplicates but I don't see a flag to change that functionality. Does pysam do that already?

Thanks a ton for your quick responses, I really appreciate it!

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/alimanfoo/pysamstats/issues/52#issuecomment-200982856

Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721

astewart-twist commented 8 years ago

Hey @kwgsmith , I don't know if this is a clue, but it looks like position 1326 in your pysamstats screenshot is reporting 11 insertions... the same number of insertions reported at position 1327 in your IVG screenshot. Is that just a coincidence ?

alimanfoo commented 8 years ago

Well spotted! Yes maybe indel counts are off by one, although other stats match up, not sure why that might be. Would be good to confirm this is not just a coincidence.

On Thursday, 2 June 2016, astewart-twist notifications@github.com wrote:

Hey @kwgsmith https://github.com/kwgsmith , I don't know if this is a clue, but it looks like position 1326 in your pysamstats screenshot is reporting 11 insertions... the same number of insertions reported at position 1327 in your IVG screenshot. Is that just a coincidence ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/alimanfoo/pysamstats/issues/52#issuecomment-223164906, or mute the thread https://github.com/notifications/unsubscribe/AAq8Qmx66yunoPJmkSlPv3lpILKpAcQKks5qHiSYgaJpZM4H3R3K .

Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com alimanfoo@gmail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721

alimanfoo commented 7 years ago

FWIW there are modifications in #67 adding a "no_dup" option to enable filtering of PCR duplicates (which is disabled by default.) Also it looks like insertions are reported one-off relative to IGV, happy to revisit if that behaviour is wrong for any reason. Closing for now but please re-open if need to revisit.