alimanfoo / pysamstats

A fast Python and command-line utility for extracting simple statistics against genome positions based on sequence alignments from a SAM or BAM file.
192 stars 43 forks source link

Consensus variation stats for paired end? #60

Open astewart-twist opened 7 years ago

astewart-twist commented 7 years ago

Is it currently possible in pysamstats to generate consensus variation stats for paired end reads? As far as I can tell if a mismatch is on both reads for a given position, it gets counted twice.

alimanfoo commented 7 years ago

Hi Andrew, I'm not sure I understand what you mean. Could you elaborate?

On Wednesday, January 18, 2017, Andrew Stewart notifications@github.com wrote:

Is it currently possible in pysamstats to generate consensus variation stats for paired end reads? As far as I can tell if a mismatch is on both reads for a given position, it gets counted twice.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/alimanfoo/pysamstats/issues/60, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8Qlk8DNEWuPvAZ1xJsOutKaQzhNmaks5rToGrgaJpZM4LnZ1E .

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721

astewart-twist commented 7 years ago

Sure. As far as I can tell, it looks like pysamstats considers the same insertion/deletion/mismatch observed on both forward and reverse reads at the same position as 2 separate observations. For example, in the following alignment...

fwd: ====C====
rev: ====G====

ref: ACGTATGCA

...pysamstats would report:

Biologically, one might want to interpret this as a single event. I'm just wondering if there exists an option in pysamstats to do that.

alimanfoo commented 7 years ago

Hi Andrew, no there is no option currently to generate a consensus sequence from variation stats. The variation stats count the total number of reads supporting a mismatch, indel etc. at each base. Is it just a consensus sequence you'd like to generate?

On Mon, Jan 23, 2017 at 7:30 PM, Andrew Stewart notifications@github.com wrote:

Sure. As far as I can tell, it looks like pysamstats considers the same insertion/deletion/mismatch observed on both forward and reverse reads at the same position as 2 separate observations. For example, in the following alignment...

fwd: ====C==== rev: ====G====

ref: ACGTATGCA

...pysamstats would report:

  • mismatches: 2
  • mismatches_fwd: 1
  • mismatches_rev: 1

Biologically, one might want to interpret this as a single event. I'm just wondering if there exists an option in pysamstats to do that.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/alimanfoo/pysamstats/issues/60#issuecomment-274591997, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QjTx30mbrSvasdNQDSmzTImV1X4Sks5rVP_ngaJpZM4LnZ1E .

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721

astewart-twist commented 7 years ago

Not the consensus sequence, Alistair, just the consensus alignment stats. If you want to point to the relevant lines of code where something like that might best fit, I can try to take a stab at adding it as a new output (variation_consensus?).

On Tue, Jan 24, 2017 at 6:58 AM Alistair Miles notifications@github.com wrote:

Hi Andrew, no there is no option currently to generate a consensus sequence from variation stats. The variation stats count the total number of reads supporting a mismatch, indel etc. at each base. Is it just a consensus sequence you'd like to generate?

On Mon, Jan 23, 2017 at 7:30 PM, Andrew Stewart notifications@github.com wrote:

Sure. As far as I can tell, it looks like pysamstats considers the same insertion/deletion/mismatch observed on both forward and reverse reads at the same position as 2 separate observations. For example, in the following alignment...

fwd: ====C==== rev: ====G====

ref: ACGTATGCA

...pysamstats would report:

  • mismatches: 2
  • mismatches_fwd: 1
  • mismatches_rev: 1

Biologically, one might want to interpret this as a single event. I'm just wondering if there exists an option in pysamstats to do that.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/alimanfoo/pysamstats/issues/60#issuecomment-274591997>, or mute the thread < https://github.com/notifications/unsubscribe-auth/AAq8QjTx30mbrSvasdNQDSmzTImV1X4Sks5rVP_ngaJpZM4LnZ1E

.

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721 <+44%201865%20287721>

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/alimanfoo/pysamstats/issues/60#issuecomment-274827608, or mute the thread https://github.com/notifications/unsubscribe-auth/ANnwFISuDw8Q24fns4P2jPOPCt7-9YFQks5rVhGvgaJpZM4LnZ1E .

alimanfoo commented 7 years ago

Hi Andrew,

Apologies, I am being dim and still not getting it. Could you walk me through a complete example? E.g., say a site has reference nucleotide "A", in the pileup there are 2 reads supporting the reference nucleotide, 3 reads supporting a mismatching nucleotide "C", 4 reads supporting a 2 bp insertion, and 5 reads supporting a 3 bp deletion (a bit crazy but not impossible). What would be the output from variation_consensus at this site?

The implementation of variation statistics starts from here: https://github.com/alimanfoo/pysamstats/blob/master/pysamstats.pyx#L684. Please feel free to submit a PR, and shout if you need any info on compiling or running the tests.

Cheers, Alistair

On Tuesday, January 31, 2017, Andrew Stewart notifications@github.com wrote:

Not the consensus sequence, Alistair, just the consensus alignment stats. If you want to point to the relevant lines of code where something like that might best fit, I can try to take a stab at adding it as a new output (variation_consensus?).

On Tue, Jan 24, 2017 at 6:58 AM Alistair Miles <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

Hi Andrew, no there is no option currently to generate a consensus sequence from variation stats. The variation stats count the total number of reads supporting a mismatch, indel etc. at each base. Is it just a consensus sequence you'd like to generate?

On Mon, Jan 23, 2017 at 7:30 PM, Andrew Stewart < notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

Sure. As far as I can tell, it looks like pysamstats considers the same insertion/deletion/mismatch observed on both forward and reverse reads at the same position as 2 separate observations. For example, in the following alignment...

fwd: ====C==== rev: ====G====

ref: ACGTATGCA

...pysamstats would report:

  • mismatches: 2
  • mismatches_fwd: 1
  • mismatches_rev: 1

Biologically, one might want to interpret this as a single event. I'm just wondering if there exists an option in pysamstats to do that.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/alimanfoo/pysamstats/issues/60#issuecomment-274591997 , or mute the thread < https://github.com/notifications/unsubscribe-auth/ AAq8QjTx30mbrSvasdNQDSmzTImV1X4Sks5rVP_ngaJpZM4LnZ1E

.

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com javascript:_e(%7B%7D,'cvml','alimanfoo@googlemail.com'); Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721 <+44%201865%20287721>

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/alimanfoo/pysamstats/issues/60# issuecomment-274827608, or mute the thread https://github.com/notifications/unsubscribe-auth/ ANnwFISuDw8Q24fns4P2jPOPCt7-9YFQks5rVhGvgaJpZM4LnZ1E .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/alimanfoo/pysamstats/issues/60#issuecomment-276519282, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8Qrg8xaedZGGYduw13w4XWfmkbfDEks5rX7rhgaJpZM4LnZ1E .

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org The Wellcome Trust Centre for Human Genetics Roosevelt Drive Oxford OX3 7BN United Kingdom Email: alimanfoo@googlemail.com Web: http://purl.org/net/aliman Twitter: https://twitter.com/alimanfoo Tel: +44 (0)1865 287721