Kennedy-Lab-UW / Duplex-Sequencing

Other
57 stars 34 forks source link

ConsensusMaker can read duplex tag sequences from a SAM tag (or tags). #24

Closed nh13 closed 8 years ago

scottrk commented 8 years ago

It's not entirely clear what the motivation for this is. Can you explain?

nh13 commented 8 years ago
  1. af1addc is a bug that I introduced. My apologies.
  2. 4060bed it is better practice to store molecular barcodes (molecular indexes) in the SAM tags rather than in the read name (this is how 10X genomics do it for long reads for example). I have a tool (see: https://github.com/fulcrumgenomics/fgbio/pull/14) that does precisely this. I want to run ConsensusMaker.py in a mode that retrieves the duplex molecular barcodes from those SAM tags instead of the read name. The default behavior remains: extract the sequence from the read name header.
nh13 commented 8 years ago

@loeblab let me know if there is anything that needs changing or clarifying. Thanks for considering this pull request!

scottrk commented 8 years ago

1) Can you post a line or two from an example sam file? I want to make sure what I have in my mind matches with what you're saying.

2) This change is somewhat redundant with the header info (as you correctly pointed out). However, in the case of consensus making, the header information from the raw reads used in the consensus process makes no sense to keep. That was the reason the header was replaced by the tag, read number, and family size info. I'm happy to merge it as long as it isn't going to mess up any of my downstream processes applications (see point 1).

nh13 commented 8 years ago

In this example, the ZA tag stores the 12bp duplex sequence from read one and the ZB tag stores the 12bp duplex sequence from read two.

AAAAA:3:11604:13785:10661       99      1       20313   0       108M    =       20532   327     TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC    EEEEEAEEEEEEEAEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEE6EEEEEEEEEEEEEEEEEEEE/EAEAEEEEAEEAE/EEEE6EEE<EEE    X0:i:4  X1:i:1  ZA:Z:TGGCAACAGCAT       ZB:Z:CCGAATTTAATC       MC:Z:108M       MD:Z:108        XG:i:0  AM:i:0  NM:i:0  SM:i:0  XM:i:0  XO:i:0  MQ:i:0  XT:A:R
AAAAA:3:21503:19476:19265       99      1       20313   0       108M    =       20532   327     TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC    EEEEEEEEEEEEEAEEEEEEEEAEEEEA/EEEEEEEEEEEEEEAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEAE/EEEEEEEEEEEEEEEE    X0:i:4  X1:i:1  ZA:Z:TGGCAACAGCAT       ZB:Z:CCGAATTTAATC       MC:Z:108M       MD:Z:108        XG:i:0  AM:i:0  NM:i:0  SM:i:0  XM:i:0  XO:i:0  MQ:i:0  XT:A:R

I want to keep the sequences in the tags as input to ConsensusMaker instead of in the read name. Modifying the read name may cause either errors or unexpected behavior in downstream toosl (try sorting with Picard's SortSam for example and run ValidateSamFile). Anyhow, the output of ConsensusMaker is not changed, and by default it parses the read name instead of using the tags, so your pipeline should not be affected.

scottrk commented 8 years ago

Ah, gotchya. I've worried about this in the past. It may make sense to have this become the default behavior in the future, but I'd have to think about what to put in as the header. If you update the documentation to reflect this new functionality, I'll happily merge this.

On 4/11/16 10:31 AM, Nils Homer wrote:

In this example, the |ZA| tag stores the 12bp duplex sequence from read one and the |ZB| tag stores the 12bp duplex sequence from read two.

|AAAAA:3:11604:13785:10661 99 1 20313 0 108M = 20532 327 TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC EEEEEAEEEEEEEAEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEE6EEEEEEEEEEEEEEEEEEEE/EAEAEEEEAEEAE/EEEE6EEE<EEE X0:i:4 X1:i:1 ZA:Z:TGGCAACAGCAT ZB:Z:CCGAATTTAATC MC:Z:108M MD:Z:108 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 MQ:i:0 XT:A:R AAAAA:3:21503:19476:19265 99 1 20313 0 108M = 20532 327 TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC EEEEEEEEEEEEEAEEEEEEEEAEEEEA/EEEEEEEEEEEEEEAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEAE/EEEEEEEEEEEEEEEE X0:i:4 X1:i:1 ZA:Z:TGGCAACAGCAT ZB:Z:CCGAATTTAATC MC:Z:108M MD:Z:108 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 MQ:i:0 XT:A:R |

I want to keep the sequences in the tags as /input/ to ConsensusMaker instead of in the read name. Modifying the read name may cause either errors or unexpected behavior in downstream toosl (try sorting with Picard's |SortSam| for example and run |ValidateSamFile|). Anyhow, the output of |ConsensusMaker| is not changed, and by default it parses the read name instead of using the tags, so your pipeline should not be affected.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/loeblab/Duplex-Sequencing/pull/24#issuecomment-208464804

nh13 commented 8 years ago

@loeblab thanks for the fast reply. I have added the documentation to the top of ConsensusMaker.py. Let me know if needs to go elsewhere.