Closed nh13 closed 8 years ago
af1addc
is a bug that I introduced. My apologies.4060bed
it is better practice to store molecular barcodes (molecular indexes) in the SAM tags rather than in the read name (this is how 10X genomics do it for long reads for example). I have a tool (see: https://github.com/fulcrumgenomics/fgbio/pull/14) that does precisely this. I want to run ConsensusMaker.py
in a mode that retrieves the duplex molecular barcodes from those SAM tags instead of the read name. The default behavior remains: extract the sequence from the read name header.@loeblab let me know if there is anything that needs changing or clarifying. Thanks for considering this pull request!
1) Can you post a line or two from an example sam file? I want to make sure what I have in my mind matches with what you're saying.
2) This change is somewhat redundant with the header info (as you correctly pointed out). However, in the case of consensus making, the header information from the raw reads used in the consensus process makes no sense to keep. That was the reason the header was replaced by the tag, read number, and family size info. I'm happy to merge it as long as it isn't going to mess up any of my downstream processes applications (see point 1).
In this example, the ZA
tag stores the 12bp duplex sequence from read one and the ZB
tag stores the 12bp duplex sequence from read two.
AAAAA:3:11604:13785:10661 99 1 20313 0 108M = 20532 327 TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC EEEEEAEEEEEEEAEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEE6EEEEEEEEEEEEEEEEEEEE/EAEAEEEEAEEAE/EEEE6EEE<EEE X0:i:4 X1:i:1 ZA:Z:TGGCAACAGCAT ZB:Z:CCGAATTTAATC MC:Z:108M MD:Z:108 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 MQ:i:0 XT:A:R
AAAAA:3:21503:19476:19265 99 1 20313 0 108M = 20532 327 TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC EEEEEEEEEEEEEAEEEEEEEEAEEEEA/EEEEEEEEEEEEEEAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEAE/EEEEEEEEEEEEEEEE X0:i:4 X1:i:1 ZA:Z:TGGCAACAGCAT ZB:Z:CCGAATTTAATC MC:Z:108M MD:Z:108 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 MQ:i:0 XT:A:R
I want to keep the sequences in the tags as input to ConsensusMaker instead of in the read name. Modifying the read name may cause either errors or unexpected behavior in downstream toosl (try sorting with Picard's SortSam
for example and run ValidateSamFile
). Anyhow, the output of ConsensusMaker
is not changed, and by default it parses the read name instead of using the tags, so your pipeline should not be affected.
Ah, gotchya. I've worried about this in the past. It may make sense to have this become the default behavior in the future, but I'd have to think about what to put in as the header. If you update the documentation to reflect this new functionality, I'll happily merge this.
On 4/11/16 10:31 AM, Nils Homer wrote:
In this example, the |ZA| tag stores the 12bp duplex sequence from read one and the |ZB| tag stores the 12bp duplex sequence from read two.
|AAAAA:3:11604:13785:10661 99 1 20313 0 108M = 20532 327 TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC EEEEEAEEEEEEEAEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEE6EEEEEEEEEEEEEEEEEEEE/EAEAEEEEAEEAE/EEEE6EEE<EEE X0:i:4 X1:i:1 ZA:Z:TGGCAACAGCAT ZB:Z:CCGAATTTAATC MC:Z:108M MD:Z:108 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 MQ:i:0 XT:A:R AAAAA:3:21503:19476:19265 99 1 20313 0 108M = 20532 327 TAGGAAAAATCAGAGCCAAATGAACCACAGCCCCAAAGGGCACAGTTGAACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCATCCAGTTCATTC EEEEEEEEEEEEEAEEEEEEEEAEEEEA/EEEEEEEEEEEEEEAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEAE/EEEEEEEEEEEEEEEE X0:i:4 X1:i:1 ZA:Z:TGGCAACAGCAT ZB:Z:CCGAATTTAATC MC:Z:108M MD:Z:108 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 MQ:i:0 XT:A:R |
I want to keep the sequences in the tags as /input/ to ConsensusMaker instead of in the read name. Modifying the read name may cause either errors or unexpected behavior in downstream toosl (try sorting with Picard's |SortSam| for example and run |ValidateSamFile|). Anyhow, the output of |ConsensusMaker| is not changed, and by default it parses the read name instead of using the tags, so your pipeline should not be affected.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/loeblab/Duplex-Sequencing/pull/24#issuecomment-208464804
@loeblab thanks for the fast reply. I have added the documentation to the top of ConsensusMaker.py
. Let me know if needs to go elsewhere.
It's not entirely clear what the motivation for this is. Can you explain?