Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License
62 stars 8 forks source link

Read duplication ? #22

Open karlkashofer opened 1 year ago

karlkashofer commented 1 year ago

I use your code in a exome pipeline which includes manta. Manta is super picky about the input bam files and it chokes on one of our current bams with a "Unexpected alignment name collision".

It seems to be caused by a read which is duplicated in the bam file:

`devarea@sx253:~/karl/PathoCromwell/cromwell-executions/Agilent_Exome_Single/43dccd69-b489-4c7b-b92b-bd9d00490356/call-AgilentDedup/execution$ samtools view D2758-8_Trimmed_Mapped_Deduped.bam chr22 | grep "A01664:83:HHTWYDMXY:2:2434:20672:19977_TAAGTCCTGT"

A01664:83:HHTWYDMXY:2:2434:20672:19977_TAAGTCCTGT 99 chr22 22734872 60 100M34S = 22734872 100 GTGACGTTGGTAGTTATAACCGTGTCTCCTGGTACCAGCAGCCCCCAGGCACAGCCCCCAAACTCATGATTTATGAGGTCAGTAATCGGCCCTCAGGGGTTTGTGCTGTACAAATACAAGCTCCTGCCACGGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:100 MC:Z:100M34S AS:i:100 XS:i:72 RG:Z:D2758-8 SA:Z:chrX,43744471,+,95S39M,60,0; A01664:83:HHTWYDMXY:2:2434:20672:19977_TAAGTCCTGT 147 chr22 22734872 60 100M34S = 22734872 -100 GTGACGTTGGTAGTTATAACCGTGTCTCCTGGTACCAGCAGCCCCCAGGCACAGCCCCCAAACTCATGATTTATGAGGTCAGTAATCGGCCCTCAGGGGTTTGTGCTGTACAAATACAAGCTCCTGCCACGGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:100 MC:Z:100M34S AS:i:100 XS:i:72 RG:Z:D2758-8 SA:Z:chrX,43744471,-,95S39M,60,0; A01664:83:HHTWYDMXY:2:2434:20672:19977_TAAGTCCTGT 147 chr22 22734872 60 100M34S = 22734872 -100 GTGACGTTGGTAGTTATAACCGTGTCTCCTGGTACCAGCAGCCCCCAGGCACAGCCCCCAAACTCATGATTTATGAGGTCAGTAATCGGCCCTCAGGGGTTTGTGCTGTACAAATACAAGCTCCTGCCACGGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:100 MC:Z:100M34S AS:i:100 XS:i:72 RG:Z:D2758-8 SA:Z:chrX,43744471,-,95S39M,60,0; `

As you can see, the second read is present twice in the bam produced by umicollapse.

If i do the same grep on the input bam before umicollapse i dont see the duplicate read:

`devarea@sx253:~/karl/PathoCromwell/cromwell-executions/Agilent_Exome_Single/43dccd69-b489-4c7b-b92b-bd9d00490356/call-AgilentDedup/execution$ samtools view ../inputs/816700548/D2758-8.bam chr22 | grep "A01664:83:HHTWYDMXY:2:2434:20672:19977_TAAGTCCTGT"

A01664:83:HHTWYDMXY:2:2434:20672:19977_TAAGTCCTGT 99 chr22 22734872 60 100M34S = 22734872 100 GTGACGTTGGTAGTTATAACCGTGTCTCCTGGTACCAGCAGCCCCCAGGCACAGCCCCCAAACTCATGATTTATGAGGTCAGTAATCGGCCCTCAGGGGTTTGTGCTGTACAAATACAAGCTCCTGCCACGGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:100 MC:Z:100M34S AS:i:100 XS:i:72 RG:Z:D2758-8 SA:Z:chrX,43744471,+,95S39M,60,0; A01664:83:HHTWYDMXY:2:2434:20672:19977_TAAGTCCTGT 147 chr22 22734872 60 100M34S = 22734872 -100 GTGACGTTGGTAGTTATAACCGTGTCTCCTGGTACCAGCAGCCCCCAGGCACAGCCCCCAAACTCATGATTTATGAGGTCAGTAATCGGCCCTCAGGGGTTTGTGCTGTACAAATACAAGCTCCTGCCACGGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:100 MC:Z:100M34S AS:i:100 XS:i:72 RG:Z:D2758-8 SA:Z:chrX,43744471,-,95S39M,60,0; devarea@sx253:~/karl/PathoCromwell/cromwell-executions/Agilent_Exome_Single/43dccd69-b489-4c7b-b92b-bd9d00490356/call-AgilentDedup/execution$ `

Any idea what might have caused that ? regards, KK

Daniel-Liu-c0deb0t commented 1 year ago

Hm, that is definitely weird. I don't know what's wrong off the top of my head. What command did you run for UMICollapse? Does this happen for other bam records?