fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
315 stars 67 forks source link

CallDuplexConsensusReads hard-clips ONE of the two duplex reads #831

Open genegolts opened 2 years ago

genegolts commented 2 years ago

Hello fgbio folks, I am encountering an issue where one of the consensus sequences in a duplex pair gets clipped to the insert size while the other read doesn't get clipped. Here is an example. First, these are the UMI reads going in:

@HD VN:1.6 SO:unsorted GO:query SS:unsorted:template-coordinate @SQ SN:22 LN:51304566 @SQ SN:hs37d5 LN:35477943 @RG ID:A SM:GT-30-2903_EOT-plasma LB:GT-30-2903_EOT-plasma PL:illumina A00138:929:HF2Y7DSX3:2:1353:11930:36558 99 22 24313293 60 68S73M1S = 24313293 73 TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGGA FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476687,+,65M77S,60,0; MC:Z:69S73M MD:Z:74 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:74 XS:i:19 RX:Z:CTA-GGT A00138:929:HF2Y7DSX3:2:1353:11930:36558 147 22 24313293 60 69S73M = 24313293 -73 ATTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476686,-,66M76S,60,0; MC:Z:68S73M1S MD:Z:73 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:73 XS:i:19 RX:Z:CTA-GGT A00138:929:HF2Y7DSX3:2:1377:26422:28087 99 22 24313293 60 68S73M1S = 24313293 73 TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGGA FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476687,+,65M77S,60,0; MC:Z:69S73M MD:Z:74 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:74 XS:i:19 RX:Z:CTA-GGT A00138:929:HF2Y7DSX3:2:1377:26422:28087 147 22 24313293 60 69S73M = 24313293 -73 ATTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476686,-,66M76S,60,0; MC:Z:68S73M1S MD:Z:73 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:73 XS:i:19 RX:Z:CTA-GGT

Notice that the pairs overlap each other almost fully, with a 1bp overhang. The insert size here was set based on the alignment, so are the soft-clip tags. This is not the true insert size, and I suspect this has something to do with the odd output

Here is the output of the program ( java -jar -Xmx32G -XX:ActiveProcessorCount=1 /home/ggoltsman/utils/fgbio/target/scala-2.13/fgbio-2.0.2-57a72b4-SNAPSHOT.jar CallDuplexConsensusReads --min-reads=4 0 0 --consensus-call-overlapping-bases false --input=UMI_1153019.s.bam ). The second consensus read is clipped to 73 bp while the first one is unclipped:

GT-30-2903_EOT-plasma:1153019 77 0 0 0 0 TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqfqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqfqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq aD:i:5 bD:i:1 cD:i:6 aE:f:0 bE:f:0 cE:f:0 RG:Z:A MI:Z:1153019 aM:i:5 bM:i:1 cM:i:6 RX:Z:CTA-GGT ac:Z:TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG bc:Z:TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG ad:B:s,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 bd:B:s,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ae:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,be:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 aq:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN bq:Z:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD9DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD9DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD GT-30-2903_EOT-plasma:1153019 141 0 0 0 0 CCAGCTGGGGCCTTCCTGCAGACTGCTGACCTGAGGCTCCTGTCCTCAGCCCCAAAATGGGACCAGACCTCTG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq aD:i:5 bD:i:1 cD:i:6 aE:f:0 bE:f:0 cE:f:0 RG:Z:A MI:Z:1153019 aM:i:5 bM:i:1 cM:i:6 RX:Z:CTA-GGT ac:Z:CCAGCTGGGGCCTTCCTGCAGACTGCTGACCTGAGGCTCCTGTCCTCAGCCCCAAAATGGGACCAGACCTCTG bc:Z:CCAGCTGGGGCCTTCCTGCAGACTGCTGACCTGAGGCTCCTGTCCTCAGCCCCAAAATGGGACCAGACCTCTG ad:B:s,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 bd:B:s,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,ae:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 be:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 aq:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN bq:Z:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

I've verified that if I change the insert size in the input bam to reflect the full read length then there is no clipping in the output. This is the behavior I want, and I'm perfectly happy to change my upstream steps to have the insert size reflect the reality, but I would really appreciate if someone explained what's behind the uneven clipping so that I understand the program better.

Thank you in advance!

nh13 commented 2 years ago

@genegolts what version of fgbio are you using? There was some behavior difference between 2.0 and 1.x.

genegolts commented 2 years ago

It's v. 2.0.2

nh13 commented 2 years ago

Will need to fine to to look at this, that's going to be tough for this week. My apologies for the delay.

nh13 commented 2 years ago

Can you run fgbio SetMateInformation for now?

genegolts commented 2 years ago

Just ran it. It outputs reads identical to my inputs. Thanks for looking into this!

genegolts commented 2 years ago

Just to add visuals to the story, here's a slide describing the problem. image

jrm5100 commented 2 years ago

We've run into this bug as well. We noticed a drop in coverage after FilterDuplexConsensusReads when upgrading to fgbio 2.0.2 despite GroupReadsByUmi being more permissive (keeping pairs with one unmapped read) which resulted in more consensus reads going into that step.

Running each step on identical input, FilterDuplexConsensusReads is unchanged, but CallDuplexConsensusReads trims some pairs differently. I can't provide sequences here, but this information may help narrow it down. These inputs are from the output of GroupReadsByUmi using fgbio v1.6

A pair that gets trimmed correctly

NB501910_HNMCVBGX7:2:11110:16491:14314  99      chr12   25245156        60      135M7S  =       25245156        135     
NB501910_HNMCVBGX7:2:11110:16491:14314  147     chr12   25245156        60      7S135M  =       25245156        -135 

A pair that results in one untrimmed read

Note that the problematic read has 79M with an 81bp template size.

NB501910_HNMCVBGX7:1:11301:15626:18726  83      chr12   112489160       60      61S81M  =       112489160       -81     
NB501910_HNMCVBGX7:1:11301:15626:18726  163     chr12   112489160       60      79M63S  =       112489160       81      

Using fgbio v1.6 results in consensus pairs with 135bp reads and 81bp respectively.

Using fgbio v2 results in identical results except that the second read in the second pair is not clipped and is 142 bp long.

Our best guess is that this is related to this change: https://github.com/fulcrumgenomics/fgbio/pull/761

mjhipp commented 2 years ago

@genegolts would you mind testing your case with the changes from #842?

genegolts commented 2 years ago

Pulled commit 3f1e664 and re-built the target. Still getting the same unevenly clipped consensus pair. Would it help if I uploaded the mini-bam file I'm testing this on?

mjhipp commented 2 years ago

@genegolts I think I can explain your case. During consensus making, the consensus maker will clip bases that extend past the mate. It does this by comparing the read's left-most position to the mate's left-most position (this change is introduced in fgbio v2). So when looking at the negative strand, it is comparing to the positive strand's primary alignment start position (which is half way through the read). So it then clips off all the bases in the negative strand that extend past that. This would not be fixed by #842 and will need a different fix, although still related to #761

genegolts commented 2 years ago

@mjhipp Thank you for the explanation, it makes perfect sense. In my use case scenario, this behavior results in the sequence at chimeric or fusion breakpoints getting hard-clipped. I can see how in most situations you would indeed want to exclude unaligned sequence from the consensus, but in my case I actually don't want that to happen. I am wondering if I could make a feature request to add a parameter that would disable this type of hard-clipping.
In any case, thank you very much for answering my questions!

nh13 commented 2 years ago

@genegolts feature requests welcome, though to be transparent we’re all volunteer and so there’d be no time line on when we’d look at it (unless it one of clients wanted us to look at it).

genegolts commented 2 years ago

It's v. 2.0.2

On Mon, Apr 25, 2022 at 1:30 PM Nils Homer @.***> wrote:

@genegolts https://github.com/genegolts what version of fgbio are you using? There was some behavior difference between 2.0 and 1.x.

— Reply to this email directly, view it on GitHub https://github.com/fulcrumgenomics/fgbio/issues/831#issuecomment-1109010643, or unsubscribe https://github.com/notifications/unsubscribe-auth/AY4NOR6NI5OJQZIQM2XSXFTVG36FXANCNFSM5UJR5UCQ . You are receiving this because you were mentioned.Message ID: @.***>

--

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. This message may contain privileged and / or confidential  information. If you are NOT the intended recipient of this message, copying, printing, disseminating, forwarding or any other use or action derived from its content is strictly prohibited. Please notify the sender immediately by e-mail if you have received this e-mail by error and delete this e-mail from your system. If you received the email by error and this message contains patient information, please report the error by contacting the Personalis Clinical Laboratory at @. @.>.

genegolts commented 2 years ago

Hello again. One common use case for consensus reads is to detect "split alignments" as a sign of structural variation in a genome, and in this scenario you would absolutely want unclipped consensus sequences going into the aligner. This presents a dilemma since in the beginning of the process, in order to generate umi groups, clipped aligned positions of the raw reads are used. I would still argue though that once the UMI group has been identified, the consensus generation step should be able to (optionally) use the entire sequence of the constituent reads, without aggressive hard-clipping. (More sensitive criteria could be added, for example, recognizing when the negative strand protrudes past the aligned start of the positive strand which is itself soft-clipped, i.e., evidence that the aligned start might not be the actual start of the molecule.)

In other words, the resulting duplex consensus sequence in this scenario would not necessarily represent the start/end of the molecule but would be a true consensus of the reads obtained from that molecule. I think this feature would be welcomed by many users and would add value to this already excellent tool! I'd love to hear your thoughts on this.