Open genegolts opened 2 years ago
@genegolts what version of fgbio are you using? There was some behavior difference between 2.0 and 1.x.
It's v. 2.0.2
I will need to find time to look at this, and that's going to be tough this week. My apologies for the delay.
Can you run fgbio SetMateInformation for now?
Just ran it. It outputs reads identical to my inputs. Thanks for looking into this!
Just to add visuals to the story, here's a slide describing the problem.
We've run into this bug as well. We noticed a drop in coverage after FilterDuplexConsensusReads when upgrading to fgbio 2.0.2, despite GroupReadsByUmi being more permissive (keeping pairs with one unmapped read), which resulted in more consensus reads going into that step.
Running each step on identical input, FilterDuplexConsensusReads is unchanged, but CallDuplexConsensusReads trims some pairs differently. I can't provide sequences here, but this information may help narrow it down. These inputs are from the output of GroupReadsByUmi using fgbio v1.6.
A pair that gets trimmed correctly
NB501910_HNMCVBGX7:2:11110:16491:14314 99 chr12 25245156 60 135M7S = 25245156 135
NB501910_HNMCVBGX7:2:11110:16491:14314 147 chr12 25245156 60 7S135M = 25245156 -135
A pair that results in one untrimmed read
Note that the problematic read has 79M with an 81bp template size.
NB501910_HNMCVBGX7:1:11301:15626:18726 83 chr12 112489160 60 61S81M = 112489160 -81
NB501910_HNMCVBGX7:1:11301:15626:18726 163 chr12 112489160 60 79M63S = 112489160 81
Using fgbio v1.6 results in consensus pairs with 135bp and 81bp reads, respectively.
Using fgbio v2 results in identical results except that the second read in the second pair is not clipped and is 142 bp long.
Our best guess is that this is related to this change: https://github.com/fulcrumgenomics/fgbio/pull/761
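For reference, the CIGAR arithmetic behind the two pairs above can be checked with a short Python sketch. It is only an illustration: it applies the standard SAM rule for which CIGAR operators consume read bases and which consume reference bases, and `cigar_lengths` is a made-up helper name, not part of fgbio.

```python
import re

# One CIGAR element: a count followed by a SAM operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return (read_length, reference_span) for a CIGAR string.

    Hypothetical helper for illustration only; not an fgbio function.
    """
    read_len = 0
    ref_span = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":  # operators that consume read bases
            read_len += n
        if op in "MDN=X":  # operators that consume reference bases
            ref_span += n
    return read_len, ref_span


# (CIGAR, TLEN) for the four records quoted above.
records = [
    ("135M7S", 135),
    ("7S135M", -135),
    ("61S81M", -81),
    ("79M63S", 81),
]

for cigar, tlen in records:
    read_len, ref_span = cigar_lengths(cigar)
    print(cigar, read_len, ref_span, abs(tlen), ref_span == abs(tlen))
```

All four reads are 142 bases long. Only 79M63S has a reference span (79) that differs from the absolute template length (81), which is the read that comes out unclipped in v2.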
@genegolts would you mind testing your case with the changes from #842?
Pulled commit 3f1e664 and re-built the target. Still getting the same unevenly clipped consensus pair. Would it help if I uploaded the mini-bam file I'm testing this on?
@genegolts I think I can explain your case. During consensus making, the consensus maker will clip bases that extend past the mate. It does this by comparing the read's left-most position to the mate's left-most position (this change was introduced in fgbio v2). So when looking at the negative strand, it is comparing to the positive strand's primary alignment start position (which is halfway through the read). It then clips off all the bases in the negative strand that extend past that. This would not be fixed by #842 and will need a different fix, although it is still related to #761.
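The comparison described above can be illustrated with a toy model in Python. This is only a sketch of the explanation, not fgbio's code: the function names are invented, and the model assumes that "extending past the mate" means lying to the left of the mate's alignment start once leading soft-clipped bases are counted. The inputs are the reverse-strand records quoted earlier in this thread.

```python
import re


def leading_soft_clip(cigar):
    """Number of soft-clipped bases at the left end of a CIGAR string.

    Hypothetical helper; an optional leading hard clip is skipped.
    """
    match = re.match(r"^(?:\d+H)?(\d+)S", cigar)
    return int(match.group(1)) if match else 0


def bases_left_of_mate(start, cigar, mate_start):
    """Count read bases that lie left of the mate's alignment start.

    Toy model (an assumption, not fgbio's implementation): the read's
    unclipped start is its alignment start minus its leading soft clip.
    """
    unclipped_start = start - leading_soft_clip(cigar)
    return max(0, mate_start - unclipped_start)


# Reverse-strand read 61S81M at chr12:112489160, mate start 112489160.
overhang = bases_left_of_mate(112489160, "61S81M", 112489160)
print(overhang, 142 - overhang)

# Reverse-strand read 7S135M at chr12:25245156, mate start 25245156.
overhang = bases_left_of_mate(25245156, "7S135M", 25245156)
print(overhang, 142 - overhang)
```

Under this model the two reverse-strand reads lose 61 and 7 bases, leaving 81 and 135 bases, which matches the consensus lengths reported for those reads. It says nothing about how the forward-strand mate is handled.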
@mjhipp Thank you for the explanation, it makes perfect sense. In my use case scenario, this behavior results in the sequence at chimeric or fusion breakpoints getting hard-clipped. I can see how in most situations you would indeed want to exclude unaligned sequence from the consensus, but in my case I actually don't want that to happen. I am wondering if I could make a feature request to add a parameter that would disable this type of hard-clipping.
In any case, thank you very much for answering my questions!
@genegolts feature requests are welcome, though to be transparent, we're all volunteers, so there'd be no timeline on when we'd look at it (unless one of our clients wanted us to look at it).
Hello again. One common use case for consensus reads is to detect "split alignments" as a sign of structural variation in a genome, and in this scenario you would absolutely want unclipped consensus sequences going into the aligner. This presents a dilemma, since at the beginning of the process the clipped aligned positions of the raw reads are used to generate UMI groups. I would still argue, though, that once the UMI group has been identified, the consensus generation step should be able to (optionally) use the entire sequence of the constituent reads, without aggressive hard-clipping. (More sensitive criteria could be added, for example, recognizing when the negative strand protrudes past the aligned start of the positive strand which is itself soft-clipped, i.e., evidence that the aligned start might not be the actual start of the molecule.)
In other words, the resulting duplex consensus sequence in this scenario would not necessarily represent the start/end of the molecule but would be a true consensus of the reads obtained from that molecule. I think this feature would be welcomed by many users and would add value to this already excellent tool! I'd love to hear your thoughts on this.
Hello fgbio folks, I am encountering an issue where one of the consensus sequences in a duplex pair gets clipped to the insert size while the other read doesn't get clipped. Here is an example. First, these are the UMI reads going in:
Notice that the pairs overlap each other almost fully, with a 1bp overhang. The insert size here was set based on the alignment, as were the soft-clip tags. This is not the true insert size, and I suspect this has something to do with the odd output.
Here is the output of the program ( java -jar -Xmx32G -XX:ActiveProcessorCount=1 /home/ggoltsman/utils/fgbio/target/scala-2.13/fgbio-2.0.2-57a72b4-SNAPSHOT.jar CallDuplexConsensusReads --min-reads=4 0 0 --consensus-call-overlapping-bases false --input=UMI_1153019.s.bam ). The second consensus read is clipped to 73 bp while the first one is unclipped:
I've verified that if I change the insert size in the input bam to reflect the full read length then there is no clipping in the output. This is the behavior I want, and I'm perfectly happy to change my upstream steps to have the insert size reflect the reality, but I would really appreciate if someone explained what's behind the uneven clipping so that I understand the program better.
Thank you in advance!