Closed e-sollier closed 9 months ago
Hi @e-sollier,
As one of the people behind many of the discussions on the htslib Github Issues, I know well this issue. modkit
(and its forerunner modbam2bed
) ignores non-primary alignments when enumerating frequencies. This is largely OK, the primary issue is in the case of structural variation with alignment of a read becoming split into a primary and supplementary alignment. Secondary alignments are slightly different in that counting bases from them would have the major effect of leading to double counting.
I suspect enabling -Y
would make the output files egregiously large for many users; if we were to implement a change here, the decision would be to perform these steps optionally. I think also that the considerations here are more general than this workflow; I'm thinking how much of this should be the responsibility of the underlying tools to resolve. For example:
Some downstream tools like IGV apparently do not check for hard clipping and completely misinterpret the methylation calls
is something that ought to be changed in IGV: even if we amend this workflow, IGV should still make allowance for the fact that the SEQ, CIGAR, and MM fields can become inconsistent after SAM records are (not)-handled by programs like minimap2.
Thanks for your reply!
I agree that secondary alignments are problematic and I am fine with them being ignored. However, I am interested in methylation around breakpoints so supplementary alignments are important to me. For each read that spans a breakpoint, there is one primary alignment on one side and one (shorter) supplementary alignment on the other side, so if we ignore supplementary alignments we essentially lose half of the methylation data around the breakpoint. Hence my suggestion to use soft clipping for supplementary alignments so that we can use the methylation information for them. But I agree that for most users supplementary alignments may not be very important, so if using soft clipping for supplementary alignments does substantially increase the bam file sizes it may not be best to use this as the default. I will test how much the minimap2 -Y option impacts the size of the bam files though, because considering that there are not that many supplementary alignments, I would not expect a very big increase.
The issue is indeed more general than this workflow. I will create an issue on the IGV repo about the problem.
Since you mentioned modkit: I had also noticed that it ignores supplementary alignments, and my next step would have been to ask if it would be possible to instead ignore alignments with hard clipping, so that if I do use minimap2 -Y, modkit would count the supplementary alignments ?
@e-sollier at this stage in time, modkit supports only primary alignments, as described in the documentation. If you wish for this feature to be included, you'll have to open a new github issue here
You can still provide a softclipped aligned BAM file to the workflow though. It takes a bit more work, but you can proceed as follow:
samtools bam2fq -T 1 unmapped.bam | \
minimap2 -Y -y -t 16 -ax map-ont reference.mmi - | \
samtools sort -@3 > mapped.bam
Once these steps are done, the workflow will simply process your file without further processing of the reads and their tags.
Hi, thanks for the reply. I've created my own fork of the wf-human-variation to use soft-clipping for supplementary alignments. It does result in significantly larger bam files (from 161GB to 237GB for one test sample), so I agree that this should probably not be the default for most users, but perhaps adding it as an option could be useful for others.
Is your feature related to a problem?
Hello,
I have a problem with methylation calls and supplementary alignments. The MM tags require the full read sequence to be correctly interpreted. However, the default behavior for minimap2 is to use hard clipping for supplementary alignments. Consequently, the modified basecalls are not usable for supplementary aligments, since we do not have the full sequence information to decode the MM tag. Some downstream tools like IGV apparently do not check for hard clipping and completely misinterpret the methylation calls. See the issue that I initially created on the remora repo for more information: https://github.com/nanoporetech/remora/issues/131.
Describe the solution you'd like
Minimap2 offers a -Y option to use soft clipping instead of hard clipping for supplementary alignments, which would make the methylation calls for supplementary alignments usable. I think this option should be added to the align_and_qsFilter step: either expose the option as an argument, or simply make it the default, which I think would make sense.
Describe alternatives you've considered
I think the soft clipping for supplementary alignments would be best, but otherwise I think that if hard clipping is used, the MM and ML tags should be removed from the reads, because otherwise they might confuse downstream tools like IGV.
Additional context
Thanks in advance for your help!