epi2me-labs / wf-human-variation

Other
96 stars 42 forks source link

Soft clipping for supplementary alignment: minimap2 -Y #115

Closed e-sollier closed 9 months ago

e-sollier commented 10 months ago

Is your feature related to a problem?

Hello,

I have a problem with methylation calls and supplementary alignments. The MM tags require the full read sequence to be correctly interpreted. However, the default behavior for minimap2 is to use hard clipping for supplementary alignments. Consequently, the modified basecalls are not usable for supplementary aligments, since we do not have the full sequence information to decode the MM tag. Some downstream tools like IGV apparently do not check for hard clipping and completely misinterpret the methylation calls. See the issue that I initially created on the remora repo for more information: https://github.com/nanoporetech/remora/issues/131.

Describe the solution you'd like

Minimap2 offers a -Y option to use soft clipping instead of hard clipping for supplementary alignments, which would make the methylation calls for supplementary alignments usable. I think this option should be added to the align_and_qsFilter step: either expose the option as an argument, or simply make it the default, which I think would make sense.

Describe alternatives you've considered

I think the soft clipping for supplementary alignments would be best, but otherwise I think that if hard clipping is used, the MM and ML tags should be removed from the reads, because otherwise they might confuse downstream tools like IGV.

Additional context

Thanks in advance for your help!

cjw85 commented 10 months ago

Hi @e-sollier,

As one of the people behind many of the discussions on the htslib Github Issues, I know well this issue. modkit (and its forerunner modbam2bed) ignores non-primary alignments when enumerating frequencies. This is largely OK, the primary issue is in the case of structural variation with alignment of a read becoming split into a primary and supplementary alignment. Secondary alignments are slightly different in that counting bases from them would have the major effect of leading to double counting.

I suspect enabling -Y would make the output files egregiously large for many users; if we were to implement a change here, the decision would be to perform these steps optionally. I think also that the considerations here are more general than this workflow; I'm thinking how much of this should be the responsibility of the underlying tools to resolve. For example:

Some downstream tools like IGV apparently do not check for hard clipping and completely misinterpret the methylation calls

is something that ought to be changed in IGV: even if we amend this workflow, IGV should still make allowance for the fact that the SEQ, CIGAR, and MM fields can become inconsistent after SAM records are (not)-handled by programs like minimap2.

e-sollier commented 10 months ago

Thanks for your reply!

I agree that secondary alignments are problematic and I am fine with them being ignored. However, I am interested in methylation around breakpoints so supplementary alignments are important to me. For each read that spans a breakpoint, there is one primary alignment on one side and one (shorter) supplementary alignment on the other side, so if we ignore supplementary alignments we essentially lose half of the methylation data around the breakpoint. Hence my suggestion to use soft clipping for supplementary alignments so that we can use the methylation information for them. But I agree that for most users supplementary alignments may not be very important, so if using soft clipping for supplementary alignments does substantially increase the bam file sizes it may not be best to use this as the default. I will test how much the minimap2 -Y option impacts the size of the bam files though, because considering that there are not that many supplementary alignments, I would not expect a very big increase.

The issue is indeed more general than this workflow. I will create an issue on the IGV repo about the problem.

Since you mentioned modkit: I had also noticed that it ignores supplementary alignments, and my next step would have been to ask if it would be possible to instead ignore alignments with hard clipping, so that if I do use minimap2 -Y, modkit would count the supplementary alignments ?

RenzoTale88 commented 9 months ago

@e-sollier at this stage in time, modkit supports only primary alignments, as described in the documentation. If you wish for this feature to be included, you'll have to open a new github issue here

You can still provide a softclipped aligned BAM file to the workflow though. It takes a bit more work, but you can proceed as follow:

  1. Basecall your data using wf-basecalling without providing a reference genome, producing an unmapped BAM file (uBAM).
  2. Align your uBAM manually using the command:
    samtools bam2fq -T 1 unmapped.bam | \
    minimap2 -Y -y -t 16 -ax map-ont reference.mmi - | \
    samtools sort -@3 > mapped.bam

Once these steps are done, the workflow will simply process your file without further processing of the reads and their tags.

e-sollier commented 9 months ago

Hi, thanks for the reply. I've created my own fork of the wf-human-variation to use soft-clipping for supplementary alignments. It does result in significantly larger bam files (from 161GB to 237GB for one test sample), so I agree that this should probably not be the default for most users, but perhaps adding it as an option could be useful for others.