Illumina / DRAGMAP

DRAGEN open-source mapper

Missing ALT aware support #3

Closed MikeMallard closed 3 years ago

MikeMallard commented 3 years ago

I thought the benefit of the DRAGMAP aligner was supposed to be its superior handling of ALT contigs in GRCh38 (compared to bwa mem). But the --ht-alt-liftover option does not appear to be supported. I find "config->altLiftover" in the source code but no way to configure or enable it from the command line. Can anyone clarify whether GRCh38 ALT-awareness is supported in the current release? If yes, how do I use it? If no, is it planned, and for when? Thanks.

MikeMallard commented 3 years ago

Hmmm. Very disappointing. Over a year ago it was advertised that the new DRAGEN-GATK pipeline would be made openly available. Here we are today with a crippled version and no response on whether the advertised GRCh38 support with ALTs is even planned to be released and supported.

rizkg commented 3 years ago

Hi, really sorry about the late answer on this subject. As you noticed, ALT awareness is not supported through the liftover framework. It is, however, supported through ALT masking. You can use the --ht-mask-bed option with a BED file to generate an ALT-masked hash table, for example with:

dragen-os --build-hash-table true --ht-reference hg38.fa --output-directory /home/data/reference/ --output-file-prefix=dragmap.hg38_alt_masked --ht-mask-bed=fasta_mask/hg38_alt_mask.bed

We recently added the BED file you can use for hg38 under the fasta_mask/ folder. This will generate exactly the same masked hash table as the hardware version of DRAGEN would generate using the same options.

MikeMallard commented 3 years ago

Thank you for your response! Forgive my wording above. It appeared this was purposely never going to get a response.

Thanks for the comment about masking ALTs, but unless I'm misunderstanding, ALT masking will give the same result as using a reference without ALTs. That is, it only saves me from having to create or use a reference without the ALTs, meaning there is still no way to get the advertised benefits of this DRAGEN mapper when ALTs are present. So in terms of reducing false positives or false negatives in the final VCF, there is no reason to use this mapper over the existing bwa mem (maybe there is a runtime difference, but I didn't understand that to be the goal, since superior runtime is already achieved using the FPGA hardware-accelerated version).

Is there a plan to add ALT-aware performance some day in the future?

MikeRuehle commented 3 years ago

Hi, I am the author of the FPGA DRAGEN mapper. Thanks for your very sensible inquiry.

The ALT-masked hash table support is not the same as using the reference without ALTs. Only strategic portions of the ALT contigs get masked. Segments which are very similar to the primary assembly are masked, so they do not compete and steal alignments or squash MAPQs. Segments which are quite different are left unmasked, functioning essentially as decoy sequences. Marginal regions have masked or unmasked status assigned by empirical impact on mapping accuracy.
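One way to see for yourself that only portions of the ALT contigs are masked is to measure how much of each contig the mask BED covers. The following is a minimal Python sketch, not part of DRAGMAP. It assumes the mask is a standard BED file (0-based, half-open intervals) and that you have a FASTA index (.fai, e.g. from samtools faidx) for contig lengths. The function name masked_fraction and the example path hg38.fa.fai are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def masked_fraction(bed_lines: Iterable[str],
                    fai_lines: Iterable[str]) -> Dict[str, float]:
    """Return, per contig named in the BED, the fraction of bases covered."""
    # .fai columns: name, length, offset, linebases, linewidth (tab-separated)
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])

    # Collect BED intervals per contig; skip blank and header-style lines.
    intervals = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))

    fractions = {}
    for chrom, spans in intervals.items():
        if chrom not in lengths:
            continue  # contig missing from the index; cannot normalise
        # Merge overlapping or adjacent intervals so no base is counted twice.
        spans.sort()
        covered = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start > cur_end:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        covered += cur_end - cur_start
        fractions[chrom] = covered / lengths[chrom]
    return fractions


# Hypothetical usage (paths are assumptions about your local layout):
# with open("fasta_mask/hg38_alt_mask.bed") as bed, open("hg38.fa.fai") as fai:
#     for contig, frac in sorted(masked_fraction(bed, fai).items()):
#         print(f"{contig}\t{frac:.3f}")
```

A fraction well below 1.0 for a given ALT contig would be consistent with the partial masking described here; contigs that do not appear in the BED at all are simply absent from the result.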

This mask-based ALT awareness is not unique to software DRAGMAP, it is our recommended ALT-awareness strategy going forward with hardware DRAGEN too. It is slightly superior in overall accuracy compared to liftover-based ALT-awareness, end-to-end through small variant calling. It is true that improved accuracy using hg38 ALTs was advertised as a primary benefit of DRAGEN mapping vs. BWA, and this ALT-aware accuracy advantage is still present, even a bit greater. Sorry for the confusion arising from our switching ALT-awareness strategies near the same time as the DRAGMAP release. I also acknowledge the irony of singing the benefits of liftover-based ALT awareness, then switching to a mask-based system as even better. But the bottom line is that DRAGMAP should indeed allow you to use hg38 with ALT haplotypes and get the accuracy benefits of ALTs without the negative consequences.

If you want a little more inside baseball, while we achieved great accuracy improvements with liftover-based ALT awareness, it had its own stubborn issues. Mainly, you cannot everywhere trust liftover alignments of 5Mbp sequences. There are many places where the "correct" or most useful liftover between a long ALT haplotype and the primary assembly is amazingly ambiguous. From time to time, we would discover another place where bad liftover caused mapping and VC issues, which tended to be local but severe. It is a painstaking process to diagnose such issues and determine the proper liftover patch. By contrast, ALT masks are much easier to define and maintain. We're likely to refine the masks further over time, but they've already surpassed liftover-based performance.

You may ask, what about the concept of liftover actively guiding reads which match ALT haplotypes to the correct primary assembly positions, which would otherwise have been ambiguous? This theoretical benefit of liftover-assisted mapping just turns out to be pretty tiny with the standard hg38 ALT haps. Almost all reads which match an hg38 ALT best and should map to the liftover position also map perfectly well to the liftover position without any help or hindrance from the ALT. The hg38 ALTs provide substantially different population sequences occurring in some regions, but almost by definition, these "substantially different" ALTs aren't designed to disambiguate mapping among highly similar reference regions. The chief mapping benefit of hg38 ALTs is a different one: including the ALT portions (unmasked) which are quite dissimilar from any liftover position (often none or ambiguous) prevents reads matching these dissimilar ALT segments from mismapping elsewhere in the reference and generating false-positive variant calls. Only, we need a way to enable that important decoy action without the more-similar ALT segments interfering with mapping, competing with their similar primary regions. Previously, we handled the more-similar ALT segments by lifting them to primary regions where they wouldn't compete, but there were inherent dangers that we could lift inadvisably and cause new problems. Now, we don't have to choose which of N similar primary segments to lift to; we just mask off these more-similar ALT regions so they don't compete anywhere.

It IS possible to use liftover guidance to usefully disambiguate mapping -- not using the hg38 ALTs, but using carefully chosen population haplotype segments which do usefully distinguish among homologous regions. This is an area of busy research for us, and the fruits appear as our "graph" references, which indeed significantly improve mapping accuracy in difficult regions. Mask-based hg38 ALT-awareness also plays better with graph references, keeping out of the way to allow graph-path liftover to guide mapping without interference. Unfortunately, we currently support graph references only with hardware-accelerated DRAGEN. I can't advise you on graph mapping support appearing in software DRAGMAP. But with our various cloud offerings, it is pretty easy to use or try hardware DRAGEN without onsite hardware.

MikeMallard commented 3 years ago

Thank you! Thank you! Thank you! This all makes very good sense now. Thanks for explaining.