DNPs are improperly annotated as SNPs

genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.

MIT License


DNPs are improperly annotated as SNPs #777

Open zlskidmore opened 5 years ago

zlskidmore commented 5 years ago

Leaving this reminder to myself,

In the workflows I've run, I've encountered multiple cases where two adjacent SNPs are annotated separately. In reality they are DNPs, and therefore should have different consequence annotations than what is reported.
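To illustrate why the consequence differs, here is a minimal sketch using a made-up reference codon (`GGA`) and a codon table trimmed to the four codons the example needs. The helper name `apply_substitution` is hypothetical and not part of any workflow tool.

```python
# Trimmed codon table: only the codons used in this illustration.
CODON_TABLE = {"GGA": "Gly", "AGA": "Arg", "GAA": "Glu", "AAA": "Lys"}


def apply_substitution(codon, offset, ref, alt):
    """Replace `ref` with `alt` in `codon`, starting at 0-based `offset`."""
    if codon[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the codon")
    return codon[:offset] + alt + codon[offset + len(ref):]


ref_codon = "GGA"  # assumed reference codon, chosen for illustration

# Annotating each SNP on its own gives two unrelated missense calls.
snp1_codon = apply_substitution(ref_codon, 0, "G", "A")
snp2_codon = apply_substitution(ref_codon, 1, "G", "A")
separate = [CODON_TABLE[snp1_codon], CODON_TABLE[snp2_codon]]

# Annotating both changes together (the DNP) gives a third amino acid.
dnp_codon = apply_substitution(ref_codon, 0, "GG", "AA")
combined = CODON_TABLE[dnp_codon]

print("separate SNPs:", separate)
print("as one DNP:   ", combined)
```

Neither of the per-SNP amino acids matches the one produced when both bases change on the same haplotype, which is the annotation error described here.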

There is a tool that takes care of this: https://github.com/hubentu/MAC (see also https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4521406/)

We should incorporate something like this in the workflows.

chrisamiller commented 4 years ago

This is becoming more important for clinical applications, including cancer vaccine design, where manual review has caught several MNPs annotated as separate SNPs, requiring reannotation and reevaluation of peptides/binding. We should add a step that attempts to resolve them. Does anyone know of a tool that does this? Perhaps GATK HaplotypeCaller could be used for this?

malachig commented 4 years ago

The proximal variants functionality of pvactools will fix the peptides that come out. But I agree that it would be nice if they were just represented properly in the VCF.

m-two commented 4 years ago

Qingsong Gao, Ph.D., who was in Li’s lab and moved on to St. Jude, had a solution to merge adjacent SNPs into DNPs, MNPs, ONPs, etc. if they were in phase. It was called COCOONS, but I don’t see the code in their repo anymore. I think it was merged into their somatic wrapper https://github.com/ding-lab/TinDaisy; look here: https://github.com/ding-lab/mnp_filter
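The general idea of merging adjacent in-phase SNPs can be sketched as follows. This is not the COCOONS or mnp_filter implementation. The `Variant` class and the `phase_id` field are hypothetical, and the sketch assumes that two records sharing a `phase_id` sit on the same haplotype, which a real tool would have to establish from the VCF phasing fields or from the reads.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    chrom: str
    pos: int                        # 1-based position of the first ref base
    ref: str
    alt: str
    phase_id: Optional[str] = None  # hypothetical haplotype label; None = unphased


def merge_phased_neighbors(variants: List[Variant]) -> List[Variant]:
    """Merge directly adjacent substitutions that share a phase_id."""
    merged: List[Variant] = []
    for v in sorted(variants, key=lambda x: (x.chrom, x.pos)):
        prev = merged[-1] if merged else None
        can_merge = (
            prev is not None
            and prev.phase_id is not None
            and prev.phase_id == v.phase_id
            and prev.chrom == v.chrom
            and prev.pos + len(prev.ref) == v.pos   # no gap between them
            and len(prev.ref) == len(prev.alt)      # substitutions only
            and len(v.ref) == len(v.alt)
        )
        if can_merge:
            merged[-1] = Variant(prev.chrom, prev.pos,
                                 prev.ref + v.ref, prev.alt + v.alt,
                                 prev.phase_id)
        else:
            merged.append(v)
    return merged


phased = merge_phased_neighbors([
    Variant("chr17", 7673787, "G", "A", "hap1"),
    Variant("chr17", 7673788, "G", "A", "hap1"),
])
unphased = merge_phased_neighbors([
    Variant("chr17", 7673787, "G", "A"),
    Variant("chr17", 7673788, "G", "A"),
])
```

Because each merged record is compared against the next neighbor, a run of three or more adjacent in-phase SNPs collapses into a single MNP, while unphased neighbors are left untouched.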

malachig commented 4 years ago

Another option to fix VCFs with phasing information (as is available from Mutect2): https://github.com/Sentieon/sentieon-scripts/tree/master/merge_mnp

It may also be possible to tackle this problem at the annotation stage:

bcftools csq (haplotype aware consequence caller) works with VCFs with phased information represented in a particular way.
MAC (Multi-nucleotide Variant Annotation Corrector) uses both the VCF and BAM to correct for MNVs.
MACARON (Multi-bAse Codon Association variant ReannotatiON) uses the VCF and BAM to re-annotate VCFs with corrected MNVs.

chrisamiller commented 3 years ago

Revisiting this issue today. In the pipeline's current state, we get duplicate calls, with the DNPs coming from Mutect, and SNPs coming from another caller. This is objectively wrong and we need to handle this merging properly.

chr17   7673787 G   A
chr17   7673787 GG  AA
chr17   7673788 G   A
chr17   7675993 C   T
chr17   7675993 CC  TT
chr17   7675994 C   T
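One possible way to handle the duplicates above, as a rough sketch only: drop any SNP whose base change is already contained in a same-length multi-base substitution from another caller. The function name `drop_subsumed_snps` is hypothetical, and the sketch assumes the multi-base call is correct (that is, the component SNPs really are in phase), which a production step would need to verify.

```python
def drop_subsumed_snps(calls):
    """Remove SNPs already represented inside a DNP/MNP call.

    `calls` is a list of (chrom, pos, ref, alt) tuples with 1-based positions.
    """
    covered = set()
    for chrom, pos, ref, alt in calls:
        # Only same-length, multi-base substitutions count as DNPs/MNPs here.
        if len(ref) == len(alt) and len(ref) > 1:
            for i, (r, a) in enumerate(zip(ref, alt)):
                if r != a:
                    covered.add((chrom, pos + i, r, a))
    return [c for c in calls if c not in covered]


calls = [
    ("chr17", 7673787, "G", "A"),
    ("chr17", 7673787, "GG", "AA"),
    ("chr17", 7673788, "G", "A"),
    ("chr17", 7675993, "C", "T"),
    ("chr17", 7675993, "CC", "TT"),
    ("chr17", 7675994, "C", "T"),
]
deduped = drop_subsumed_snps(calls)
for call in deduped:
    print(*call, sep="\t")
```

On the six records from this comment, only the two DNP calls remain.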

susannasiebert commented 1 year ago

One thing to be aware of is that when these DNPs are merged, bam-readcount will no longer be able to process them.

malachig commented 1 year ago

Does bam-readcount not have the ability to count such things under its in/del support? A DNP is essentially a delins. Maybe bam-readcount doesn't support those either...

chrisamiller commented 1 year ago

It does not, and even multi-bp indels are kind of iffy with bam-readcount, given that it's not doing any local realignment or anything, as the variant callers often do. For years, we've discussed possible alternatives to bam-readcount, such as prioritizing the values from one caller or another in different situations, but it's one of those seemingly simple things that gets kind of complicated down in the weeds, and no one has had the bandwidth to implement something.