bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

MNP decomposition after vardict in ensemble calling #2149

Closed tb08 closed 6 years ago

tb08 commented 6 years ago

Hello, I am running the tumor-only pipeline with ensemble calling using 4 variant callers (freebayes, gatk-haplotype, varscan, vardict). I have set numpass=1. I have noticed that the final VCFs contain a number of MNPs (like TC > CG), but that these are called exclusively by vardict, while the other callers usually produce corresponding individual SNP calls at these positions (e.g. T > C and C > G). To my knowledge, of the three other callers only freebayes is able to directly output MNPs, but bcbio uses vcfallelicprimitives to decompose those right away. Should the same operation be performed on vardict calls so that the ensemble comparison is more complete? Thanks a lot for your help.
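To make the mismatch concrete, here is a minimal Python sketch of the decomposition being discussed. It is only an illustration of the idea, not how vcfallelicprimitives or bcbio implement it: the function name `decompose_mnp` and the tuple layout are hypothetical, and it only handles substitutions where REF and ALT have the same length.

```python
from typing import List, Tuple

# Hypothetical record layout for illustration: (chrom, 1-based pos, ref, alt)
Variant = Tuple[str, int, str, str]


def decompose_mnp(chrom: str, pos: int, ref: str, alt: str) -> List[Variant]:
    """Split an equal-length substitution into per-base SNPs.

    Bases that are identical in REF and ALT are skipped. Anything that is
    not an equal-length substitution (for example an indel) is returned
    unchanged, since it would need alignment-based handling.
    """
    if len(ref) != len(alt):
        return [(chrom, pos, ref, alt)]
    snps = [
        (chrom, pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]
    # If no base differs, keep the original record rather than dropping it.
    return snps or [(chrom, pos, ref, alt)]


# The TC > CG example from above, placed at an arbitrary position.
print(decompose_mnp("chr1", 100, "TC", "CG"))
```

With the vardict call flattened this way, its T > C and C > G records can line up with the per-base SNPs reported by the other callers. The sketch does not touch INFO or FORMAT fields, which is where a real tool has to decide how to carry metrics over to the split records.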

chapmanb commented 6 years ago

Thanks much for the feedback. You're right that VarDict sometimes produces biallelic MNPs. We've opted not to post-process these because VarDict is not as aggressive with MNPs as FreeBayes, and vcfallelicprimitives is imperfect in its conversion to individual SNP calls (it leaves some metrics incorrect).

However, we could normalize these prior to ensemble calling so we have flattened calls there. @vladsaveliev proposed normalizing multiallelic calls as well. Vlad, what do you think about adding MNP splitting as part of the multiallelic normalization? Would that fit?

Thanks much for this discussion and ideas.

vladsavelyev commented 6 years ago

Hi all,

I added normalization and splitting of multiallelic variants, and can definitely add vcfallelicprimitives to the pre-ensemble normalization step. My only concern is that SnpEff impact annotations for decomposed consecutive SNPs might not be as correct as those for an MNP. As a solution, we might merge SNPs back into MNPs after ensemble calling, and then rerun SnpEff. Other INFO annotations might also be of concern, so it could make sense to rerun vcfanno and prioritization as well. However, I'm not sure how deep we should dig into this: is this type of mutation frequent enough?

What are your suggestions regarding the annotation?

chapmanb commented 6 years ago

Vlad; Thanks so much for looking at this. I'd suggest leaving the post-ensemble calls as normalized instead of trying to merge them back together. Reducing the number of steps, and potential artifacts, is more important than the potential gain from snpEff annotations. Re-running effects on the normalized outputs of ensemble calling to synchronize impacts makes sense, but I don't think re-running prioritization is necessary since that already handles the inputs to the ensemble method. We'll run vcfanno again during the population database creation step, so this will catch any changes there.

Thanks again for looking at this and all the great discussion.

chapmanb commented 6 years ago

Thanks again for the suggestion. Vlad implemented this in #2169 and it's now available in the latest development version:

bcbio_nextgen.py upgrade -u development

Please let us know if you run into any problems during testing or have other feedback. Thanks again.

tb08 commented 6 years ago

Hello, thank you very much for implementing this so quickly. To answer @vladsaveliev's question about the frequency of these mutations: they make up about 2.3% of my unfiltered vardict variants, and this is on data from a panel of 100 genes. I agree that the downside is that the annotations resulting from these decompositions will not be correct, but that is already the case with the other variant callers today, so in the ensemble approach I guess decomposing them is still what makes the most sense.
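For anyone who wants to check this frequency on their own data, a rough Python sketch follows. It is an assumption-laden illustration, not the method used to get the 2.3% figure: `mnp_fraction` is a hypothetical helper, it treats any record whose REF and an ALT allele have the same length greater than one base as an MNP, and it reads plain tab-separated VCF text lines.

```python
from typing import Iterable


def mnp_fraction(vcf_lines: Iterable[str]) -> float:
    """Return the fraction of VCF records that look like MNPs.

    Working definition (an assumption for this sketch): a record is an MNP
    if REF is longer than one base and at least one ALT allele has exactly
    the same length as REF.
    """
    total = 0
    mnps = 0
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]               # REF column
        alts = fields[4].split(",")   # ALT column, possibly multiallelic
        total += 1
        if any(len(ref) > 1 and len(alt) == len(ref) for alt in alts):
            mnps += 1
    return mnps / total if total else 0.0


# Tiny made-up example: one MNP among four records.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tTC\tCG\t.\tPASS\t.",
    "chr1\t200\t.\tA\tG\t.\tPASS\t.",
    "chr1\t300\t.\tA\tAT\t.\tPASS\t.",
    "chr1\t400\t.\tGC\tG\t.\tPASS\t.",
]
print(mnp_fraction(example))
```

For a real file, the same function can be passed an open text handle (or a `gzip.open(path, "rt")` handle for compressed VCFs), since it only iterates over lines.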