bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 353 forks source link

Ensemble VCF Variant Phasing and split_biallelic #2768

Closed doranlab closed 5 years ago

doranlab commented 5 years ago

Hi All,

We are trying to debug losing phased variants produced by mutect2 and vardict when we run the Callers in the order: [vardict, mutect2, gatk-haplotype].

Question 1 - phased variants We are seeing that when we have the variant callers in the above order, phased variants (MNVs, and delins variants) are not chosen above SNVs for the final ensemble vcf file. We do see that even if a vardict SNV has less support than a gatk variant for example, the vardict variant is chosen.

Question 2- split biallelic: We found that in bcbio/variants/normalize.py the "normalize" function sets split_biallelic=True by default and bcbio/variants/ensemble.py does not parse options to set this value to False. vrn_files = [normalize.normalize(f, data, passonly=passonly, rerun_effects=False, remove_oldeffects=True, nonrefonly=True, work_dir=utils.safe_makedir(os.path.join(base_dir, c)))

Example Case, ensemble only shows unphased gatk variant even though vardict was called first: sample-ensemble-annotated.vcf:2 212426699 . C A 235.0 PASS AC=1;AF=0.2763;CALLERS=vardict,mutect2,gatk-haplotype;DECOMPOSED;LEN=1;TYPE=snp GT:DP:VD:AD:AF:RD:ALD 0|1:257:71:183,71:0.2763:94,89:35,36 sample-ensemble-annotated.vcf:2 212426700 . C A 235.0 PASS AC=1;AF=0.2763;CALLERS=vardict,mutect2,gatk-haplotype;DECOMPOSED;LEN=1;TYPE=snp GT:DP:VD:AD:AF:RD:ALD 0|1:257:71:183,71:0.2763:94,89:35,36

sample-gatk-haplotype-annotated.vcf:2 212426699 . C A 1168.6 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-3.062;ClippingRankSum=-2.521;DP=207;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MMQ=60,60;MQ=60;MQ0=0;MQRankSum=0;QD=5.65;ReadPosRankSum=2.772;SOR=0.682;EPR=pass GT:AD:DP:GQ:PL 0/1:141,66:207:99:1176,0,5819 sample-gatk-haplotype-annotated.vcf:2 212426700 . C A 2382.6 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-2.969;ClippingRankSum=-2.318;DP=211;ExcessHet=3.0103;FS=0.536;MLEAC=1;MLEAF=0.5;MMQ=60,60;MQ=60;MQ0=0;MQRankSum=0;QD=11.29;ReadPosRankSum=2.694;SOR=0.752;EPR=pass GT:AD:DP:GQ:PL 0/1:142,69:211:99:2390,0,3841

sample-mutect2-annotated.vcf:2 212426699 . CC AA . PASS CONTQ=93;ClippingRankSum=-1.983;DP=141;ECNT=1;FS=4.592;GERMQ=70;MBQ=45,34;MFRL=134,127;MMQ=60,60;MPOS=44;MQ=60;MQ0=0;MQRankSum=0;POPAF=7.3;ReadPosRankSum=2.148;SAAF=0.273,0.242,0.281;SAPP=0.008253,0.028,0.964;TLOD=125.1;EPR=pass GT:AD:AF:DP:F1R2:F2R1 0/1:100,39:0.284:139:41,0:59,39

sample-vardict-annotated.vcf:2 212426699 . CC AA 235.0 PASS SAMPLE=NTP19-001706_S9_R1_001_val_1;TYPE=Complex;DP=257;VD=71;AF=0.2763;BIAS=2:2;REFBIAS=94:89;VARBIAS=35:36;PMEAN=36.8;PSTD=1;QUAL=38.3;QSTD=1;SBF=0.78145;ODDRATIO=1.086;MQ=60;SN=142;HIAF=0.2851;ADJAF=0.035;SHIFT3=0;MSI=2;MSILEN=1;NM=1.2;HICNT=71;HICOV=249;LSEQ=CTTGTGCTCGTGGACATACT;RSEQ=AACAGGCAGCCATGGGGCAT;DUPRATE=0;SPLITREAD=0;SPANPAIR=0;EPR=pass GT:DP:VD:AD:AF:RD:ALD 0/1:257:71:183,71:0.2763:94,89:35,36

In this post, another user found that the "passonly" was set as a default to False in this same line so that the "True" option could not be passed. Maybe a similar solution could be implemented for this? https://github.com/bcbio/bcbio-nextgen/issues/2196

Below is our yaml template file as well.

Thank you! Lauryn Keeler

Template for paired (tumor/normal) variant calling


details:

roryk commented 5 years ago

Hi Lauryn,

Thanks for the question. Regarding question 1) : It looks like it is doing the right thing-- vardict is first in the list so vardict calls will get prioritized over the others. Swapping the order will get GATK calls prioritized, etc. The variant in the ensembl VCF file is the vardict one but decomposed into two variants.

@chapmanb will have to chime in regarding 2) though.

doranlab commented 5 years ago

We would prefer to keep the phased variant and not decompose. Is there a current option to disable the decomposition?

Thank you! Lauryn

chapmanb commented 5 years ago

Lauryn and Rory; Thanks much for this question and discussion. For ensemble calling we don't have a way to disable decomposition and retain MNPs. The ensemble combining method is not intelligent enough to be able to group variants with different representations like the MNPs from VarDict and MuTect2 in your example versus the HaplotypeCaller single variant. Without decomposition, it would report both variants giving double calls at that region.

It's a much harder problem to try and handle this different representations generally and merge intelligently and the current ensemble implementation isn't set up to do that. Our best suggestion is to use individual caller outputs if you want to use non-decomposed inputs. Hope this helps.

amizeranschi commented 5 years ago

Hi @chapmanb,

I noticed you recommended here to use a single caller in order to retain non-decomposed MNPs. Based on my experience (as described in https://github.com/bcbio/bcbio-nextgen/issues/2825), running Freebayes as the single variantcaller still ends up splitting MNPs into multiple SNPs tagged as DECOMPOSED. This happens both with germline and with somatic (paired) VC.

Is there a way to retain the original MNPs?

chapmanb commented 5 years ago

Sorry for the confusion. Single caller output versus multiple ensemble output was referring to the splitting of multiallelic calls, which we do need for ensemble. We don't currently have an option for disabling MNP splitting. We had this in the past but FreeBayes output can look quite different than GATK or other callers with respect to variant output and it caused a lot of confusion in interpretation so we decided to focus on a single set of decomposed outputs as a help for using different methods within bcbio.

If someone wanted to revisit this we'd need to add a tools_off: [mnp_decomposition] option and factor out the post cleaning done to FreeBayes output:

https://github.com/bcbio/bcbio-nextgen/blob/bafc19aa6b391ae3c59abc01fe22e711453b0929/bcbio/variation/freebayes.py#L136

Thanks again for the discussion.

amizeranschi commented 5 years ago

I see, thanks for clarifying. In the end it's fine for me to leave things as they are, since I managed to get WhatsHap working with the output of Freebayes and also with an ensemble VCF file.