Closed doranlab closed 5 years ago
Hi Lauryn,
Thanks for the question. Regarding question 1) : It looks like it is doing the right thing-- vardict is first in the list so vardict calls will get prioritized over the others. Swapping the order will get GATK calls prioritized, etc. The variant in the ensembl VCF file is the vardict one but decomposed into two variants.
@chapmanb will have to chime in regarding 2) though.
We would prefer to keep the phased variant and not decompose. Is there a current option to disable the decomposition?
Thank you! Lauryn
Lauryn and Rory; Thanks much for this question and discussion. For ensemble calling we don't have a way to disable decomposition and retain MNPs. The ensemble combining method is not intelligent enough to be able to group variants with different representations like the MNPs from VarDict and MuTect2 in your example versus the HaplotypeCaller single variant. Without decomposition, it would report both variants giving double calls at that region.
It's a much harder problem to try and handle this different representations generally and merge intelligently and the current ensemble implementation isn't set up to do that. Our best suggestion is to use individual caller outputs if you want to use non-decomposed inputs. Hope this helps.
Hi @chapmanb,
I noticed you recommended here to use a single caller in order to retain non-decomposed MNPs. Based on my experience (as described in https://github.com/bcbio/bcbio-nextgen/issues/2825), running Freebayes as the single variantcaller still ends up splitting MNPs into multiple SNPs tagged as DECOMPOSED
. This happens both with germline and with somatic (paired) VC.
Is there a way to retain the original MNPs?
Sorry for the confusion. Single caller output versus multiple ensemble output was referring to the splitting of multiallelic calls, which we do need for ensemble. We don't currently have an option for disabling MNP splitting. We had this in the past but FreeBayes output can look quite different than GATK or other callers with respect to variant output and it caused a lot of confusion in interpretation so we decided to focus on a single set of decomposed outputs as a help for using different methods within bcbio.
If someone wanted to revisit this we'd need to add a tools_off: [mnp_decomposition]
option and factor out the post cleaning done to FreeBayes output:
Thanks again for the discussion.
I see, thanks for clarifying. In the end it's fine for me to leave things as they are, since I managed to get WhatsHap working with the output of Freebayes and also with an ensemble VCF file.
Hi All,
We are trying to debug losing phased variants produced by mutect2 and vardict when we run the Callers in the order: [vardict, mutect2, gatk-haplotype].
Question 1 - phased variants We are seeing that when we have the variant callers in the above order, phased variants (MNVs, and delins variants) are not chosen above SNVs for the final ensemble vcf file. We do see that even if a vardict SNV has less support than a gatk variant for example, the vardict variant is chosen.
Question 2- split biallelic: We found that in bcbio/variants/normalize.py the "normalize" function sets split_biallelic=True by default and bcbio/variants/ensemble.py does not parse options to set this value to False. vrn_files = [normalize.normalize(f, data, passonly=passonly, rerun_effects=False, remove_oldeffects=True, nonrefonly=True, work_dir=utils.safe_makedir(os.path.join(base_dir, c)))
Example Case, ensemble only shows unphased gatk variant even though vardict was called first: sample-ensemble-annotated.vcf:2 212426699 . C A 235.0 PASS AC=1;AF=0.2763;CALLERS=vardict,mutect2,gatk-haplotype;DECOMPOSED;LEN=1;TYPE=snp GT:DP:VD:AD:AF:RD:ALD 0|1:257:71:183,71:0.2763:94,89:35,36 sample-ensemble-annotated.vcf:2 212426700 . C A 235.0 PASS AC=1;AF=0.2763;CALLERS=vardict,mutect2,gatk-haplotype;DECOMPOSED;LEN=1;TYPE=snp GT:DP:VD:AD:AF:RD:ALD 0|1:257:71:183,71:0.2763:94,89:35,36
sample-gatk-haplotype-annotated.vcf:2 212426699 . C A 1168.6 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-3.062;ClippingRankSum=-2.521;DP=207;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MMQ=60,60;MQ=60;MQ0=0;MQRankSum=0;QD=5.65;ReadPosRankSum=2.772;SOR=0.682;EPR=pass GT:AD:DP:GQ:PL 0/1:141,66:207:99:1176,0,5819 sample-gatk-haplotype-annotated.vcf:2 212426700 . C A 2382.6 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-2.969;ClippingRankSum=-2.318;DP=211;ExcessHet=3.0103;FS=0.536;MLEAC=1;MLEAF=0.5;MMQ=60,60;MQ=60;MQ0=0;MQRankSum=0;QD=11.29;ReadPosRankSum=2.694;SOR=0.752;EPR=pass GT:AD:DP:GQ:PL 0/1:142,69:211:99:2390,0,3841
sample-mutect2-annotated.vcf:2 212426699 . CC AA . PASS CONTQ=93;ClippingRankSum=-1.983;DP=141;ECNT=1;FS=4.592;GERMQ=70;MBQ=45,34;MFRL=134,127;MMQ=60,60;MPOS=44;MQ=60;MQ0=0;MQRankSum=0;POPAF=7.3;ReadPosRankSum=2.148;SAAF=0.273,0.242,0.281;SAPP=0.008253,0.028,0.964;TLOD=125.1;EPR=pass GT:AD:AF:DP:F1R2:F2R1 0/1:100,39:0.284:139:41,0:59,39
sample-vardict-annotated.vcf:2 212426699 . CC AA 235.0 PASS SAMPLE=NTP19-001706_S9_R1_001_val_1;TYPE=Complex;DP=257;VD=71;AF=0.2763;BIAS=2:2;REFBIAS=94:89;VARBIAS=35:36;PMEAN=36.8;PSTD=1;QUAL=38.3;QSTD=1;SBF=0.78145;ODDRATIO=1.086;MQ=60;SN=142;HIAF=0.2851;ADJAF=0.035;SHIFT3=0;MSI=2;MSILEN=1;NM=1.2;HICNT=71;HICOV=249;LSEQ=CTTGTGCTCGTGGACATACT;RSEQ=AACAGGCAGCCATGGGGCAT;DUPRATE=0;SPLITREAD=0;SPANPAIR=0;EPR=pass GT:DP:VD:AD:AF:RD:ALD 0/1:257:71:183,71:0.2763:94,89:35,36
In this post, another user found that the "passonly" was set as a default to False in this same line so that the "True" option could not be passed. Maybe a similar solution could be implemented for this? https://github.com/bcbio/bcbio-nextgen/issues/2196
Below is our yaml template file as well.
Thank you! Lauryn Keeler
Template for paired (tumor/normal) variant calling
details:
analysis: variant2 genome_build: GRCh37 metadata: batch: your-batch-name phenotype: tumor sex: algorithm: aligner: bwa background: /isilon/prod/apps/bcbio-data/genomes/Hsapiens/GRCh37/variation/1000G_omni2.5.vcf ensemble: numpass: 1 use_filtered: false mark_duplicates: true indelcaller: false
min_allele_fraction: 1 remove_lcr: true umi_type: fastq_name variant_regions: true variantcaller: [vardict,mutect2,gatk-haplotype]
coverage_interval: amplicon effects: false