bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Default gemini load fails in 1.0.7 germline variant2 GRCh37 for multisample projects because it uses a vcf file which is not annotated #2225

Closed naumenko-sa closed 6 years ago

naumenko-sa commented 6 years ago

Hello!

Thanks for the excellent pipeline.

In 1.0.7 the gemini load command fails (unnecessary details omitted):

[2018-01-17T22:43Z] gemini  load  --passonly --skip-gerp-bp  -v work/gemini/1031-freebayes-nomultiallelic.vcf.gz -t VEP --cores 5 --tempdir ... 1031-freebayes.db

It looks like this happens because it uses 1031-freebayes-nomultiallelic.vcf.gz, which is a soft link to 1031-freebayes-decompose.vcf.gz, and that file is not annotated.

Should it be using 1031-freebayes-nomultiallelic-annotated-gemini.vcf.gz instead?

Could you please fix that?

Sergey

chapmanb commented 6 years ago

Sergey; Sorry about the issue. The older gemini load doesn't rely on pre-annotating, so this should be doing the right thing. The confusion is that there is the old way (gemini load) and the new way, which uses vcfanno and vcf2db. If you add vcfanno: [gemini] to your configuration it will use the new approach. Alternatively, what error message are you getting from the run? Thanks much for the help debugging.
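For anyone who wants to script the switch to the new approach, here is a minimal sketch. It assumes the sample configuration has already been loaded into a Python dict with the usual `details` list and a per-sample `algorithm` section; the helper name `enable_vcfanno_gemini` and the example dict are made up for illustration and are not part of bcbio.

```python
import copy


def enable_vcfanno_gemini(config):
    """Return a copy of a bcbio-style config dict with vcfanno: [gemini] set.

    Assumption: samples live under config["details"] and each sample has an
    "algorithm" mapping. This helper is hypothetical, not a bcbio API.
    """
    updated = copy.deepcopy(config)
    for sample in updated.get("details", []):
        algorithm = sample.setdefault("algorithm", {})
        current = algorithm.get("vcfanno") or []
        if isinstance(current, str):
            # A single annotation name may be given as a bare string.
            current = [current]
        if "gemini" not in current:
            current = list(current) + ["gemini"]
        algorithm["vcfanno"] = current
    return updated


example = {
    "details": [
        {"description": "1031_CH0068", "algorithm": {"variantcaller": "freebayes"}},
    ]
}
print(enable_vcfanno_gemini(example)["details"][0]["algorithm"]["vcfanno"])
# prints ['gemini']
```

The input dict is left untouched, so the original and updated configurations can be compared before writing anything back to disk.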

naumenko-sa commented 6 years ago

Thanks Brad!

I was using the old way (default in GRCh37), and in 1.0.5 (and before), it first did VEP and then gemini load. In 1.0.7 with the same config it does:

vcfanno.go:241: annotated 248744 variants in 526.94 seconds (472.1 / second)
[2018-01-17T22:43Z] tabix index 1031-freebayes-nomultiallelic-annotated-gemini.vcf.gz
[2018-01-17T22:43Z] Create gemini database for 1031/work/gemini/1031-freebayes-nomultiallelic.vcf.gz : 1031_CH0068
[2018-01-17T22:45Z] CADD scores are being loaded (to skip use:--skip-cadd).
[2018-01-17T22:45Z] VEP: KeyError, did not find expected fields
ERROR: Check gemini docs for the recommended VCF annotation with VEP
http://gemini.readthedocs.org/en/latest/content/functional_annotation.html#stepwise-installation-and-usage-of-vep

So it annotates with vcfanno not VEP, but tries to load in the old way?

I've switched to the new way with vcfanno: [gemini] and it passed the loading step with vcf2db.

I'm fine with the new loader if it supports all VEP fields. Maybe it is worth mentioning in the docs that, without vcfanno: [gemini], GRCh37 is not working anymore.

Sergey

naumenko-sa commented 6 years ago

Other projects finished just fine with the old loading method. So something was wrong with one project. Closing.

naumenko-sa commented 6 years ago

I have this error in every multi-sample project, i.e. when you have more than one sample in a batch, the old (default) gemini loading does not work. Adding vcfanno: [gemini] helps.

naumenko-sa commented 6 years ago

There is a difference between tables generated by vcf2db and gemini load that influences the downstream parsing, I put it here: https://github.com/quinlan-lab/vcf2db/issues/36.

So, if possible, it would be useful to get old gemini load working for multisample projects.

chapmanb commented 6 years ago

Sergey; Thanks for all the details and digging into this problem. Would you be able to share the header and first few lines of a problem file that fails on gemini loading (1031/work/gemini/1031-freebayes-nomultiallelic.vcf.gz)? The error you're getting indicates that the VEP CSQ information tag is not present in the input file. I'm trying to see if that's the case, and then we could try to track down why you're missing these and where it got removed. We shouldn't be stripping this when decomposing, so I'm trying to assess why it's problematic. Thanks for the help debugging.
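A quick way to check for the CSQ tag without pasting the whole header is sketched below. It only assumes the file is a plain or gzip/bgzip-compressed VCF; the function names are illustrative and not taken from bcbio or GEMINI.

```python
import gzip


def vcf_header_lines(path):
    """Yield the header lines (those starting with '#') of a VCF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # first variant record reached, header is over
            yield line.rstrip("\n")


def has_info_field(path, field="CSQ"):
    """Return True if the header declares ##INFO=<ID=field,...>."""
    prefix = "##INFO=<ID=%s," % field
    return any(line.startswith(prefix) for line in vcf_header_lines(path))


# Example call (path taken from the log above):
# has_info_field("1031/work/gemini/1031-freebayes-nomultiallelic.vcf.gz")
```

If this returns False for the file handed to gemini load, the "did not find expected fields" KeyError is the expected outcome.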

naumenko-sa commented 6 years ago

Thanks Brad! It is a softlink 1031-freebayes-nomultiallelic.vcf.gz -> 1031-freebayes-decompose.vcf.gz
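To see which file a link in work/gemini really points to, a small standard-library sketch like the following can help. The helper name `describe_link` is made up for illustration.

```python
import os


def describe_link(path):
    """Return (is_symlink, resolved_path) for a file that may be a symlink."""
    return os.path.islink(path), os.path.realpath(path)


# Example call (path taken from the log above):
# is_link, target = describe_link("work/gemini/1031-freebayes-nomultiallelic.vcf.gz")
# print(is_link, os.path.basename(target))
```

`os.path.realpath` follows chains of links, so it reports the final file even if one symlink points to another.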

The header and first lines of 1031-freebayes-decompose.vcf.gz:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##INFO=<ID=AB,Number=1,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
##INFO=<ID=ABP,Number=1,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AO,Number=1,Type=Integer,Description="Count of full observations of this alternate haplotype.">
##INFO=<ID=CIGAR,Number=1,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
##INFO=<ID=DECOMPOSED,Number=0,Type=Flag,Description="The allele was parsed using vcfallelicprimitives.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
##INFO=<ID=DPRA,Number=1,Type=Float,Description="Alternate allele depth ratio.  Ratio between depth in samples with each called alternate allele and those without.">
##INFO=<ID=END,Number=1,Type=Integer,Description="Last position (inclusive) in gVCF output record.">
##INFO=<ID=EPP,Number=1,Type=Float,Description="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=EPPR,Number=1,Type=Float,Description="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=GTI,Number=1,Type=Integer,Description="Number of genotyping iterations required to reach convergence or bailout.">
##INFO=<ID=LEN,Number=1,Type=Integer,Description="allele length">
##INFO=<ID=MEANALT,Number=1,Type=Float,Description="Mean number of unique non-reference allele observations per sample with the corresponding alternate alleles.">
##INFO=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">
##INFO=<ID=MQM,Number=1,Type=Float,Description="Mean mapping quality of observed alternate alleles">
##INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=NUMALT,Number=1,Type=Integer,Description="Number of unique non-reference alleles in called genotypes at this position.">
##INFO=<ID=ODDS,Number=1,Type=Float,Description="The log odds ratio of the best genotype combination to the second-best.">
##INFO=<ID=OLD_VARIANT,Number=.,Type=String,Description="Original chr:pos:ref:alt encoding">
##INFO=<ID=PAIRED,Number=1,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
##INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
##INFO=<ID=PAO,Number=1,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=PQA,Number=1,Type=Float,Description="Alternate allele quality sum in phred for partial observations">
##INFO=<ID=PQR,Number=1,Type=Float,Description="Reference allele quality sum in phred for partial observations">
##INFO=<ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=QA,Number=1,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##INFO=<ID=RPL,Number=1,Type=Float,Description="Reads Placed Left: number of reads supporting the alternate balanced to the left (5') of the alternate allele">
##INFO=<ID=RPP,Number=1,Type=Float,Description="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RPPR,Number=1,Type=Float,Description="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RPR,Number=1,Type=Float,Description="Reads Placed Right: number of reads supporting the alternate balanced to the right (3') of the alternate allele">
##INFO=<ID=RUN,Number=1,Type=Integer,Description="Run length: the number of consecutive repeats of the alternate allele in the reference genome">
##INFO=<ID=SAF,Number=1,Type=Integer,Description="Number of alternate observations on the forward strand">
##INFO=<ID=SAP,Number=1,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=SAR,Number=1,Type=Integer,Description="Number of alternate observations on the reverse strand">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=TYPE,Number=1,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=rs_ids,Number=.,Type=String,Description="calculated by concat of overlapping values in field ID from /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/variation/dbsnp-150.vcf.gz">
##INFO=<ID=technology.illumina,Number=1,Type=Float,Description="Fraction of observations supporting the alternate observed in reads from illumina">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">
##FORMAT=<ID=AO,Number=1,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=QA,Number=1,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##bcftools_filterCommand=filter -i 'ALT="<*>" || QUAL > 5'; Date=Fri Jan 26 06:23:11 2018
##bcftools_filterVersion=1.6+htslib-1.6
##bcftools_viewCommand=view -s1031_CH0118,1031_CH0068,1031_CH0069 -a -; Date=Fri Jan 26 06:23:25 2018
##bcftools_viewVersion=1.6+htslib-1.6
##commandline="/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/bin/freebayes -f /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --genotype-qualities --strict-vcf --ploidy 2 --targets /hpf/largeprojects/ccmbio/naumenko/project_cheo/1031/work/freebayes/1/1031-1_0_15509225-regions.bed --min-repeat-entropy 1 --no-partial-observations -b /hpf/largeprojects/ccmbio/naumenko/project_cheo/1031/work/prealign/1031_CH0118/1031_CH0118.bam -b /hpf/largeprojects/ccmbio/naumenko/project_cheo/1031/work/prealign/1031_CH0068/1031_CH0068.bam -b /hpf/largeprojects/ccmbio/naumenko/project_cheo/1031/work/prealign/1031_CH0069/1031_CH0069.bam"
##fileDate=20180126
##phasing=none
##reference=/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa
##source=freeBayes v1.1.0-46-g8d2b3a0-dirty
##VEP="v91" time="2018-01-26 08:40:25" cache="/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/vep/homo_sapiens_merged/91_GRCh37" ensembl=91.18ee742 ensembl-io=91.923d668 ensembl-variation=91.c78d8b4 ensembl-funcgen=91.4681d69 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="20141103" HGMD-PUBLIC="20164" assembly="GRCh37.p13" dbSNP="150" gencode="GENCODE 19" genebuild="2011-04" gnomAD="170228" polyphen="2.2.2" refseq="01_2015" regbuild="1.0" sift="sift5.2.2"
##LoF=Loss-of-function annotation (HC = High Confidence; LC = Low Confidence)
##LoF_filter=Reason for LoF not being HC
##LoF_flags=Possible warning flags for LoF
##LoF_info=Info used for LoF annotation
##MaxEntScan_alt=MaxEntScan alternate sequence score
##MaxEntScan_diff=MaxEntScan score difference
##MaxEntScan_ref=MaxEntScan reference sequence score
##SpliceRegion=SpliceRegion predictions
##FILTER=<ID=FBQualDepth,Description="Set if true: (AF[0] <= 0.5 && (DP < 4 || (DP < 13 && %QUAL < 10))) || (AF[0] > 0.5 && (DP < 4 && %QUAL < 50))">
##bcftools_filterCommand=filter -O v -T /hpf/largeprojects/ccmbio/naumenko/project_cheo/1031/work/bedprep/1031_CH0118-variant_regions.quantized-vrsubset-callableblocks.bed.gz --soft-filter FBQualDepth -e '(AF[0] <= 0.5 && (DP < 4 || (DP < 13 && %QUAL < 10))) || (AF[0] > 0.5 && (DP < 4 && %QUAL < 50))' -m + /hpf/largeprojects/ccmbio/naumenko/project_cheo/1031/work/freebayes/1031-vepeffects-annotated.vcf.gz; Date=Fri Jan 26 09:10:21 2018
##bcftools_viewCommand=view -f PASS,.; Date=Fri Jan 26 09:11:09 2018
##bcftools_annotateVersion=1.6+htslib-1.6
##bcftools_annotateCommand=annotate -x INFO/CSQ,INFO/ANN; Date=Fri Jan 26 09:11:09 2018
##INFO=<ID=OLD_MULTIALLELIC,Number=1,Type=String,Description="Original chr:pos:ref:alt encoding">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1031_CH0118 1031_CH0068 1031_CH0069
1   10261   .   T   TA  10.3    PASS    AC=2;AF=0.333333;AN=6;LEN=1;NS=3;TYPE=ins;DECOMPOSED    GT:AD:AO:DP:GQ:PL:QA:QR:RO  0|0:2,0:0:2:2:0,6,29:0:55:2 0|0:1,0:0:1:2:0,3,15:0:19:1 1|1:0,2:2:2:4:52,6,0:55:0:0
1   10329   rs150969722 AC  A   17.6    PASS    AC=3;AF=0.5;AN=6;LEN=1;NS=3;TYPE=del;DECOMPOSED GT:AD:AO:DP:GQ:PL:QA:QR:RO  0|1:1,1:1:2:0:19,0,6:28:12:1    1|1:0,2:2:3:17:13,6,0:20:0:0    0|0:0,0:0:1:0:0,0,0:0:0:0
1   12783   rs62635284  G   A   362.3   PASS    AB=0.733333;ABP=17.1973;AC=4;AF=0.666667;AN=6;AO=29;CIGAR=1X;DP=38;DPB=38;DPRA=0;EPP=19.8579;EPPR=5.18177;GTI=1;LEN=1;MEANALT=1;MQM=25.5862;MQMR=25;NS=3;NUMALT=1;ODDS=4.39326;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=857;QR=216;RO=9;RPL=7;RPP=19.8579;RPPR=5.18177;RPR=22;RUN=1;SAF=29;SAP=65.983;SAR=0;SRF=9;SRP=22.5536;SRR=0;TYPE=snp;technology.illumina=1 GT:AD:AO:DP:GQ:PL:QA:QR:RO  0/1:3,9:9:12:0:159,0,25:269:70:3    0/1:5,13:13:18:0:228,0,36:378:121:5 1/1:1,7:7:8:48:131,1,0:210:25:1
1   13116   rs62635286  T   G   954.8   PASS    AC=3;AF=0.5;AN=6;LEN=1;NS=3;TYPE=snp;DECOMPOSED GT:AD:AO:DP:GQ:PL:QA:QR:RO  0|1:14,20:20:35:99:236,0,249:602:423:14 0|1:12,36:36:49:99:574,0,176:1097:376:12    0|1:22,27:27:49:99:414,0,418:845:659:22
1   13118   rs62028691  A   G   954.8   PASS    AC=3;AF=0.5;AN=6;LEN=1;NS=3;TYPE=snp;DECOMPOSED GT:AD:AO:DP:GQ:PL:QA:QR:RO  0|1:14,20:20:35:99:236,0,249:602:423:14 0|1:12,36:36:49:99:574,0,176:1097:376:12    0|1:22,27:27:49:99:414,0,418:845:659:22
1   13302   rs75241669  C   T   8.5 PASS    AB=0.428571;ABP=3.63072;AC=2;AF=0.333333;AN=6;AO=6;CIGAR=1X;DP=18;DPB=18;DPRA=1.75;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=23;MQMR=39.5833;NS=3;NUMALT=1;ODDS=0.851501;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=168;QR=286;RO=12;RPL=6;RPP=16.0391;RPPR=5.9056;RPR=0;RUN=1;SAF=3;SAP=3.0103;SAR=3;SRF=6;SRP=3.0103;SRR=6;TYPE=snp;technology.illumina=1    GT:AD:AO:DP:GQ:PL:QA:QR:RO  0/0:4,0:0:4:20:0,12,75:0:82:4   0/1:3,4:4:7:8:53,0,54:106:89:3  0/1:5,2:2:7:4:21,0,84:62:115:5
1   13656   .   CAG C   213.8   PASS    AC=3;AF=0.5;AN=6;LEN=2;NS=3;TYPE=del;DECOMPOSED GT:AD:AO:DP:GQ:PL:QA:QR:RO  0|1:12,11:11:23:88:98,0,193:246:379:12  0|1:4,12:12:16:6:113,0,19:263:122:4    0|1:6,12:12:18:58:132,0,75:284:193:6
1   131679  rs368243218 G   A   23.7    PASS    AB=0.5;ABP=3.0103;AC=3;AF=0.5;AN=6;AO=6;CIGAR=1X;DP=10;DPB=10;DPRA=4.5;EPP=16.0391;EPPR=5.18177;GTI=1;LEN=1;MEANALT=1;MQM=20.8333;MQMR=21.75;NS=3;NUMALT=1;ODDS=1.72124;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=191;QR=106;RO=4;RPL=1;RPP=8.80089;RPPR=11.6962;RPR=5;RUN=1;SAF=5;SAP=8.80089;SAR=1;SRF=3;SRP=5.18177;SRR=1;TYPE=snp;technology.illumina=1 GT:AD:AO:DP:GQ:PL:QA:QR:RO  0/0:1,0:0:1:0:0,3,22:0:32:1 0/1:3,3:3:6:1:44,0,38:95:74:3   1/1:0,3:3:3:18:53,9,0:96:0:0
1   133483  rs369820305 G   T   73.4    PASS    AB=0.5;ABP=3.0103;AC=2;AF=0.333333;AN=6;AO=8;CIGAR=1X;DP=28;DPB=28;DPRA=0.666667;EPP=3.0103;EPPR=6.91895;GTI=0;LEN=1;MEANALT=1;MQM=27.75;MQMR=23.6;NS=3;NUMALT=1;ODDS=7.62841;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=254;QR=556;RO=20;RPL=6;RPP=7.35324;RPPR=13.8677;RPR=2;RUN=1;SAF=4;SAP=3.0103;SAR=4;SRF=10;SRP=3.0103;SRR=10;TYPE=snp;technology.illumina=1    GT:AD:AO:DP:GQ:PL:QA:QR:RO  0/1:2,3:3:5:33:45,0,30:90:49:2  0/1:6,5:5:11:71:91,0,87:164:160:6   0/0:12,0:0:12:48:0,36,214:0:347:12
1   133558  rs867188770 C   T   52.3    PASS    AB=0.5;ABP=3.0103;AC=2;AF=0.333333;AN=6;AO=7;CIGAR=1X;DP=23;DPB=23;DPRA=0.777778;EPP=3.32051;EPPR=3.55317;GTI=0;LEN=1;MEANALT=1;MQM=28.2857;MQMR=22.375;NS=3;NUMALT=1;ODDS=7.84285;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=196;QR=430;RO=16;RPL=3;RPP=3.32051;RPPR=7.89611;RPR=4;RUN=1;SAF=4;SAP=3.32051;SAR=3;SRF=4;SRP=11.6962;SRR=12;TYPE=snp;technology.illumina=1    GT:AD:AO:DP:GQ:PL:QA:QR:RO  0/1:2,3:3:5:31:47,0,27:76:48:2  0/1:5,4:4:9:45:63,0,74:120:128:5    0/0:9,0:0:9:36:0,27,148:0:254:9
1   569492  rs6594033   T   C   130.3   PASS    AB=0;ABP=0;AC=4;AF=1;AN=4;AO=11;CIGAR=1X;DP=11;DPB=11;DPRA=0;EPP=12.6832;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=19.9091;MQMR=0;NS=3;NUMALT=1;ODDS=10.9477;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=363;QR=0;RO=0;RPL=11;RPP=26.8965;RPPR=0;RPR=0;RUN=1;SAF=2;SAP=12.6832;SAR=9;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.illumina=1  GT:AD:AO:DP:GQ:PL:QA:QR:RO  1/1:0,8:8:8:63:144,24,0:267:0:0 1/1:0,3:3:3:48:51,9,0:96:0:0    .:.:.:.:.:.:.:.:.

So it was indeed annotated with VEP, and the annotation was then removed?
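The header itself answers this: it has a `##VEP=` line and a later `##bcftools_annotateCommand=annotate -x INFO/CSQ,INFO/ANN` line. A rough sketch for pulling that out of any header automatically is below; the function name is made up, and the parsing assumes the command is recorded in the `command; Date=...` form seen above.

```python
import shlex


def stripped_info_fields(header_lines):
    """List INFO fields removed by `bcftools annotate -x`, per the header."""
    removed = []
    for line in header_lines:
        if not line.startswith("##bcftools_annotateCommand="):
            continue
        # Keep only the command part, dropping the trailing "; Date=..." stamp.
        command = line.split("=", 1)[1].split(";", 1)[0]
        tokens = shlex.split(command)
        for flag, value in zip(tokens, tokens[1:]):
            if flag in ("-x", "--remove"):
                removed.extend(
                    item[len("INFO/"):]
                    for item in value.split(",")
                    if item.startswith("INFO/")
                )
    return removed


header = [
    '##VEP="v91" time="2018-01-26 08:40:25"',
    "##bcftools_annotateCommand=annotate -x INFO/CSQ,INFO/ANN; Date=Fri Jan 26 09:11:09 2018",
]
print(any(line.startswith("##VEP=") for line in header), stripped_info_fields(header))
# prints True ['CSQ', 'ANN']
```

Seeing VEP in the header together with CSQ in the removed list means the file was annotated and then stripped, which matches what the header above shows.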

Sergey

chapmanb commented 6 years ago

Sergey; Thanks much for confirming the issue. I tracked down what I think is the underlying issue. If you also have ensemble calling enabled for this run, bcbio is creating decomposed VCFs without effects as input to that (to avoid re-annotation errors) and then inadvertently re-using those for the population creation later. I pushed a fix to avoid this by naming the effects-stripped and standard decomposition files separately. If you update, remove the problematic *-decomposed.vcf.gz files, and restart, it should hopefully correctly generate and use the right annotated files for GEMINI now. Thanks again for the report, and I hope this gets your analysis finished.

naumenko-sa commented 6 years ago

Thanks Brad for looking into this issue!

I've upgraded to the latest dev and removed the work/gemini dir. Unfortunately, it is still not working for me:

bcbio_nextgen.py --version
1.0.8a0

The same error, but now the link is:

1031-freebayes-nomultiallelic.vcf.gz -> 1031-freebayes-noeff-decompose.vcf.gz

So it is a naked vcf file as input to gemini load, and it fails to load, because it wants VEP annotation.

I need the old loading, because vcfanno/vcf2db in bcbio produces different field names in gemini.db, and some fields are missing compared to VEP/gemini load (https://github.com/quinlan-lab/vcf2db/issues/36).

I definitely would like to switch to the superfast vcf2db by default, but I need all fields. How would you suggest fixing this? I see that I can modify gemini.conf/gemini.lua in GRCh37/config/vcfanno.

For now I'm just reloading the gemini database after the bcbio run with gemini load using the VEP-annotated vcf, but I am looking for a smoother solution.
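The manual reload can be scripted roughly as follows. The flags are the ones from the failing command in the log at the top of this thread, minus `--tempdir`, whose value was elided there. The function name and both file names in the example are hypothetical placeholders.

```python
def gemini_load_command(vcf_path, db_path, cores=5):
    """Build the argument list for the old-style `gemini load` with VEP.

    Flags mirror the command logged by bcbio earlier in this thread;
    --tempdir is left out because its value was not shown.
    """
    return [
        "gemini", "load", "--passonly", "--skip-gerp-bp",
        "-v", vcf_path,
        "-t", "VEP",
        "--cores", str(cores),
        db_path,
    ]


# Hypothetical file names, for illustration only.
cmd = gemini_load_command("1031-vep-annotated.vcf.gz", "1031-freebayes.db")
print(" ".join(cmd))

# To actually run it (requires gemini on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if the paths contain spaces.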

SN

chapmanb commented 6 years ago

Sergey; Thanks for testing and reporting back, and sorry about the continued issues. Apologies, I should also have been using specific symlinks for the old and new effects files to avoid this problem. I pushed a fix that I think should finally handle this and let you cleanly create GEMINI databases with ensemble calls. Please let us know if you still run into any issues.

naumenko-sa commented 6 years ago

Thanks Brad. It works now. P.S. I had to upgrade the decorator package (conda update decorator) to run the latest code.