brentp / duphold

don't get DUP'ed or DEL'ed by your putative SVs.
MIT License
101 stars 9 forks source link

Error running Duphold, "expected AD field in snps VCF" #30

Open johnemajor opened 4 years ago

johnemajor commented 4 years ago

I'm running duphold v 0.2.1

The command I am running is: ~/install_crap/duphold/duphold -v ./results.vcf.gz -b /efs/WGS/data/WGS/ILMN_exptA/b37/KC_downsampledBAM_and_VCF/NA24385/40x/S7508/NA24385.40x.S7508.aligned.deduped.sort.bam -f /efs/WGS/data/reference/human/human_g1k_v37_modified.fasta/human_g1k_v37_modified.fasta -s ./x.vcf.gz -t 96 -o ./duphold.vcf

It spins for a while, then returns the error: "expected AD field in snps VCF"

the thing is, my VCF files have the AD field present..... I created a small vcf to test this more directly, and here are the full contents of the x.vcf.gz

`##fileformat=VCFv4.3

reference=human_g1k_v37_modified

octopus=<version=0.6.3-beta_HEAD_5961a546,command="octopus --reference /home/kcibul/wgs_resources/data/reference/human/human_g1k_v37_modified.fasta\

/human_g1k_v37_modified.fasta --reads NA24385.40x.S7508.aligned.deduped.sort.bam -t regions.bed --forest-file /home/kcibul/wgs_resources/data/referen\ ce/forests/DC/germline.v0.7.0.forest -o NA24385.40x.S7508.octopus.tmp.vcf.gz --threads 192 -X 10000MB -B 38400 MB --sequence-error-model /home/kcibul\ /wgs_resources/data/reference/octopus_err_models/novaseq.4a38e55.model --annotations AD ADP AF ARF BQ CRF DP FRF GC GQ MC MF MQ MQ0 QD QUAL SB STR_LE\ NGTH STR_PERIOD --max-indel-errors 32 --duplicate-read-detection-policy AGGRESSIVE --max-haplotypes=400 --min-forest-quality=8",options="--allow-mark\ ed-duplicates no --allow-octopus-duplicates no --allow-pileup-candidates-from-likely-misaligned-reads no --allow-qc-fails no --allow-reads-with-good-\ decoy-supplementary-alignments no --allow-reads-with-good-unplaced-or-unlocalized-supplementary-alignments no --allow-secondary-alignments no --allow\ -supplementary-alignments no --annotations[0] AD --annotations[1] ADP --annotations[2] AF --annotations[3] ARF --annotations[4] BQ --annotations[5] C\ RF --annotations[6] DP --annotations[7] FRF --annotations[8] GC --annotations[9] GQ --annotations[10] MC --annotations[11] MF --annotations[12] MQ --\ annotations[13] MQ0 --annotations[14] QD --annotations[15] QUAL --annotations[16] SB --annotations[17] STR_LENGTH --annotations[18] STR_PERIOD --asse\ mble-all no --assembler-mask-base-quality 10 --backtrack-level NONE --bad-region-tolerance NORMAL --bamout-type MINI --caller population --consider-u\ nmapped-reads no --contig-output-order REFERENCE_INDEX --contig-ploidies[0] Y=1 --contig-ploidies[1] chrY=1 --contig-ploidies[2] MT=1 --contig-ploidi\ es[3] chrM=1 --denovo-filter-expression QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF \

0.5 | MP < 30 | MQ0 > 2 --denovo-indel-mutation-rate 1e-09 --denovo-snv-mutation-rate 1.3e-08 --denovos-only no --disable-adapter-masking no --disa\ ble-assembly-candidate-generator no --disable-call-filtering no --disable-denovo-variant-discovery no --disable-downsampling no --disable-inactive-fl\ ank-scoring no --disable-overlap-masking no --disable-pileup-candidate-generator no --disable-read-preprocessing no --disable-repeat-candidate-genera\ tor no --dont-model-mapping-quality no --dont-protect-reference-haplotype no --downsample-above 1000 --downsample-target 500 --dropout-concentration \ 2 --duplicate-read-detection-policy AGGRESSIVE --extension-level NORMAL --fallback-kmer-gap 10 --fast no --filter-expression QUAL < 10 | MQ < 10 | MP\ < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 --forest-file germline.v0.7.0.forest --good-base-quality 20 --haplotype-holdout-threshold 2500 --hap\ lotype-overflow 200000 --ignore-unmapped-contigs no --indel-heterozygosity 0.0001 --keep-unfiltered-calls no --kmer-sizes[0] 10 --kmer-sizes[1] 15 --\ kmer-sizes[2] 20 --lagging-level NORMAL --mask-3prime-shifted-soft-clipped-heads no --mask-inverted-soft-clipping no --mask-soft-clipped-bases no --m\ ask-soft-clipped-boundary-bases 2 --max-assemble-region-overlap 200 --max-bubbles 30 --max-clones 3 --max-genotypes 5000 --max-haplotypes 400 --max-h\ oldout-depth 20 --max-indel-errors 32 --max-joint-genotypes 1000000 --max-open-read-files 250 --max-phylogeny-size 3 --max-read-length 10000 --max-re\ ference-cache-footprint 10GB --max-region-to-assemble 600 --max-somatic-haplotypes 2 --max-variant-size 2000 --max-vb-seeds 12 --min-bubble-score 2 -\ -min-candidate-credible-vaf-probability 0.75 --min-clone-frequency 0.01 --min-credible-somatic-frequency 0.01 --min-denovo-posterior 3 --min-expected\ -somatic-frequency 0.03 --min-forest-quality 8 --min-good-bases 20 --min-kmer-prune 2 --min-mapping-quality 20 --min-phase-score 10 --min-pileup-base\ -quality 20 --min-protected-haplotype-posterior 100 --min-refcall-posterior 2 --min-somatic-posterior 0.5 --min-variant-posterior 1 --model-posterior\ UnknownType(N7octopus7options20ModelPosteriorPolicyE) --no-adapter-contaminated-reads no --no-reads-with-decoy-supplementary-alignments no --no-read\ s-with-distant-segments no --no-reads-with-unmapped-segments no --no-reads-with-unplaced-or-unlocalized-supplementary-alignments no --normal-contamin\ ation-risk LOW --num-fallback-kmers 10 --one-based-indexing no --organism-ploidy 2 --output NA24385.40x.S7508.octopus.tmp.vcf.gz --read-linkage PAIRE\ D --reads[0] NA24385.40x.S7508.aligned.deduped.sort.bam --refcall-block-merge-quality 10 --refcall-filter-expression QUAL < 2 | GQ < 20 | MQ < 10 | D\ P < 10 | MF > 0.2 --reference human_g1k_v37_modified.fasta --regions-file regions.bed --resolve-symlinks no --sequence-error-model /home/kcibul/wgs_r\ esources/data/reference/octopus_err_models/novaseq.4a38e55.model --sites-only no --snp-heterozygosity 0.001 --snp-heterozygosity-stdev 0.01 --somatic\ -credible-mass 0.9 --somatic-filter-expression QUAL < 2 | GQ < 20 | MQ < 30 | SMQ < 40 | SB > 0.9 | SD > 0.9 | BQ < 20 | DP < 3 | MF > 0.2 | NC > 1 |\ FRF > 0.5 --somatic-indel-mutation-rate 1e-06 --somatic-snv-mutation-rate 0.0001 --somatics-only no --split-long-reads no --target-read-buffer-footp\ rint 38.4kB --temp-directory-prefix octopus-temp --threads 192 --tumour-germline-concentration 1.5 --use-filtered-source-candidates no --use-independ\ ent-genotype-priors no --use-preprocessed-reads-for-filtering no --use-same-read-profile-for-all-samples no --use-uniform-genotype-priors no --varian\ t-discovery-mode ILLUMINA --very-fast no">

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FILTER=

FILTER=

FILTER=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA24385.40x.S7508

1 10583 . G A 201.31 PASS RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=1;A\ N=2;DP=65;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583:100:16:76:0.211:0.074:37:0.198:0:0.000\ :0.244:2.76:RF 1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 356.94 RF;RF8.6 RFQUAL_ALL=4.51;STR_PERIOD=5;STR_LENGTH=12;QD=15.519;MQ0=25;MQ=30.110\ ;GC=0.710;CRF=0.222;AC=2;AN=2;DP=23;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:23:42:10583:100:81:81:1.00\ 0:0.580:.:0.716:47:0.580:.:4.51:RF 1 12783 . G A 1271 RF;RF8.6 RFQUAL_ALL=7.23;STR_PERIOD=7;STR_LENGTH=15;QD=18.420;MQ0=353;MQ=15.787;GC=0.588;CRF=0\ .000;AC=2;AN=2;DP=69;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:69:24:12783:100:150:150:1.000:0.000:37:0.6\ 08:0:0.000:.:7.23:RF `

I ran pyvcf on this VCF file, and the FORMAT AD field appears in it. I ran duphold on VCF files generated by Dragen and Sentieon, they both return the same error.

Any advice for how to fix this? I am eager to apply duphold to a clinical product I am developing, but this roadblock has me blocked.

Thanks! John Major

brentp commented 4 years ago

hi, you can run duphold without a snp vcf and just find changes in depth in your SVs. your SNP vcf has:

##INFO=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">

which does not follow spec. It should be:

##INFO=<ID=AD,Number=R,Type=Integer,Description="allele depths">
johnemajor commented 4 years ago

My SNP VCFs actually have the AD field in the FORMAT section, not the INFO section. AD should be in the FORMAT section as it could be used for a multi-sample VCF.....

But, I tried hacking my snp VCF to have the INFO header you name above and moving the AD field to INFO, I get the same error.

brentp commented 4 years ago

shoot. I mean FORMAT, not INFO. what does your VCF have for AD in FORMAT?

brentp commented 4 years ago

oh, I see your AD is:

##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">

should be Number=R, type=Integer and describe the depths for each allele.

johnemajor commented 4 years ago

Actually, AD is specified as both an INFO field AND a FORMAT field. The INFO version is presumably the sum of all the FORMAT ones, although I don’t see why that’s particularly useful.

johnemajor commented 4 years ago

But, yes, mine is in FORMAT. And it should appear as :

FORMAT= . ?

brentp commented 4 years ago

the FORMAT is the one that duphold uses and yes, that is correct. However, you can't simply change the header, you'll have to adjust the values for every record as well (or use a VCF from another caller)

johnemajor commented 4 years ago

The variant FORMAT record AD fields contain integers.... how would they need to be adjusted?

1 10583 . G A 201.31 RF;RF8.6 RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=\ 1;AN=2;DP=65;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583: 100:16:76:0.211:0.074:37:0.198:0:0.000:0.244:2.7\ 6:RF

The AD entry is a valid integer.....

brentp commented 4 years ago

AD should have multiple values for each samples. for a bi-allelic variant, it should have 2 values, the first indicates the number of reads supporting the reference allele and the 2nd indicates the number of reads supporting the alternate.

I still suggest to run duphold without the snp vcf as it's not needed and doesn't add much for most cases.

johnemajor commented 4 years ago

Well, it appears that just changing the header did the trick. Duphold is running now that I modified the snp.vcf to have

FORMAT=

brentp commented 4 years ago

it won't give you correct results.

johnemajor commented 4 years ago

ok- will give it a shot Thank You