Open johnemajor opened 4 years ago
Hi, you can run duphold without a SNP VCF and just find changes in depth in your SVs. Your SNP VCF has:
##INFO=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">
which does not follow spec. It should be:
##INFO=<ID=AD,Number=R,Type=Integer,Description="allele depths">
My SNP VCFs actually have the AD field in the FORMAT section, not the INFO section. AD should be in the FORMAT section, as it could be used for a multi-sample VCF.
But I tried hacking my SNP VCF to have the INFO header you name above and moving the AD field to INFO, and I get the same error.
Shoot. I mean FORMAT, not INFO. What does your VCF have for AD in FORMAT?
oh, I see your AD is:
##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">
It should be Number=R, Type=Integer, and describe the depths for each allele.
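For anyone who wants to check their own header against the Number=R, Type=Integer requirement described above, here is a minimal sketch using only the Python standard library. The function names `parse_format_header` and `ad_header_ok` are made up for illustration, and this is not duphold's own check, just a quick way to see whether a header line declares AD the way the spec describes.

```python
import re

def parse_format_header(line):
    """Return the key=value pairs of a ##FORMAT=<...> header line as a dict."""
    m = re.match(r'##FORMAT=<(.*)>\s*$', line)
    if m is None:
        raise ValueError("not a FORMAT header line")
    # A value is either a double-quoted string (Description may hold commas)
    # or a run of characters up to the next comma.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', m.group(1))
    return {key: value.strip('"') for key, value in pairs}

def ad_header_ok(line):
    """True if the line declares FORMAT/AD with Number=R and Type=Integer."""
    fields = parse_format_header(line)
    return (
        fields.get("ID") == "AD"
        and fields.get("Number") == "R"
        and fields.get("Type") == "Integer"
    )

# The two header lines quoted in this thread.
octopus_style = '##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">'
spec_style = '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="allele depths">'

print(ad_header_ok(octopus_style))
print(ad_header_ok(spec_style))
```

Note that a passing header says nothing about the per-record values, which have to match the declaration as well.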
Actually, AD is specified as both an INFO field AND a FORMAT field. The INFO version is presumably the sum of all the FORMAT ones, although I don’t see why that’s particularly useful.
But, yes, mine is in FORMAT. And it should appear as :
The FORMAT one is the one that duphold uses, and yes, that is correct. However, you can't simply change the header; you'll have to adjust the values for every record as well (or use a VCF from another caller).
The variant FORMAT record AD fields contain integers. How would they need to be adjusted?
1 10583 . G A 201.31 RF;RF8.6 RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=1;AN=2;DP=65;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583:100:16:76:0.211:0.074:37:0.198:0:0.000:0.244:2.76:RF
The AD entry is a valid integer.
AD should have multiple values for each sample. For a bi-allelic variant, it should have 2 values: the first indicates the number of reads supporting the reference allele and the second indicates the number of reads supporting the alternate.
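The rule above (one AD value for the reference allele plus one per alternate allele) can be checked per record with a short sketch like the following. It assumes tab-separated VCF data lines; the function name `ad_values_per_sample` is made up for illustration, the first record is a trimmed version of the octopus record quoted in this thread (only GT, DP and AD kept), and the second record uses hypothetical depths purely to show the expected shape.

```python
def ad_values_per_sample(record):
    """For one tab-separated VCF data line, return a list with one
    (ad_values, has_expected_count) tuple per sample column."""
    cols = record.rstrip("\n").split("\t")
    alt_alleles = cols[4].split(",")
    format_keys = cols[8].split(":")
    if "AD" not in format_keys:
        raise ValueError("record has no AD key in FORMAT")
    ad_index = format_keys.index("AD")
    # Number=R means one value for REF plus one per ALT allele.
    expected = 1 + len(alt_alleles)
    results = []
    for sample in cols[9:]:
        values = sample.split(":")
        ad = values[ad_index].split(",") if ad_index < len(values) else []
        results.append((ad, len(ad) == expected))
    return results

# Trimmed from the record in this thread: AD holds a single value (16).
octopus_record = "1\t10583\t.\tG\tA\t201.31\tRF\tDP=65\tGT:DP:AD\t1|0:65:16"
# Hypothetical record showing the ref,alt layout (depths are invented).
spec_record = "1\t10583\t.\tG\tA\t201.31\tPASS\tDP=65\tGT:DP:AD\t1|0:65:49,16"

print(ad_values_per_sample(octopus_record))
print(ad_values_per_sample(spec_record))
```

A record that fails this check cannot be repaired by editing the header alone, since the reference-allele depth is simply not present in the single-value AD.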
I still suggest running duphold without the SNP VCF, as it's not needed and doesn't add much in most cases.
Well, it appears that just changing the header did the trick. Duphold is running now that I modified the snp.vcf to have
It won't give you correct results.
OK, will give it a shot. Thank you.
I'm running duphold v 0.2.1
The command I am running is: ~/install_crap/duphold/duphold -v ./results.vcf.gz -b /efs/WGS/data/WGS/ILMN_exptA/b37/KC_downsampledBAM_and_VCF/NA24385/40x/S7508/NA24385.40x.S7508.aligned.deduped.sort.bam -f /efs/WGS/data/reference/human/human_g1k_v37_modified.fasta/human_g1k_v37_modified.fasta -s ./x.vcf.gz -t 96 -o ./duphold.vcf
It spins for a while, then returns the error: "expected AD field in snps VCF"
The thing is, my VCF files have the AD field present. I created a small VCF to test this more directly, and here are the full contents of the x.vcf.gz:
##fileformat=VCFv4.3
##reference=human_g1k_v37_modified
##octopus=<version=0.6.3-beta_HEAD_5961a546,command="octopus --reference /home/kcibul/wgs_resources/data/reference/human/human_g1k_v37_modified.fasta/human_g1k_v37_modified.fasta --reads NA24385.40x.S7508.aligned.deduped.sort.bam -t regions.bed --forest-file /home/kcibul/wgs_resources/data/reference/forests/DC/germline.v0.7.0.forest -o NA24385.40x.S7508.octopus.tmp.vcf.gz --threads 192 -X 10000MB -B 38400 MB --sequence-error-model /home/kcibul/wgs_resources/data/reference/octopus_err_models/novaseq.4a38e55.model --annotations AD ADP AF ARF BQ CRF DP FRF GC GQ MC MF MQ MQ0 QD QUAL SB STR_LENGTH STR_PERIOD --max-indel-errors 32 --duplicate-read-detection-policy AGGRESSIVE --max-haplotypes=400 --min-forest-quality=8",options="--allow-marked-duplicates no --allow-octopus-duplicates no --allow-pileup-candidates-from-likely-misaligned-reads no --allow-qc-fails no --allow-reads-with-good-decoy-supplementary-alignments no --allow-reads-with-good-unplaced-or-unlocalized-supplementary-alignments no --allow-secondary-alignments no --allow-supplementary-alignments no --annotations[0] AD --annotations[1] ADP --annotations[2] AF --annotations[3] ARF --annotations[4] BQ --annotations[5] CRF --annotations[6] DP --annotations[7] FRF --annotations[8] GC --annotations[9] GQ --annotations[10] MC --annotations[11] MF --annotations[12] MQ --annotations[13] MQ0 --annotations[14] QD --annotations[15] QUAL --annotations[16] SB --annotations[17] STR_LENGTH --annotations[18] STR_PERIOD --assemble-all no --assembler-mask-base-quality 10 --backtrack-level NONE --bad-region-tolerance NORMAL --bamout-type MINI --caller population --consider-unmapped-reads no --contig-output-order REFERENCE_INDEX --contig-ploidies[0] Y=1 --contig-ploidies[1] chrY=1 --contig-ploidies[2] MT=1 --contig-ploidies[3] chrM=1 --denovo-filter-expression QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF \
I ran pyvcf on this VCF file, and the FORMAT AD field appears in it. I also ran duphold on VCF files generated by Dragen and Sentieon, and they both return the same error.
Any advice on how to fix this? I am eager to apply duphold to a clinical product I am developing, but this roadblock has me stuck.
Thanks! John Major