Closed warrenlr closed 3 years ago
Just wondering -- is the form DP=XX,XX
conventional? I'm used to DP
just being a single integer:
Ex:
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
scaf820 32246 . G A,C 195 . DP=241;VDB=0.732781;SGB=-0.693147;RPB=0.102066;MQB=0.571018;MQSB=0.457093;BQB=0.368693;MQ0F=0.0207469;AC=1,1;AN=2;DP4=9,12,160,46;MQ=53 GT:PL:DP:SP:ADF:ADR:AD 1/2:255,230,255,142,0,255:227:29:9,114,46:12,29,17:21,143,63
https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format
DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.
ntedit reports the number of qualifying kmers overlapping the base under scrutiny, for each alternate substitutions. Some will be allelic variants, but often are base variations detected due to repeats. At any rate, the XX,XX is meant to capture the coverage provided by each alternate base subs (unfortunately not the depth since ntedit does not use counting Bloom filters)
I see -- I guess I thought based on examples on my end and the documentation that DP=
in the INFO field was conventionally an integer, but I could be wrong there.
From the documentation above, they use AD
to indicate depth at alternate alleles, and DP
is the single summary integer:
At this site, the called genotype is GT = 0/1, which corresponds to a heterozygous genotype with alleles T/G. The confidence indicated by GQ = 99 is very good; there were a total of 33 informative reads at this site (DP=33), 18 of which supported the REF allele (=had the reference base) and 15 of which supported the ALT allele (=had the alternate base) (indicated by AD=18,15).
Not a big deal, just mentioning it if we want to stick closer to (my understanding of) the conventions. (even knowing that it isn't technically 'depth', but the ntEdit proxy for it)
that's an important distinction, thank you for flagging @lcoombe ! Your interpretation makes sense, and appears to be the intended meaning of those metrics.
so we should at minimum replace "DP" by "AD" @jowong4 and please let's look into the inconsistencies
And a general comment related to ntedit's vcf output: the difficulty remains; how to interpret the AD metric when reporting on kmers instead of reads
@jowong4 inconsistencies observed:
1) the changes.tsv file does not always report the same information as the variants.vcf See below example on an E. coli run: tsv reports a single variant (G) at position: 20311 whereas the vcf reports G,A
2) DP info in vcf propagated throughout
whenever an alternate base is reported in the vcf, DP takes the form XX,XX indicating "depth" for both variant. But it seems the alternate depth is always displayed after that, even if no alternate bases are detected. This could be an error of variable scope when tracking alternate base depth count. Would need to be reset each time (see below example, positions 20732 and 21955).
3) Genotype (INTEGRATION) reported might be incorrect
Encountering an alternate base (2) appears to set off incorrect DP and Genotypes integration eg. positions 20732 and 21955 are inconsistent between vcf and tsv
Example E. coli run