bcgsc / ntEdit

✏️ Genome assembly polishing & SNV detection
GNU General Public License v3.0
64 stars 9 forks source link

VCF output inconsistencies #17

Closed warrenlr closed 3 years ago

warrenlr commented 3 years ago

@jowong4 inconsistencies observed:

1) the changes.tsv file does not always report the same information as the variants.vcf See below example on an E. coli run: tsv reports a single variant (G) at position: 20311 whereas the vcf reports G,A

2) DP info in vcf propagated throughout

whenever an alternate base is reported in the vcf, DP takes the form XX,XX indicating "depth" for both variant. But it seems the alternate depth is always displayed after that, even if no alternate bases are detected. This could be an error of variable scope when tracking alternate base depth count. Would need to be reset each time (see below example, positions 20732 and 21955).

3) Genotype (INTEGRATION) reported might be incorrect

Encountering an alternate base (2) appears to set off incorrect DP and Genotypes integration eg. positions 20732 and 21955 are inconsistent between vcf and tsv

Example E. coli run

/usr/bin/time -v -o v134snpNEWdatastruct.time ./ntedit_v1.3.4/ntEdit/ntedit -f copyIUPAC.fa -r solidBF_k30.bf -b nteditk30-iupac-v134dev-SNP -s 1
nteditk30-iupac-v134dev-SNP_changes.tsv
...
U00096.3_MG1655_k12     19557   G       A       10
U00096.3_MG1655_k12     19861   A       T       10      A       10
U00096.3_MG1655_k12     19934   G       A       10      G       10
U00096.3_MG1655_k12     20121   A       C       10      A       10
U00096.3_MG1655_k12     20311   C       G       10
U00096.3_MG1655_k12     20438   G       A       10      G       10
U00096.3_MG1655_k12     20515   C       A       10      C       10
U00096.3_MG1655_k12     20732   C       T       10
U00096.3_MG1655_k12     21955   A       G       10
...

nteditk30-iupac-v134dev-SNP_variants.vcf
...
U00096.3_MG1655_k12     19557   .       G       A       .       PASS    DP=10   GT      1/1
U00096.3_MG1655_k12     19861   .       A       T       .       PASS    DP=10,10        GT      0/1
U00096.3_MG1655_k12     19934   .       G       A       .       PASS    DP=10,10        GT      0/1
U00096.3_MG1655_k12     20121   .       A       C       .       PASS    DP=10,10        GT      0/1
U00096.3_MG1655_k12     20311   .       C       G,A     .       PASS    DP=10,10        GT      1/2
U00096.3_MG1655_k12     20438   .       G       A       .       PASS    DP=10,10        GT      0/1
U00096.3_MG1655_k12     20732   .       C       T       .       PASS    DP=10,10        GT      0/1
U00096.3_MG1655_k12     21955   .       A       G       .       PASS    DP=10,10        GT      0/1
...
lcoombe commented 3 years ago

Just wondering -- is the form DP=XX,XX conventional? I'm used to DP just being a single integer: Ex:

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
scaf820 32246   .       G       A,C     195     .       DP=241;VDB=0.732781;SGB=-0.693147;RPB=0.102066;MQB=0.571018;MQSB=0.457093;BQB=0.368693;MQ0F=0.0207469;AC=1,1;AN=2;DP4=9,12,160,46;MQ=53 GT:PL:DP:SP:ADF:ADR:AD  1/2:255,230,255,142,0,255:227:29:9,114,46:12,29,17:21,143,63
warrenlr commented 3 years ago

https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format

DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.

ntedit reports the number of qualifying kmers overlapping the base under scrutiny, for each alternate substitutions. Some will be allelic variants, but often are base variations detected due to repeats. At any rate, the XX,XX is meant to capture the coverage provided by each alternate base subs (unfortunately not the depth since ntedit does not use counting Bloom filters)

lcoombe commented 3 years ago

I see -- I guess I thought based on examples on my end and the documentation that DP= in the INFO field was conventionally an integer, but I could be wrong there. From the documentation above, they use AD to indicate depth at alternate alleles, and DP is the single summary integer:

At this site, the called genotype is GT = 0/1, which corresponds to a heterozygous genotype with alleles T/G. The confidence indicated by GQ = 99 is very good; there were a total of 33 informative reads at this site (DP=33), 18 of which supported the REF allele (=had the reference base) and 15 of which supported the ALT allele (=had the alternate base) (indicated by AD=18,15).

Not a big deal, just mentioning it if we want to stick closer to (my understanding of) the conventions. (even knowing that it isn't technically 'depth', but the ntEdit proxy for it)

warrenlr commented 3 years ago

that's an important distinction, thank you for flagging @lcoombe ! Your interpretation makes sense, and appears to be the intended meaning of those metrics.

so we should at minimum replace "DP" by "AD" @jowong4 and please let's look into the inconsistencies

And a general comment related to ntedit's vcf output: the difficulty remains; how to interpret the AD metric when reporting on kmers instead of reads