TimD1 / vcfdist

vcfdist: Accurately benchmarking phased variant calls
GNU General Public License v3.0
76 stars 7 forks source link

Inconsistent classification of same variant #29

Closed mbhall88 closed 4 months ago

mbhall88 commented 4 months ago

Me again 😬

I have a curious case where the same variants from two different variant callers gives a TP for one and a FP for the other...

Here is the variant (with context) from Clair3 with the vcfdist assessments

chromosome      3462134 .       G       C       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:10999:5:0:0:.:.  1:TP:1.000000:1:0:gm:24:10999:5:0:0:0:0
chromosome      3462135 .       C       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:10999:4:0:0:.:.  1:TP:1.000000:1:0:gm:16:10999:4:0:0:0:0
chromosome      3462136 .       C       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:10999:3:0:0:.:.  1:TP:1.000000:1:0:gm:23:10999:3:0:0:0:0
chromosome      3462139 .       G       GGT     .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  .:.:.:.:.:.:.:10999:.:.:0:.:.   1:TP:1.000000:5:0:gm:17:10999:2:0:0:0:0
chromosome      3462140 .       T       G       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:.  .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome      3462141 .       A       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:.  .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome      3462142 .       T       A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  .:.:.:.:.:.:.:10999:.:.:0:.:.   1:TP:1.000000:5:0:gm:16:10999:2:0:0:0:0
chromosome      3462143 .       T       A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:.  .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome      3462144 .       ACC     A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  .:.:.:.:.:.:.:10999:.:.:0:.:.   1:TP:1.000000:5:0:gm:25:10999:2:0:0:0:0
chromosome      3462145 .       C       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:.  .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome      3462146 .       C       A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:.  .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome      3462151 .       A       G       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:10999:1:0:0:.:.  1:TP:1.000000:1:0:gm:16:10999:1:0:0:0:0

and the same variants from DeepVariant

chromosome      3462134 .       G       C       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:11000:5:0:0:.:.  1:TP:1.000000:1:0:gm:20:11000:5:0:0:0:0
chromosome      3462135 .       C       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:11000:4:0:0:.:.  1:TP:1.000000:1:0:gm:12:11000:4:0:0:0:0
chromosome      3462136 .       C       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:11000:3:0:0:.:.  1:TP:1.000000:1:0:gm:11:11000:3:0:0:0:0
chromosome      3462139 .       G       GGT     .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  .:.:.:.:.:.:.:11000:.:.:0:.:.   1:FP:0.600000:5:2:lm:3:11000:2:0:0:0:0
chromosome      3462140 .       T       G       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:.  .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome      3462141 .       A       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:.  .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome      3462142 .       T       A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  .:.:.:.:.:.:.:11000:.:.:0:.:.   1:FP:0.600000:5:2:lm:22:11000:2:0:0:0:0
chromosome      3462143 .       T       A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:.  .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome      3462145 .       C       T       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:.  .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome      3462146 .       C       A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:.  .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome      3462151 .       A       G       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  1:TP:1.000000:1:0:gm:0:11000:1:0:0:.:.  1:TP:1.000000:1:0:gm:14:11000:1:0:0:0:0

the truth variants for this region are

chromosome      3462134 2abfe98a        G       C       .       PASS    .       GT      1
chromosome      3462135 75aed307        C       T       .       PASS    .       GT      1
chromosome      3462136 9c7391b9        C       T       .       PASS    .       GT      1
chromosome      3462140 60d3b9d8        T       G       .       PASS    .       GT      1
chromosome      3462141 a08fd538        A       T       .       PASS    .       GT      1
chromosome      3462143 f3d80833        T       A       .       PASS    .       GT      1
chromosome      3462145 b8c36e5e        C       T       .       PASS    .       GT      1
chromosome      3462146 f6f2bac0        C       A       .       PASS    .       GT      1
chromosome      3462151 ecc19214        A       G       .       PASS    .       GT      1

You can see in the clair3 VCF that there are positions marked at TP when there is no QUERY variant and vice versa. This doesn't seem to make a lot of sense? Then the same positions for DeepVariant are (correctly) marked as FPs and FNs.

I have attached two tarballs with the necessary data to reproduce this with the following command

vcfdist ATCC_10708__202309.100x.clair3.filter.vcf.gz truth.vcf.gz mutreference.fna --largest-variant 50 --credit-threshold 1.0 -p c3_ATCC_10708. -b ATCC_10708__202309.bed -mx 99

clair3_bug_data.tar.gz dv_bug_data.tar.gz

TimD1 commented 4 months ago

This may initially seem like a bug, but vcfdist is correct and behaving exactly as intended in the example above. Note that the Clair3 VCF contains an additional variant not present in the DeepVariant VCF:

chromosome      3462144 .       ACC     A       .       PASS    .       GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE  .:.:.:.:.:.:.:10999:.:.:0:.:.   1:TP:1.000000:5:0:gm:25:10999:2:0:0:0:0

For both VCFs, the cluster of variants in the range 346,2139-346,2146 are considered to be part of a complex variant (because they are in the same supercluster (SC=11000) and sync group (SG=2)). For the Clair3 VCF (but not the DeepVariant VCF), the set of variants in that region results in the same sequence as the Truth VCF, and causes all variants to be marked as TP.

For the bases 3462139-3462146 (inclusive): Reference Sequence: GTATTACC Truth VCF Sequence: GGTTAATA Clair3 VCF Sequence: GGTTAATA DeepVariant Sequence: GGTTAATACC

mbhall88 commented 4 months ago

Ahhh that is a MUCH better way of looking at it. It's so easy to get bogged down in looking at the individual variants and forget to zoom out a little and see the overall sequence. This is why I love vcfdist.

Thanks for the detailed response.