Closed by mbhall88 4 months ago
This may initially seem like a bug, but vcfdist is correct and behaving exactly as intended in the example above. Note that the Clair3 VCF contains an additional variant not present in the DeepVariant VCF:
chromosome 3462144 . ACC A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE .:.:.:.:.:.:.:10999:.:.:0:.:. 1:TP:1.000000:5:0:gm:25:10999:2:0:0:0:0
For both VCFs, the cluster of variants in the range 3,462,139-3,462,146 is considered to be part of a complex variant, because the variants are in the same supercluster (SC=11000) and sync group (SG=2). For the Clair3 VCF (but not the DeepVariant VCF), the set of variants in that region results in the same sequence as the Truth VCF, which causes all variants to be marked as TP.
For the bases 3462139-3462146 (inclusive):
Reference Sequence: GTATTACC
Truth VCF Sequence: GGTTAATA
Clair3 VCF Sequence: GGTTAATA
DeepVariant Sequence: GGTTAATACC
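To make the sequence-level comparison concrete, here is a minimal Python sketch. It is not vcfdist's implementation; `apply_variants` is a hypothetical helper written only for illustration. It applies the one variant quoted above (the `ACC` to `A` deletion at 3462144) to the reference window, then compares the three resulting sequences exactly as listed above. The other variants in the cluster are not shown in this thread, so they are not reconstructed here.

```python
def apply_variants(ref, start, variants):
    """Apply non-overlapping variants to a reference window.

    ref      -- reference bases for the window
    start    -- 1-based coordinate of ref[0] (VCF-style)
    variants -- iterable of (pos, ref_allele, alt_allele), pos is 1-based
    """
    pieces = []
    cursor = 0  # index into ref of the next base not yet copied
    for pos, ref_allele, alt_allele in sorted(variants):
        offset = pos - start
        if offset < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if ref[offset:offset + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele mismatch at {pos}")
        pieces.append(ref[cursor:offset])  # unchanged bases before the variant
        pieces.append(alt_allele)          # substitute the ALT allele
        cursor = offset + len(ref_allele)
    pieces.append(ref[cursor:])            # unchanged bases after the last variant
    return "".join(pieces)


# Values copied from the thread above.
REFERENCE = "GTATTACC"
WINDOW_START = 3462139
TRUTH = "GGTTAATA"
CLAIR3 = "GGTTAATA"
DEEPVARIANT = "GGTTAATACC"

# The extra Clair3 variant: a 2 bp deletion that removes the trailing "CC".
only_deletion = apply_variants(REFERENCE, WINDOW_START, [(3462144, "ACC", "A")])
print(only_deletion)         # GTATTA

# Compare whole-window sequences rather than individual records.
print(CLAIR3 == TRUTH)       # True
print(DEEPVARIANT == TRUTH)  # False
```

The DeepVariant sequence differs from the truth only by the trailing `CC`, which is exactly what the extra Clair3 deletion removes.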
Ahhh that is a MUCH better way of looking at it. It's so easy to get bogged down in looking at the individual variants and forget to zoom out a little and see the overall sequence. This is why I love vcfdist.
Thanks for the detailed response.
Me again 😬
I have a curious case where the same variants from two different variant callers give a TP for one and an FP for the other...
Here is the variant (with context) from Clair3 with the vcfdist assessments
and the same variants from DeepVariant
the truth variants for this region are
You can see in the Clair3 VCF that there are positions marked as TP when there is no QUERY variant, and vice versa. This doesn't seem to make a lot of sense? Then the same positions for DeepVariant are (correctly) marked as FPs and FNs.
I have attached two tarballs with the necessary data to reproduce this with the following command
clair3_bug_data.tar.gz dv_bug_data.tar.gz