HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
238 stars 27 forks source link

Incorrect SNP representation when it overlaps a deletion #272

Open ymcki opened 7 months ago

ymcki commented 7 months ago

When I was trying to run vcfeval with the same vcf generated by Clair3 using the following command, I find that I am getting 3518 out of 5440881 FPs/FNs

./rtg vcfeval -t ~/genome/hs38.sdf -b sample.vcf.gz -c sample.vcf.gz --ref-overlap -o sample

Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure

2.000            5440881        5440881       3518       3518     0.9994       0.9994     0.9994
 None            5440881        5440881       3518       3518     0.9994       0.9994     0.9994

Upon looking at the FPs/FNs, I noticed that quite many of them are caused by incorrect SNP calls when a deletion overlaps an SNP. For example: chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR001737775.1|pseudogene||n.*31573160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268 chr1 4620715 . T A 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 1/1:7:58:39,18:0.3103:.

By loading haplotagged bam to IGV, I can see that the deletion is on one haplotype and the SNP is on the other haplotype.. So the correct call should be 1|0 for the deletion and 1|2 for the SNP where 2 is for the missing allele * based on VCF spec. https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele

So the corrected calls should be: chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR001737775.1|pseudogene||n.*31573160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268 chr1 4620715 . T A, 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 2|1:7:58:39,18:0.3103:.

and vcfeval doesn't flag them as FPs/FNs.

By assuming the hetero deletion calls are correct, I fixed the SNP calls with the following awk commands, awk '!/^#/&&length($4)>length($5){split($10,a,":");d=length($4)-length($5);if(a[1]=="0|1"||a[1]=="1|0"||a[1]=="0/1"){for(i=1;i<=d;++i){printf "%s\t%d\t%s\n",$1,$2+i,a[1]}}}' sample.vcf > sample.del awk 'BEGIN{b["0|1"]="1|2";b["1|0"]="2|1";b["0/1"]="1/2";while(getline<"sample.del"){a[sprintf("%s:%d",$1,$2)]=b[$3]}}/^#/{print}!/^#/&&length($4)!=length($5){print}!/^#/&&length($4)==length($5){p=sprintf("%s:%d",$1,$2);if(a[p]!=""){$5=sprintf("%s,*",$5);$10=sprintf("%s%s",a[p],substr($10,4))}OFS="\t";print}' sample.vcf > sample_fixed.vcf

The number of FPs/FNs is reduced to 942 which improves precision/sensitivity from 99.9354% to 99.9827%

In the remaining 942, I noticed that 295 of them are caused by overlapping homo deletion and homo SNP calls. The rest are more complex calls.

It would be great if Clair3 can fix these problems in the future release. It will help greatly in benchmarking and also reducing the false homo SNP calls that are often ignored by clinicians. Thank you very much for your time.

aquaskyline commented 7 months ago

A fix will be in the next version (~v1.0.6~~v1.0.7~TBD)

To double confirm the correct calls should be: chr1 4620711 . AAAAT A 6.41 PASS 0/1 chr1 4620715 . T A,* 7.2 PASS 1/2

Correct?

ymcki commented 7 months ago

A fix will be in the next version (v1.0.6)

To double confirm the correct calls should be: chr1 4620711 . AAAAT A 6.41 PASS 0/1 chr1 4620715 . T A,* 7.2 PASS 1/2

Correct?

Yes, this should make vcfeval happy.

aquaskyline commented 7 months ago

Many thanks for the detailed analysis and report.