show more allele counts in INFO

hoyonh commented 1 month ago

I have enjoyed using Clair3 (clair3-arm64) for examining natural variants at the whole-genome level. However, I've recently noticed some irregularities that likely need correction.

Specifically, I have observed a large number of variants that Clair3 could recognize as variants with GT=1/1 and AF=1 but does not. Instead, these variants are labeled (in my opinion, mislabeled) as either RefCalls with AF=0 or variants with GT=0/1 and AF=1.

I have examined 15 Illumina samples aligned with bwa-mem. The number of RefCalls with AF=0 ranges from over 100 to over 7000, along with smaller numbers of called variants with GT=0/1 and AF=1. When examining these using IGV, they typically appear as 100% mutant (ALT).

Additionally, RefCalls with very low AF values above 0 should probably be labeled as variants as well.

aquaskyline commented 1 month ago

AF is just one of the many factors Clair3 uses for deciding a candidate is or is not a variant. Yes that among all the Clair3 called variants in a full genome, there will be some variants with very high VAF but called as 0/0, and some with very low VAF but called as 1/1. High VAF could be a variant, or sequencing error, or alignment error, or statistical bias at lower coverages, etc. If high VAF itself is enough to justify a variant, writing a variant caller would be so much easier.

Clair3 still makes mistakes, but Clair3 was found to be good at finding a variant when the observed VAF is low, and denying a false positive one even when the VAF is like 1. Whether a decision by Clair3 is correct or not can be orthogonally validated by a quorum of other variant callers, or by wet-lab experiments.

Out of my personal curiosity, we provided no easy installation option for clair3-arm64, except for a tedious guide for Mac with Apple Silicon. Were you using that? Are the GPUs involved in calculation? Was it running stably?

hoyonh commented 1 month ago

Thanks for the quick reply. To provide more context, I rely mainly on AF (VAF) and DP for my evaluations. These are Clair3-interpreted AF and DP values, which differ from the equivalents visible with IGV in bwa-mem BAM alignments. For a set of RefCalls with AF=0, the DP ranges from 2 to 242 (median = 16, 25% quantile = 6). Low DP is indeed more common. However, the presence of many RefCalls with high DP and low AF suggests that the threshold for variant calling might be too stringent.

While all variant callers make mistakes, I rate Clair3 and DeepVariant as the best. I find it easier to work with Clair3's output compared to DeepVariant. Evaluating variant calls by Clair3 using IGV is more straightforward in general, but RefCall evaluations pose challenges because no ALT calls are provided. It would be more convenient if ALT calls were made on more RefCalls with low AF values (i.e., all or the vast majority are different from REF).

I note that DeepVariant creates its own alignments. Consequently, the same DNA changes predicted by DeepVariant and Clair3 can appear quite different when examining VCF, making it harder to make accurate comparisons with Clair3. I also find it more difficult to make sense of their VAF and DP.

Validation by wet-lab experiments should be conducted for important cases.

With clair3-arm64, I followed the guide for Mac with Apple Silicon. Not all steps in the guide worked for me, but I found fixes for them (installing samtools-1.15.1 and htslib-1.15.1). I am using a first-generation M1 laptop, and the runs have been fast. I turned to this version because I couldn't find a way to successfully install and run Clair3 on our local HPC. I typically used 8 threads. I don't recall use of GPU; I used the standard settings and made no attempt at training. Here is a typical command (with abbreviated input filenames):

./run_clair3.sh --bam_fn=bamfile --ref_fn=ref --threads=8 --platform="ilmn" --model_path=$HOME/Clair3/models/ilmn --output=$HOME/Clair3/test --include_all_ctgs --samtools=$HOME/Clair3/samtools-1.15.1/samtools --pypy=$HOME/Clair3/pypy3.9-v7.3.8-osx64/bin/pypy --parallel=/opt/homebrew/bin/parallel --whatshap=$HOME/.local/bin/whatshap --longphase=$HOME/Clair3/longphase-1.3/longphase

aquaskyline commented 1 month ago

Understood. Maybe adding the following detail (which is given in ClairS and ClairS-TO) to Clair3 to facilitate downstream variant filtering or resurrection is useful.

            ##INFO=<ID=FAU,Number=1,Type=Integer,Description="Count of A in forward strand in the tumor BAM">
            ##INFO=<ID=FCU,Number=1,Type=Integer,Description="Count of C in forward strand in the tumor BAM">
            ##INFO=<ID=FGU,Number=1,Type=Integer,Description="Count of G in forward strand in the tumor BAM">
            ##INFO=<ID=FTU,Number=1,Type=Integer,Description="Count of T in forward strand in the tumor BAM">
            ##INFO=<ID=RAU,Number=1,Type=Integer,Description="Count of A in reverse strand in the tumor BAM">
            ##INFO=<ID=RCU,Number=1,Type=Integer,Description="Count of C in reverse strand in the tumor BAM">
            ##INFO=<ID=RGU,Number=1,Type=Integer,Description="Count of G in reverse strand in the tumor BAM">
            ##INFO=<ID=RTU,Number=1,Type=Integer,Description="Count of T in reverse strand in the tumor BAM">

HKU-BAL / Clair3

show more allele counts in INFO #320