Training for PacBio HiFi indel calling

proteinosome commented 1 year ago

Hi, may I know if you're planning to enable somatic indel calls with PacBio HiFi? If not, would you be able to provide an instruction on how to train our own model for ClairS with the latest GIAB data from Revio? They're available here: https://downloads.pacbcloud.com/public/revio/2022Q4/

Thank you.

aquaskyline commented 1 year ago

An experimental model trained with Sequel II HIFI data is available at https://github.com/HKU-BAL/ClairS#pre-trained-models. You might want to give it a test.

proteinosome commented 1 year ago

Thanks @aquaskyline I saw that model. I was just wondering if it supports indel calls since it's mentioned that currently indel calling is only supported for the ONT F10 model.

aquaskyline commented 1 year ago

No, the HiFi model doesn't support Indel calling at the moment. I don't have any real HiFi cancer data, thus unable to tell how the model performs on SNV, let alone the more challenging somatic indels. If you can do any benchmarks on the HiFi model on your side, kindly let us know the performance. If SNV calling works, I will move on to indel calling.

proteinosome commented 1 year ago

@aquaskyline Sure, I will let you know how it goes. Thank you.

proteinosome commented 1 year ago

One quick question, I noticed that for the ONT benchmark set there's 31444 SNV > 0.05, but when I took the truth set and filtered with TVAF there's 38351 entries:

seqc2_benchmark_set bcftools filter -i 'TVAF>=0.05' high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz | wc -l
38351

May I check with you how did you subset the variants to those 31444 SNVs? Thank you.

aquaskyline commented 1 year ago

@zhengzhenxian

proteinosome commented 1 year ago

Thanks @aquaskyline . While waiting for @zhengzhenxian to get back, I tried ClairS on a set of 30X tumor/normal HCC1395 HiFi dataset and see the following results (with the full 39447 SNVs):

python ~/softwares/ClairS/clairs.py compare_vcf \
    --truth_vcf_fn high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz \
    --input_vcf_fn output.vcf.gz \
    --bed_fn High-Confidence_Regions_v1.2.bed \
    --output_dir $(pwd)/hifi_benchmark \
    --input_filter_tag 'PASS'

[INFO] Total input records: 26585, truth records: 39447, records out of BED:0
Type         Precision    Recall       F1-score     TP           FP           FN
SNV          0.9868       0.665        0.7946       26234        351          13213

Removing the "PASS" filter:

[INFO] Total input records: 29954, truth records: 39447, records out of BED:0
Type         Precision    Recall       F1-score     TP           FP           FN
SNV          0.9504       0.7217       0.8204       28468        1486         10979

It looks like precision wise ClairS is performing pretty well, but there is a lot of false-negatives that's bringing recall down. I wonder if there's any parameter we can tune or is that something that requires re-training (happy to collaborate with you).

I should also mention that increasing tumor coverage to 60X (normal at 40X) showed similar metrics, so I don't think it's a matter of coverage:

# With PASS filter
[INFO] Total input records: 26268, truth records: 39447, records out of BED:0
Type         Precision    Recall       F1-score     TP           FP           FN
SNV          0.9908       0.6598       0.7921       26027        241          13420

# Without PASS filter
[INFO] Total input records: 29522, truth records: 39447, records out of BED:0
Type         Precision    Recall       F1-score     TP           FP           FN
SNV          0.9625       0.7203       0.824        28415        1107         11032

aquaskyline commented 1 year ago

We were not using TVAF in the SEQC2 VCF for filtering. There are multiple TVAF values in the VCF, but instead of using them, we used the TVAF calculated in our own benchmarking datasets.

I've just updated the README, specifically the "Performance figures" section, to make it clearer (my previous writing might have caused your confusion).

For the Illumina dataset, now it shows

50-fold HCC1395 (tumor) and 40-fold HCC1395BL (normal) of Illumina NovaSeq 6000 data
Truth: 39,447 high confidence (HighConf) and medium confidence (MedConf) SNV from the SEQC2 HCC1395/BL truths (Fang et al., 2021), the TVAF (tumor variant allele frequency) of which is ≥0.05 in the above dataset

For the ONT dataset, now it shows

70-fold HCC1395 (tumor) and 45-fold HCC1395BL (normal) of ONT R10.4.1 data
Truth: 31,444 high confidence (HighConf) and medium confidence (MedConf) SNV from the SEQC2 HCC1395/BL truths (Fang et al., 2021), the TVAF (tumor variant allele frequency) of which is ≥0.05 in the above dataset

We think that the use of platform-specific TVAF helps to better reveal the algorithm performance itself across platforms. To do so, we used the command below

pypy3 clairs.py compare_vcf \
    --truth_vcf_fn high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz \
    --input_vcf_fn output.vcf.gz \
    --bed_fn High-Confidence_Regions_v1.2.bed \
    --output_dir benchmark \
    --input_filter_tag 'PASS' \
    --normal_bam_fn $NORMAL_BAM \
    --tumor_bam_fn $TUMOR_BAM \
    --min_af 0.05 \
    --threads 48 \
    --output_best_f1_score \
    --debug   #Output all quality score cut-off

For 30x HIFI data, my wild guess is the number of SNVs with TVAF≥0.05 will be somewhere around 35k, which is lower than 39k, thus will bump up the recall rate.

proteinosome commented 1 year ago

Here's the performance with the AF filter as per instruction (I changed input_filter_tag to None just to see the performance at all cut-offs). It looks like at 60X there's a sharp increase in FN and a modest improvement in FP.

Could it be that at 60X there's a higher chance to see sequencing error and that's penalized significantly in the current statistical implementation? I think this is a reasonably performance, thou, as the absolute number of TP and FP look pretty good. And this is using Sequel model on Revio, so I think there's also going to be some improvement in that with a new model?

30X tumor 30X normal

[INFO] Total input records: 29950, truth records: 34457, records out of BED:0
Type         Precision    Recall       F1-score     TP           FP           FN
SNV          0.9505       0.826        0.8839       28463        1482         5994
SNV(Best F1) 0.9781       0.8215       0.8930       28305        633          6152         34457

5            0.9781       0.8215       0.8930       28305        633          6152         34457
4            0.9770       0.8220       0.8929       28325        666          6132         34457
3            0.9763       0.8223       0.8927       28334        689          6123         34457
2            0.9758       0.8224       0.8926       28338        702          6119         34457
1            0.9757       0.8224       0.8925       28338        707          6119         34457
6            0.9803       0.8160       0.8906       28118        566          6339         34457
0            0.9505       0.8260       0.8839       28463        1482         5994         34457
7            0.9838       0.7911       0.8770       27260        448          7197         34457
8            0.9868       0.7614       0.8596       26236        351          8221         34457
9            0.9895       0.7278       0.8387       25079        266          9378         34457
10           0.9923       0.6888       0.8131       23733        185          10724        34457
11           0.9945       0.6399       0.7787       22050        123          12407        34457
12           0.9969       0.5822       0.7351       20061        63           14396        34457
13           0.9983       0.5133       0.6780       17686        31           16771        34457
14           0.9988       0.4334       0.6045       14935        18           19522        34457
15           0.9993       0.3436       0.5114       11839        8            22618        34457
16           0.9998       0.2510       0.4012       8648         2            25809        34457
17           0.9996       0.1615       0.2780       5564         2            28893        34457
18           0.9997       0.0839       0.1549       2892         1            31565        34457
19           1.0000       0.0321       0.0623       1107         0            33350        34457
20           1.0000       0.0074       0.0147       256          0            34201        34457
21           1.0000       0.0007       0.0013       23           0            34434        34457
22           1.0000       0.0001       0.0001       2            0            34455        34457

60X tumor 41X normal:

[INFO] Total input records: 29516, truth records: 36713, records out of BED:0
Type         Precision    Recall       F1-score     TP           FP           FN
SNV          0.9627       0.7739       0.858        28413        1101         8300
SNV(Best F1) 0.9844       0.7700       0.8641       28268        448          8445         36713

5            0.9844       0.7700       0.8641       28268        448          8445         36713
4            0.9834       0.7703       0.8639       28280        478          8433         36713
3            0.9827       0.7705       0.8637       28286        499          8427         36713
2            0.9822       0.7705       0.8636       28287        512          8426         36713
6            0.9862       0.7667       0.8627       28148        395          8565         36713
0            0.9627       0.7739       0.8580       28413        1101         8300         36713
7            0.9886       0.7416       0.8474       27225        314          9488         36713
8            0.9908       0.7090       0.8266       26030        241          10683        36713
9            0.9928       0.6749       0.8036       24779        180          11934        36713
10           0.9940       0.6329       0.7734       23237        141          13476        36713
11           0.9960       0.5818       0.7346       21360        85           15353        36713
12           0.9973       0.5262       0.6889       19319        53           17394        36713
13           0.9980       0.4614       0.6310       16938        34           19775        36713
14           0.9988       0.3920       0.5630       14391        17           22322        36713
15           0.9990       0.3162       0.4804       11610        12           25103        36713
16           0.9998       0.2319       0.3764       8513         2            28200        36713
17           1.0000       0.1589       0.2742       5832         0            30881        36713
18           1.0000       0.0968       0.1766       3555         0            33158        36713
19           1.0000       0.0488       0.0930       1790         0            34923        36713
20           1.0000       0.0150       0.0295       550          0            36163        36713
21           1.0000       0.0015       0.0030       55           0            36658        36713
22           1.0000       0.0001       0.0002       4            0            36709        36713
23           1.0000       0.0000       0.0001       1            0            36712        36713

aquaskyline commented 1 year ago

Could it be that at 60X there's a higher chance to see sequencing error and that's penalized significantly in the current statistical implementation

It is more likely because the HiFi training data we used to train the Sequel 2 model is just ~~~30-40x~~ HG003 54x, HG004 52x. The model was not sufficiently trained with higher coverage samples like 60x, thus making more mistakes. ClairS supports tumor raw coverage up to ~80x, and will downsample the input on-the-fly to ~80x if exceeded. To solve the problem, we can either increase the training data coverage to 80x or above, or tune ClairS to downsample HiFI data to a lower coverage say 30x. Obviously, the former produces better precision and recall.

proteinosome commented 1 year ago

@aquaskyline Thanks for your comment. That makes sense. I guess at this stage there's not much I can do from my end, but I should mention that I will be happy to help with the need for data and benchmarking if that's of interest to your group. Feel free to email me directly at kpin@pacb.com moving forward. We're hoping to plug the gap in small variants calling for HiFi cancer dataset as soon as possible :)

HKU-BAL / ClairS

Training for PacBio HiFi indel calling #9