Closed proteinosome closed 1 year ago
An experimental model trained with Sequel II HIFI data is available at https://github.com/HKU-BAL/ClairS#pre-trained-models. You might want to give it a test.
Thanks @aquaskyline I saw that model. I was just wondering if it supports indel calls since it's mentioned that currently indel calling is only supported for the ONT F10 model.
No, the HiFi model doesn't support Indel calling at the moment. I don't have any real HiFi cancer data, thus unable to tell how the model performs on SNV, let alone the more challenging somatic indels. If you can do any benchmarks on the HiFi model on your side, kindly let us know the performance. If SNV calling works, I will move on to indel calling.
@aquaskyline Sure, I will let you know how it goes. Thank you.
One quick question, I noticed that for the ONT benchmark set there's 31444 SNV > 0.05, but when I took the truth set and filtered with TVAF there's 38351 entries:
seqc2_benchmark_set bcftools filter -i 'TVAF>=0.05' high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz | wc -l
38351
May I check with you how did you subset the variants to those 31444 SNVs? Thank you.
@zhengzhenxian
Thanks @aquaskyline . While waiting for @zhengzhenxian to get back, I tried ClairS on a set of 30X tumor/normal HCC1395 HiFi dataset and see the following results (with the full 39447 SNVs):
python ~/softwares/ClairS/clairs.py compare_vcf \
--truth_vcf_fn high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz \
--input_vcf_fn output.vcf.gz \
--bed_fn High-Confidence_Regions_v1.2.bed \
--output_dir $(pwd)/hifi_benchmark \
--input_filter_tag 'PASS'
[INFO] Total input records: 26585, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9868 0.665 0.7946 26234 351 13213
Removing the "PASS" filter:
[INFO] Total input records: 29954, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9504 0.7217 0.8204 28468 1486 10979
It looks like precision wise ClairS is performing pretty well, but there is a lot of false-negatives that's bringing recall down. I wonder if there's any parameter we can tune or is that something that requires re-training (happy to collaborate with you).
I should also mention that increasing tumor coverage to 60X (normal at 40X) showed similar metrics, so I don't think it's a matter of coverage:
# With PASS filter
[INFO] Total input records: 26268, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9908 0.6598 0.7921 26027 241 13420
# Without PASS filter
[INFO] Total input records: 29522, truth records: 39447, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9625 0.7203 0.824 28415 1107 11032
We were not using TVAF
in the SEQC2 VCF for filtering. There are multiple TVAF
values in the VCF, but instead of using them, we used the TVAF calculated in our own benchmarking datasets.
I've just updated the README, specifically the "Performance figures" section, to make it clearer (my previous writing might have caused your confusion).
For the Illumina dataset, now it shows
For the ONT dataset, now it shows
We think that the use of platform-specific TVAF helps to better reveal the algorithm performance itself across platforms. To do so, we used the command below
pypy3 clairs.py compare_vcf \
--truth_vcf_fn high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz \
--input_vcf_fn output.vcf.gz \
--bed_fn High-Confidence_Regions_v1.2.bed \
--output_dir benchmark \
--input_filter_tag 'PASS' \
--normal_bam_fn $NORMAL_BAM \
--tumor_bam_fn $TUMOR_BAM \
--min_af 0.05 \
--threads 48 \
--output_best_f1_score \
--debug #Output all quality score cut-off
For 30x HIFI data, my wild guess is the number of SNVs with TVAF≥0.05 will be somewhere around 35k, which is lower than 39k, thus will bump up the recall rate.
Here's the performance with the AF filter as per instruction (I changed input_filter_tag to None just to see the performance at all cut-offs). It looks like at 60X there's a sharp increase in FN and a modest improvement in FP.
Could it be that at 60X there's a higher chance to see sequencing error and that's penalized significantly in the current statistical implementation? I think this is a reasonably performance, thou, as the absolute number of TP and FP look pretty good. And this is using Sequel model on Revio, so I think there's also going to be some improvement in that with a new model?
30X tumor 30X normal
[INFO] Total input records: 29950, truth records: 34457, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9505 0.826 0.8839 28463 1482 5994
SNV(Best F1) 0.9781 0.8215 0.8930 28305 633 6152 34457
5 0.9781 0.8215 0.8930 28305 633 6152 34457
4 0.9770 0.8220 0.8929 28325 666 6132 34457
3 0.9763 0.8223 0.8927 28334 689 6123 34457
2 0.9758 0.8224 0.8926 28338 702 6119 34457
1 0.9757 0.8224 0.8925 28338 707 6119 34457
6 0.9803 0.8160 0.8906 28118 566 6339 34457
0 0.9505 0.8260 0.8839 28463 1482 5994 34457
7 0.9838 0.7911 0.8770 27260 448 7197 34457
8 0.9868 0.7614 0.8596 26236 351 8221 34457
9 0.9895 0.7278 0.8387 25079 266 9378 34457
10 0.9923 0.6888 0.8131 23733 185 10724 34457
11 0.9945 0.6399 0.7787 22050 123 12407 34457
12 0.9969 0.5822 0.7351 20061 63 14396 34457
13 0.9983 0.5133 0.6780 17686 31 16771 34457
14 0.9988 0.4334 0.6045 14935 18 19522 34457
15 0.9993 0.3436 0.5114 11839 8 22618 34457
16 0.9998 0.2510 0.4012 8648 2 25809 34457
17 0.9996 0.1615 0.2780 5564 2 28893 34457
18 0.9997 0.0839 0.1549 2892 1 31565 34457
19 1.0000 0.0321 0.0623 1107 0 33350 34457
20 1.0000 0.0074 0.0147 256 0 34201 34457
21 1.0000 0.0007 0.0013 23 0 34434 34457
22 1.0000 0.0001 0.0001 2 0 34455 34457
60X tumor 41X normal:
[INFO] Total input records: 29516, truth records: 36713, records out of BED:0
Type Precision Recall F1-score TP FP FN
SNV 0.9627 0.7739 0.858 28413 1101 8300
SNV(Best F1) 0.9844 0.7700 0.8641 28268 448 8445 36713
5 0.9844 0.7700 0.8641 28268 448 8445 36713
4 0.9834 0.7703 0.8639 28280 478 8433 36713
3 0.9827 0.7705 0.8637 28286 499 8427 36713
2 0.9822 0.7705 0.8636 28287 512 8426 36713
6 0.9862 0.7667 0.8627 28148 395 8565 36713
0 0.9627 0.7739 0.8580 28413 1101 8300 36713
7 0.9886 0.7416 0.8474 27225 314 9488 36713
8 0.9908 0.7090 0.8266 26030 241 10683 36713
9 0.9928 0.6749 0.8036 24779 180 11934 36713
10 0.9940 0.6329 0.7734 23237 141 13476 36713
11 0.9960 0.5818 0.7346 21360 85 15353 36713
12 0.9973 0.5262 0.6889 19319 53 17394 36713
13 0.9980 0.4614 0.6310 16938 34 19775 36713
14 0.9988 0.3920 0.5630 14391 17 22322 36713
15 0.9990 0.3162 0.4804 11610 12 25103 36713
16 0.9998 0.2319 0.3764 8513 2 28200 36713
17 1.0000 0.1589 0.2742 5832 0 30881 36713
18 1.0000 0.0968 0.1766 3555 0 33158 36713
19 1.0000 0.0488 0.0930 1790 0 34923 36713
20 1.0000 0.0150 0.0295 550 0 36163 36713
21 1.0000 0.0015 0.0030 55 0 36658 36713
22 1.0000 0.0001 0.0002 4 0 36709 36713
23 1.0000 0.0000 0.0001 1 0 36712 36713
Could it be that at 60X there's a higher chance to see sequencing error and that's penalized significantly in the current statistical implementation
It is more likely because the HiFi training data we used to train the Sequel 2 model is just ~30-40x HG003 54x, HG004 52x. The model was not sufficiently trained with higher coverage samples like 60x, thus making more mistakes. ClairS supports tumor raw coverage up to ~80x, and will downsample the input on-the-fly to ~80x if exceeded. To solve the problem, we can either increase the training data coverage to 80x or above, or tune ClairS to downsample HiFI data to a lower coverage say 30x. Obviously, the former produces better precision and recall.
@aquaskyline Thanks for your comment. That makes sense. I guess at this stage there's not much I can do from my end, but I should mention that I will be happy to help with the need for data and benchmarking if that's of interest to your group. Feel free to email me directly at kpin@pacb.com moving forward. We're hoping to plug the gap in small variants calling for HiFi cancer dataset as soon as possible :)
Hi, may I know if you're planning to enable somatic indel calls with PacBio HiFi? If not, would you be able to provide an instruction on how to train our own model for ClairS with the latest GIAB data from Revio? They're available here: https://downloads.pacbcloud.com/public/revio/2022Q4/
Thank you.