HKU-BAL / ClairS

ClairS - a deep-learning method for long-read somatic small variant calling
BSD 3-Clause "New" or "Revised" License

add v4.3.0 model for clair3 params #19

Closed (bakeronit closed 5 months ago)

bakeronit commented 7 months ago

Hi, can you add the option to use the latest clair3 model for germline variant elimination? Thank you.

zhengzhenxian commented 7 months ago

Hi, @bakeronit

The Clair3 model option (--clair3_model_path) is provided as an experimental option. Please keep in mind to use the correct ClairS model with the corresponding Clair3 model to get reliable results. Thanks!

Zhenxian

bakeronit commented 7 months ago

Oh, yes. So you mean I have to use the v4.3.0 model for ClairS to match the Clair3 v4.3.0 model (if I did basecalling with the v4.3.0 model)?

zhengzhenxian commented 7 months ago

The Clair3 v4.3.0 model (r1041_e82_400bps_sup_v430) should be the latest model trained by ONT, going by the details I found here. The sequencing chemistry used is R10.4.1 E8.2 (5 kHz).

So you can pass -p ont_r10_dorado_5khz to select the corresponding ClairS model. That model was trained on the same chemistry as the Clair3 v4.3.0 model; more detail is listed here.
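To make the pairing concrete, here is a minimal sketch that assembles a ClairS command line with a matching platform and Clair3 model. Only `-p` and `--clair3_model_path` come from this thread. The executable name `run_clairs`, the other flag names, and every file path are assumptions for illustration and should be checked against the ClairS help output before use.

```python
import shlex


def build_clairs_command(tumor_bam, normal_bam, ref, output_dir,
                         platform, clair3_model_path=None, threads=8,
                         executable="run_clairs"):
    """Return an argv list for a ClairS run.

    Only "-p" and "--clair3_model_path" are taken from the thread; the
    executable name and the remaining flags are assumptions.
    """
    cmd = [
        executable,
        "--tumor_bam_fn", tumor_bam,    # assumed flag name
        "--normal_bam_fn", normal_bam,  # assumed flag name
        "--ref_fn", ref,                # assumed flag name
        "--threads", str(threads),      # assumed flag name
        "--output_dir", output_dir,     # assumed flag name
        "-p", platform,                 # selects the ClairS model
    ]
    if clair3_model_path is not None:
        # Experimental option: the Clair3 germline model should match
        # the chemistry of the ClairS model chosen with -p.
        cmd += ["--clair3_model_path", clair3_model_path]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_clairs_command(
    tumor_bam="tumor.bam",
    normal_bam="normal.bam",
    ref="ref.fa",
    output_dir="clairs_out",
    platform="ont_r10_dorado_5khz",
    clair3_model_path="models/r1041_e82_400bps_sup_v430",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps paths with spaces intact if it is later passed to `subprocess.run`.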

bakeronit commented 7 months ago

Yes, I got it from Rerio. I'll have a try, thanks.

bakeronit commented 7 months ago

Hi, I would like to follow up here. I noticed that the new v0.1.7 update includes hac and sup models for ONT data basecalled with v420. I tried the latest ClairS with ont_r10_dorado_sup_5khz and the Clair3 model r1041_e82_400bps_sup_v430 (my data was basecalled with v430). I found a much better recall rate but lower precision.

I am wondering whether this is caused by mismatched ClairS and Clair3 models. Are we expecting an ont_r10_dorado_sup_5khz_v430 model soon?

We do appreciate the higher recall, but I am curious why using the previous ont_r10_dorado_5khz in ClairS gives better precision.

Thank you
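For readers comparing two model combinations in the same way, the recall/precision trade-off described above can be quantified from true-positive, false-positive, and false-negative counts. The sketch below uses made-up counts for illustration only; they are not results from this thread.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from variant-call counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for two runs against the same truth set.
runs = {
    "run_a": {"tp": 80, "fp": 10, "fn": 20},  # fewer false positives
    "run_b": {"tp": 90, "fp": 30, "fn": 10},  # fewer missed variants
}

for name, counts in runs.items():
    p, r, f1 = precision_recall_f1(**counts)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Reporting F1 alongside the two rates helps show whether a gain in recall outweighs the loss in precision for a given truth set.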

zhengzhenxian commented 7 months ago

The two Clair3 germline models (v420 and v430) were trained using the same datasets. However, compared with v420, the v430 model benefits from additional training data, particularly for high-coverage and low-coverage scenarios.

Hence, I am afraid we will not release an "updated" ont_r10_dorado_sup_5khz_v430 model, as we are also using the same training data for the somatic model. The primary difference between the results lies in the germline phasing results themselves. We recommend that you select the appropriate model for your data.