[Ask for insights on Illumina results regarding ClairS workflow/design choices]

HKU-BAL / ClairS

ClairS - a deep-learning method for long-read somatic small variant calling

BSD 3-Clause "New" or "Revised" License

[Ask for insights on Illumina results regarding ClairS workflow/design choices] #15

Closed by quito418 1 year ago

quito418 commented 1 year ago

Hello ClairS Team,

Firstly, I'd like to express my gratitude for your outstanding research and for making the code available to the community.

On the Illumina dataset, I am curious where ClairS gains its advantage over other baselines such as VarNet and NeuSomatic: is it the workflow/design, the synthetic dataset construction, or a larger or better model architecture?

Have you compared SNP/INDEL calling performance between ClairS and VarNet or NeuSomatic when they are trained on identical datasets? I am curious whether the amount and quality of the training data had the largest impact on performance. (I saw the ablation study on the Nanopore dataset in your supplementary material.)
What is the key factor that improved performance over NeuSomatic and VarNet?

Thank you for your time and insights.

Best regards, Quito418

zhengzhenxian commented 1 year ago

Hi, @quito418,

Thanks for your interest and the great questions!

The training design of ClairS differs from that of VarNet and NeuSomatic. ClairS relies on reliable synthetic data for model training. To our knowledge, VarNet was trained on hundreds of real tumor panels to achieve robust performance. NeuSomatic was trained on synthetic data with spike-in mutations as well as real panel data, as described here. Because of these design differences, training all three callers on the same dataset would be challenging. Please let us know if you have any findings on this.
Our comparison with NeuSomatic and VarNet is on short-read data. We were mainly interested in how the data synthesis workflow performs on short-read data, since there is currently no reliable long-read somatic variant caller to evaluate against. We hope that ClairS can complement other short-read callers, especially in specific regions where they may fall short.

quito418 commented 1 year ago

Awesome, thank you for the kind response!

quito418 commented 1 year ago

Hi,

I've noticed that the latest publicly available version of NeuSomatic appears to be v0.2.1 from 2019. Can you confirm this was used for the comparison with ClairS?

Additionally, regarding the linked paper that discusses the updated results of NeuSomatic trained on various datasets: are those models currently not publicly accessible, or did you use a model from that work?

Thank you for your assistance.

zhengzhenxian commented 1 year ago

@quito418

Yes, we used the latest version of NeuSomatic (v0.2.1).

I checked, and it seems their training material is not publicly available now. Those models were trained on data from SEQC, which includes the HCC1395/BL dataset that we used for benchmarking. Hence, to keep the comparison fair, we did not include those models in the benchmark.

quito418 commented 1 year ago

Thank you for providing the specification!