Documentation: training data to create a model

HKU-BAL / ClairS-TO

ClairS-TO - a deep-learning method for tumor-only somatic variant calling

BSD 3-Clause "New" or "Revised" License


Documentation: training data to create a model #14

Closed by RunningMatcha 2 months ago

RunningMatcha commented 3 months ago

Hi @aquaskyline,

Thanks for developing such a great tool! I am currently testing your variant caller on ONT data originating from different organisms. It recognizes SNPs nicely, but I am having trouble with short deletions. I was wondering whether it makes sense to train it on my own data to create a model. The Clair3 documentation explains how to train a model, but I could not find the equivalent documentation for ClairS-TO. If this is possible, would you please add instructions to the documentation on how users can create their own models?

Thank you very much!

aquaskyline commented 3 months ago

Preparing training data for ClairS and ClairS-TO is much more complicated than for Clair3 because they use synthetic data, and retraining seldom improves calling performance because of the scale and heterogeneity needed in the training data. On the other hand, you mentioned missing short deletions. Would it be possible for you to send us some IGV screenshots of the missing short deletions so we can see whether there is a solution?

RunningMatcha commented 3 months ago

Hi aquaskyline,

Thank you for your quick reply! I was perhaps too naive: I thought that, since we work with synthetic DNA, the accuracy would improve if we trained ClairS-TO on our own sequences (which are not always derived from human DNA). Have you tested ClairS-TO with different organisms? After your reply, I applied stricter filtering parameters and selected the proper model (by mistake, I had given ClairS-TO the "sup" model as input even though my data was basecalled in "hac" mode). Now I can see the expected deletions ;)

By the way, which filtering parameters do you recommend for fastq pre-processing?

Here is an example of a region with a known deletion, using relaxed filtering parameters (q > 10).

Here is the same sample with stricter filtering (q > 13). I lost a lot of reads, though.
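To make the trade-off between the two thresholds concrete, here is a minimal sketch of how one might compare read retention at q > 10 and q > 13. The thread does not say which tool was used or exactly what "q" means, so this assumes it is a mean per-read Phred quality (Phred+33 encoding), computed by averaging per-base error probabilities rather than the raw scores. All function names are hypothetical and for illustration only.

```python
import math
from typing import Dict, Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from plain 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "")
        next(it, "")  # '+' separator line
        qual = next(it, "")
        yield header, seq, qual


def mean_read_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred quality, averaging error probabilities (assumed convention)."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(reads: Iterable[Read], min_q: float) -> List[Read]:
    """Keep reads whose mean quality is strictly above min_q."""
    return [r for r in reads if mean_read_quality(r[2]) > min_q]


def retained_fraction(reads: Iterable[Read],
                      thresholds: Iterable[float]) -> Dict[float, float]:
    """Fraction of reads that survive each candidate threshold."""
    reads = list(reads)
    if not reads:
        return {q: 0.0 for q in thresholds}
    return {q: len(filter_reads(reads, q)) / len(reads) for q in thresholds}


# Tiny synthetic example (not real data): four reads at Q20, Q14, Q11, Q9.
example = [
    "@r1", "ACGT", "+", "5555",
    "@r2", "ACGT", "+", "////",
    "@r3", "ACGT", "+", ",,,,",
    "@r4", "ACGT", "+", "****",
]
fractions = retained_fraction(parse_fastq(example), [10, 13])
```

Running the same comparison on a real FASTQ (for example by passing an open file handle to `parse_fastq`) would show how many reads each threshold costs before committing to one.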

aquaskyline commented 3 months ago

Great that you rediscovered the deletions. As for fastq pre-processing parameters, the diversity of tumors, sequencing setups, and ONT data itself means that no single preset is perfect. So it is worth some tuning effort when working with a big batch of samples. But if you are working on only a few samples, the default usually works pretty well already.