Ground Truths required for training

bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq

http://bioinform.github.io/somaticseq/

BSD 2-Clause "Simplified" License


Ground Truths required for training #121

Closed (harish0201 closed this 1 year ago)

harish0201 commented 1 year ago

Hi!

Thank you for the wonderful documentation and the tool! Please excuse me if the question seems stupid.

We are trying to train the model on our own datasets (tumor-normal pairs, in canine) and were wondering whether the truth VCFs need to be somatic calls.

We do have a germline VCF, which we used to recalibrate the alignments. It is fairly extensive, with samples (>500) from across the globe.

Regards, Harish

litaifang commented 1 year ago

Generally speaking, in human cancer, even the cancers with the highest mutation burden have about 50K somatic mutations, compared with about 5 million germline SNPs. So when you do first-pass somatic mutation calling, it's quite common to have more false positives from germline variants than actual somatic mutations. To train a classifier for somatic mutations, the germline variants therefore need to be filtered out of the truth VCF; they are treated as false positives.
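To make the filtering idea concrete, here is a minimal sketch in plain Python. It is not part of SomaticSeq: the function names `variant_keys` and `somatic_truth` are hypothetical, the toy records are invented for illustration, and matching variants by exact CHROM/POS/REF/ALT is a simplifying assumption (real data would usually need normalization of indels and multi-allelic sites first).

```python
from typing import Iterable, Set, Tuple

# A variant is identified here by (CHROM, POS, REF, ALT); this exact-match
# key is an assumption made for the sketch, not a SomaticSeq convention.
VariantKey = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Set[VariantKey]:
    """Collect variant keys from VCF text lines, skipping header lines."""
    keys: Set[VariantKey] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        # Split multi-allelic records so each ALT allele gets its own key.
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def somatic_truth(candidates: Set[VariantKey],
                  germline: Set[VariantKey]) -> Set[VariantKey]:
    """Remove known germline variants from a candidate truth set."""
    return candidates - germline


# Toy records, invented for illustration only.
germline_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr2\t500\t.\tC\tT,G",
]
candidate_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",   # also in the germline set, so it is dropped
    "chr1\t200\t.\tT\tC",   # not in the germline set, so it is kept
    "chr2\t500\t.\tC\tG",   # matches one germline ALT allele, so it is dropped
]

truth = somatic_truth(variant_keys(candidate_vcf), variant_keys(germline_vcf))
print(sorted(truth))
```

Anything removed this way would then count as a false positive during training, which is the labeling described above.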