bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

Pretrained Classifier #114

Closed: kokyriakidis closed this issue 1 year ago

kokyriakidis commented 2 years ago

Hi!

I want to get some things straight regarding how SomaticSeq works:

  1. I gather that SomaticSeq in consensus mode performs just a majority vote across the individual callers. Is no other filtering or scoring performed? Or is it possible to train a classifier from the output of SomaticSeq in consensus mode (e.g. SomaticSeq with multiple callers -> training on extracted features -> SomaticSeq in prediction mode)?

  2. I saw that I can use SomaticSeq in prediction mode given a pretrained model. Do you have a pretrained classifier file from SEQC2 and BamSurgeon? Can I use this pretrained classifier? Is it available?

I have a tumor only BAM file.

litaifang commented 2 years ago

1) Yes, consensus mode is just a majority vote to label "PASS" in the VCF file. Currently, the training is done on the .tsv file, which includes all the output regardless of whether the calls were majority voted or not. If you want to train with only calls that are majority voted, you'd need to manually trim the .tsv files and then manually invoke somatic_xgboost.py to train the classifiers.

2) I do not have a pre-trained model in the repo. The ones used for SEQC2 were on version 2.8.x+, using AdaBoost in R, which is kinda obsolete now. The classifiers depend on which callers were used to create the mutation candidates in the .tsv files. We currently recommend XGBoost (Python implementation).
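To illustrate the "manually trim the .tsv files" step, here is a minimal sketch and not SomaticSeq's own code. It assumes the .tsv has one indicator column per caller that is positive when that caller reported the site; the function name `keep_majority_voted`, the column names passed in, and the "at least half of the callers" threshold are all assumptions to be checked against your own .tsv header and settings.

```python
import csv
import io


def _voted(value):
    """Return True when an indicator cell parses as a positive number."""
    try:
        return float(value) > 0
    except ValueError:
        # Empty or non-numeric cells are treated as "no call".
        return False


def keep_majority_voted(tsv_in, tsv_out, caller_columns, min_votes=None):
    """Copy only rows flagged by enough callers; return how many were kept.

    caller_columns: names of the per-caller indicator columns (hypothetical
    here; read them from the header of your own .tsv).
    min_votes: assumed threshold, defaulting to half of the callers.
    """
    if min_votes is None:
        min_votes = len(caller_columns) / 2
    reader = csv.DictReader(tsv_in, delimiter="\t")
    writer = csv.DictWriter(
        tsv_out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        votes = sum(1 for col in caller_columns if _voted(row[col]))
        if votes >= min_votes:
            writer.writerow(row)
            kept += 1
    return kept


# Tiny in-memory demonstration with made-up columns and values.
demo_in = io.StringIO(
    "CHROM\tPOS\tif_A\tif_B\tif_C\n"
    "1\t100\t1\t1\t0\n"
    "1\t200\t0\t1\t0\n"
)
demo_out = io.StringIO()
n_kept = keep_majority_voted(demo_in, demo_out, ["if_A", "if_B", "if_C"])
print(n_kept)
print(demo_out.getvalue(), end="")
```

With real files, open the input and output with `open(path, newline="")` and pass the file objects in; the trimmed .tsv can then be given to somatic_xgboost.py for training.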

kokyriakidis commented 2 years ago

What's the best practice? Include all the output, or train with only the calls that are majority voted? What do you recommend?

Which tools do you recommend for oncopanel samples? Do you recommend using all of them, or only some of them? Do you have any hints?

litaifang commented 2 years ago

The "best practice" is to train with everything in the .tsv file, which is the default behavior of somaticseq_parallel.py. You can, however, train on multiple data sets manually by running somatic_xgboost.py train -out SNV.classifier -threads 8 -tsvs snvs_1st_samples.tsv snvs_2nd_samples.tsv ....
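If you script the multi-sample training, the command can be assembled from a list of .tsv files. This is only a sketch: the helper name `build_train_command` is hypothetical, and it uses just the flags quoted above (`train`, `-out`, `-threads`, `-tsvs`) without checking anything else about the somatic_xgboost.py interface.

```python
import shlex


def build_train_command(tsvs, out="SNV.classifier", threads=8):
    """Assemble the argument list for the training command quoted above."""
    if not tsvs:
        raise ValueError("at least one .tsv file is required")
    return [
        "somatic_xgboost.py", "train",
        "-out", out,
        "-threads", str(threads),
        "-tsvs", *tsvs,
    ]


cmd = build_train_command(["snvs_1st_samples.tsv", "snvs_2nd_samples.tsv"])
# Print the command for inspection; to actually run it, something like
# subprocess.run(cmd, check=True) would work if the script is on PATH.
print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell-quoting problems when file names contain spaces.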