bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

model overfitting #91

Closed: gianfilippo closed 4 years ago

gianfilippo commented 4 years ago

Hi,

I looked at some basic result statistics, and the estimated Ti/Tv is a little above 1, well below the ~3.0 that I understand is expected for WES. That would suggest that I have plenty of FPs.

Going back to the model, I see that the Train Error is always 0. I am attaching one example output file for one of the samples.

Is the model overfitting? What errors should I expect?

Thanks

Attachment: model.out.txt

litaifang commented 4 years ago

Overfitting is certainly possible with machine learning. For AdaBoost, however, although training builds up to 500 trees, the predictor script (https://github.com/bioinform/somaticseq/blob/master/r_scripts/ada_model_predictor.R) uses only the first 300 trees (n.iter=300) to reduce the likelihood of overfitting.
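To illustrate the idea of scoring with only the first few boosting iterations, here is a minimal Python sketch. It is not SomaticSeq's R code: the ensemble, the feature names (`depth`, `vaf`, `mapq`), the thresholds, and the weights are all hypothetical, and the function only shows how truncating a weighted vote at `n_iter` learners changes the score.

```python
def predict_score(ensemble, x, n_iter=None):
    """Weighted vote of the first n_iter weak learners (all of them if None)."""
    learners = ensemble if n_iter is None else ensemble[:n_iter]
    return sum(alpha * stump(x) for alpha, stump in learners)


# Hypothetical toy ensemble: (weight, stump) pairs in training order.
# Each stump returns +1 (looks like a true call) or -1 (looks like a false call).
ensemble = [
    (1.0, lambda x: 1 if x["depth"] > 10 else -1),
    (0.5, lambda x: 1 if x["vaf"] > 0.05 else -1),
    (0.25, lambda x: 1 if x["mapq"] > 30 else -1),
]

site = {"depth": 25, "vaf": 0.02, "mapq": 40}  # made-up candidate site

full_score = predict_score(ensemble, site)                 # uses all 3 learners
truncated_score = predict_score(ensemble, site, n_iter=2)  # uses the first 2 only
print(full_score, truncated_score)
```

Later boosting iterations tend to fit the hardest training examples, so dropping them at prediction time is one simple way to limit how much the final score depends on them.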

The Ti/Tv ratio is mostly for germline SNPs. For somatic mutations, the mutation profiles depend on the tumor type. It can be ~1 for many tumors: https://www.biostars.org/p/104473/ and https://www.nature.com/articles/nature12477.
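For reference, a Ti/Tv ratio can be computed from a call set along these lines. This is a minimal sketch that assumes the calls are already available as (REF, ALT) pairs; the function name and the example calls are made up for illustration, and real pipelines would read these from a VCF.

```python
# Transitions are A<->G (purine to purine) and C<->T (pyrimidine to pyrimidine).
# Every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snvs):
    """Return (n_transitions, n_transversions, ratio) for (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, multi-base variants, and non-ACGT alleles.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    ratio = ti / tv if tv else float("nan")
    return ti, tv, ratio


# Made-up example calls: two transitions, two transversions, one deletion (skipped).
calls = [("A", "G"), ("C", "T"), ("G", "T"), ("C", "A"), ("AT", "A")]
print(ti_tv_ratio(calls))
```

Because the expected ratio for somatic calls depends on the tumor's mutational profile, a value near 1 from a calculation like this is not by itself evidence of false positives.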

gianfilippo commented 4 years ago

Thanks for the links, very useful.