KarchinLab / 2020plus

Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests
http://2020plus.readthedocs.org
Apache License 2.0

Should I train a new model ? #19

Open Guo-Weihua opened 4 years ago

Guo-Weihua commented 4 years ago

Dear Collin, I want to run 2020plus on my own pan-cancer data, which contains no silent mutations (more than 130,000 mutations in total), to predict oncogenes and TSGs both pan-cancer and for specific cancer types. Should I train a new model on my data with `--config drop_silent="yes"` and then run predict, or just run pretrained_predict with your pre-trained 20/20+ classifiers and the same config? Thanks.

ctokheim commented 9 months ago

Ideally, one would train an entirely new model on data that excludes silent mutations and then apply it to additional data that also excludes them. In general, scores skew higher when data without silent mutations is scored using a model that was trained on data containing silent mutations. However, as you noticed from that option, a reasonable workaround is to adjust what counts as a significant score by accounting for the fact that silent mutations are not included in the Monte Carlo simulations. This should help reduce potential biases, but ideally you should check the p-values and see whether there is an artificially large number of significant results for your data. If that is the case, you may need to train a new model.
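
One way to do the p-value check described above is sketched below. This is not part of 2020plus. It assumes you have already pulled the per-gene p-values out of the prediction output into a plain Python list. The function names, the q-value threshold of 0.1, and the p-value cutoff of 0.05 are all illustrative assumptions. The sketch applies a Benjamini-Hochberg correction, counts the significant genes, and compares the fraction of small p-values with what a uniform null distribution would give.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues


def summarize_significance(pvalues, q_threshold=0.1, p_cutoff=0.05):
    """Summarize how many genes look significant (thresholds are assumptions)."""
    qvalues = benjamini_hochberg(pvalues)
    return {
        "n_genes": len(pvalues),
        "n_significant": sum(q <= q_threshold for q in qvalues),
        # Under a well-calibrated null, roughly p_cutoff of all genes
        # should have a p-value at or below p_cutoff.
        "observed_fraction_below_cutoff":
            sum(p <= p_cutoff for p in pvalues) / len(pvalues),
        "expected_fraction_under_null": p_cutoff,
    }


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    example = [0.5, 0.01, 0.8, 0.3]
    print(summarize_significance(example))
```

If the observed fraction is far above the expected fraction across many thousands of genes, or the number of significant genes is implausibly large for your cohort, that would suggest the bias described above and argue for training a new model.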