dhimmel / learn

Machine learning and feature extraction for the Rephetio project
https://doi.org/10.15363/thinklab.d210

Why train the model with only one data set? #8

Open lingling93 opened 4 years ago

lingling93 commented 4 years ago

Hi Daniel:

I'm wondering: you have four valid data sets that contain drug-disease pairs, so why not train the final model with all the data we know? Do you think that is a good idea?

Lingling

dhimmel commented 4 years ago

> why not train the final model with all the data we know?

For the Project Rephetio study, we wanted to have some hold-out treatments for evaluating our final performance.

However, since then, I have been wanting to try out training on indications in clinical trials. If you exclude the disease-modifying treatments, there are 5,594 treatments in this clinical trials set. If you still want holdout testing data, you could set aside some proportion of these 5,594 pseudo-treatments. This approach could have several benefits compared to what we did in Rephetio:

The big downside to training on clinical trials is that they are not all disease-modifying indications. However, I think it's reasonable to assume that clinical trials enrich for true treatments compared to random compound-disease pairs. My understanding is that classifiers like our regularized logistic regression will do just fine with imperfect positives and negatives, and that a larger sample size is probably more important than the perfection of the class labels.
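
To make the holdout idea concrete, here is a minimal sketch of setting aside a proportion of the clinical-trial pairs for testing. It is not from the Rephetio code: the function name `split_holdout`, the 20% fraction, the fixed seed, and the placeholder compound and disease identifiers are all assumptions for illustration.

```python
import random


def split_holdout(pairs, holdout_fraction=0.2, seed=0):
    """Shuffle compound-disease pairs and set aside a fraction for testing.

    Hypothetical helper: the name and defaults are illustrative only.
    Returns (train, holdout) as two disjoint lists.
    """
    if not 0 <= holdout_fraction <= 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Placeholder identifiers standing in for the 5,594 clinical-trial pairs;
# real compound and disease identifiers would go here instead.
pairs = [(f"compound_{i}", f"disease_{i}") for i in range(5594)]
train, holdout = split_holdout(pairs)
print(len(train), len(holdout))
```

The training portion would then serve as positives for the classifier, and the held-out pairs would only be used to evaluate the final predictions. If some compounds or diseases appear in many pairs, a split grouped by compound or disease might be a stricter test than this purely random one.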