Why train the model with only one data set?

Why not train the final model with all the data we know?

For the Project Rephetio study, we wanted to have some hold-out treatments for evaluating our final performance.

However, since then, I have been wanting to try out training on indications in clinical trials. If you exclude the disease-modifying treatments, there are 5,594 treatments in this clinical trials set. If you still want holdout testing data, you could set aside some proportion of these 5,594 pseudo-treatments. This approach could have several benefits compared to what we did in Rephetio:

- Greater number of positives, which could yield models that draw from more metapaths. If you remember from Figure 2, many of the metapaths through high-throughput/systematic edges were given zero coefficients in the logistic regression model despite showing predictive ability according to Δ AUROC. I think features with smaller effects might be retained by the model if the limiting sample size (that of treatments, i.e. positives) were increased.
- Training positives and negatives would never have treatment edges in the hetnet. This would help address edge-dropout contamination. In other words, I think it's best if you can train on compound-disease pairs without any direct connections in the network. This could also result in better models, because IIRC we struggled to avoid the deleterious effects of edge-dropout contamination.
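
To make the holdout idea concrete, here is a minimal sketch of setting aside a proportion of the clinical-trial pairs. It is not code from the repository: the `split_holdout` helper, the 20% fraction, and the placeholder compound/disease identifiers are all assumptions for illustration.

```python
import random


def split_holdout(pairs, holdout_fraction=0.2, seed=0):
    """Shuffle the pairs reproducibly and return (train, holdout).

    Hypothetical helper: the 20% default is an arbitrary choice.
    """
    if not 0 <= holdout_fraction < 1:
        raise ValueError("holdout_fraction must be in [0, 1)")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Placeholder identifiers standing in for the 5,594 clinical-trial
# compound-disease pairs (the real identifiers are not shown here).
pairs = [(f"compound_{i}", f"disease_{i % 50}") for i in range(5594)]

train, holdout = split_holdout(pairs)
print(len(train), len(holdout))  # 4475 1119
```

Fixing the seed means the same pairs land in the holdout set on every run, so the final evaluation stays comparable across model revisions.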

The big downside to training on clinical trials is that they are not all disease-modifying indications. However, I think it's reasonable to assume that clinical trials enrich for true treatments compared to random compound-disease pairs. My understanding is that classifiers like our regularized logistic regression will do just fine with imperfect positives and negatives, and that a larger sample size is probably more important than the perfection of the class labels.

dhimmel / learn
