dhimmel / learn

Machine learning and feature extraction for the Rephetio project
https://doi.org/10.15363/thinklab.d210

Trained logistic regression classifier for the Hetionet drug repurposing paper #5

Open dkoslicki opened 5 years ago

dkoslicki commented 5 years ago

Hello Sergio and Daniel,

My students and I came across your drug repurposing paper as we were putting together a manuscript on a similar topic as part of the NIH NCATS Translator project. In short, we've constructed a Neo4j database with 125K bioentities and 7.6M relationships and are using a node vectorization algorithm and random forests to predict possible drug repurposing targets. We wanted to compare our approach to the "metapath" approach that you've taken. We've found the het.io/repurpose website, but can't seem to find the code on your GitHub page that does the actual logistic regression classifications.

Do you think you could either point us to (or send us) the classifier code or the trained classifier so we can do this comparison?

dhimmel commented 5 years ago

Great to hear about your project and interest in comparing the edge prediction algorithm we used for Project Rephetio.

Sounds like you've already constructed your hetnet, so there are two main steps to make edge predictions.

  1. Compute features describing pairs of nodes along a given metapath. For this we recommend the degree-weighted path count (DWPC) metric. For Project Rephetio, we used a Cypher implementation of the DWPC. The queries are constructed in this notebook and executed in this notebook. The actual Cypher query for a specific metapath's DWPC is generated using the hetio.neo4j.construct_dwpc_query function. To use this method, you will probably want to encode your database's schema as a hetio.hetnet.MetaGraph object as per cell 2 here.

    However, computing DWPCs via Cypher can be slow. Matrix multiplication is much faster because it does not have to track the individual paths, only their running sum at each point along the metapath. Our new hetmatpy package has functions for computing DWPCs this way.

    How many node and edge types does your network have? What is the node type with the highest count of nodes? We've used hetmatpy on Gene by Gene matrices (up to 20,000 squared values). How many node pairs do you want to learn on and make predictions for? With this information I can give a better recommendation.

  2. Construct a regularized logistic regression model with node pairs as observations and DWPCs as features. For Project Rephetio, we fit this model with the R glmnet package in the 4-predictr.ipynb notebook. We also suggest including a prior probability for the predicted edges, generated using the XSwap algorithm. This prior probability is helpful in accounting for degree effects and may be interesting for you for other prediction methods as well.
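To make the matrix-multiplication idea in step 1 concrete, here is a minimal NumPy sketch. It is not the hetmatpy API: the function names, the toy adjacency matrices, and the damping exponent of 0.4 are all assumptions made for illustration. Each edge is down-weighted by the degrees of its two endpoints, and the weighted matrices are then multiplied along the metapath. Note that a plain matrix product counts walks, so it equals the DWPC only when the metapath never revisits a node type (as in Compound-binds-Gene-associates-Disease below); this sketch does not correct for repeated node types.

```python
import numpy as np


def _damped(degrees, damping):
    """Return degrees ** -damping, leaving 0 for nodes that have no edges."""
    scale = np.zeros_like(degrees)
    connected = degrees > 0
    scale[connected] = degrees[connected] ** -damping
    return scale


def degree_weight(adjacency, damping=0.4):
    """Scale every edge by (source degree * target degree) ** -damping."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_scale = _damped(adjacency.sum(axis=1), damping)  # source-node degrees
    col_scale = _damped(adjacency.sum(axis=0), damping)  # target-node degrees
    return adjacency * row_scale[:, None] * col_scale[None, :]


def dwpc_matrix(adjacency_matrices, damping=0.4):
    """Multiply degree-weighted adjacency matrices along a metapath.

    Valid as a DWPC only if no node type repeats along the metapath.
    """
    result = degree_weight(adjacency_matrices[0], damping)
    for adjacency in adjacency_matrices[1:]:
        result = result @ degree_weight(adjacency, damping)
    return result


# Hypothetical toy hetnet: 2 compounds, 3 genes, 2 diseases.
compound_binds_gene = np.array([[1, 1, 0],
                                [0, 1, 1]])
gene_associates_disease = np.array([[1, 0],
                                    [1, 1],
                                    [0, 1]])

# Rows are compounds, columns are diseases.
dwpc_scores = dwpc_matrix([compound_binds_gene, gene_associates_disease])
print(dwpc_scores)
```

Each row of `dwpc_scores` then supplies one feature value per compound-disease pair for this metapath; repeating the computation for other metapaths gives the remaining feature columns.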

If it is of any further help, the Project Rephetio algorithm has been used by an independent team of researchers for the preprint Time-resolved compound repositioning predictions on a text-mined knowledge network. I believe their implementation is available at mmayers12/hetnet-ml.
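For step 2, the following is a rough sketch of a regularized logistic regression over node pairs. It is not the Rephetio model: plain gradient descent with an L2 penalty stands in for glmnet's elastic net, and the feature values, labels, penalty strength, and function names are all hypothetical. The first column plays the role of the prior probability and the second of a single DWPC feature.

```python
import numpy as np


def standardize(features):
    """Center each column and scale it to unit variance."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_l2(features, labels, l2=0.1, learning_rate=0.5, n_iter=2000):
    """Fit logistic regression by gradient descent, penalizing weights with L2."""
    n_rows, n_cols = features.shape
    weights = np.zeros(n_cols)
    intercept = 0.0
    for _ in range(n_iter):
        error = sigmoid(features @ weights + intercept) - labels
        # Gradient of the mean log loss plus the L2 penalty (intercept unpenalized).
        weights -= learning_rate * (features.T @ error / n_rows + l2 * weights)
        intercept -= learning_rate * error.mean()
    return weights, intercept


# Hypothetical node pairs: columns are [prior probability, DWPC for one metapath].
raw_features = np.array([[0.05, 0.90],
                         [0.02, 0.33],
                         [0.10, 1.20],
                         [0.01, 0.00],
                         [0.03, 0.10],
                         [0.08, 0.95]])
labels = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # 1 = known edge, 0 = non-edge

features = standardize(raw_features)
weights, intercept = fit_logistic_l2(features, labels)
probabilities = sigmoid(features @ weights + intercept)
print(weights, intercept)
print(probabilities)
```

In practice you would choose the penalty strength by cross-validation and score node pairs that were held out of training, rather than the training pairs as done here.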