Machine learning for Project Rephetio

Systematic predictions of whether a compound treats a disease using hetnet data integration.

This is the machine learning repository for Project Rephetio. The repository covers:

- extracting features from Hetionet v1.0 and its permuted derivatives.
- computing the performance of each metapath-based feature, as available in this interactive table.
- computing the prior probability of treatment via edge-swap permutation.
- fitting a regularized logistic regression model to predict the probability that each compound treats each disease.
- evaluating the performance of predictions on several catalogs of medical indications.
- computing, for each prediction, the contributions of specific metapaths and paths.
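
As a rough illustration of the edge-swap permutation step listed above, the following sketch shuffles a toy set of compound–disease treatment edges while preserving every node's degree, then estimates how often each pair appears as an edge across permutations. This is not the repository's implementation: the function names, the toy edges, and the parameters `n_perms` and `swaps_per_edge` are all assumptions made for illustration.

```python
import random
from collections import Counter


def edge_swap(edges, n_swaps, rng):
    """Degree-preserving shuffle of a bipartite edge list via edge swaps."""
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        # Pick two distinct edges and try to exchange their disease endpoints.
        i, j = rng.sample(range(len(edges)), 2)
        (c1, d1), (c2, d2) = edges[i], edges[j]
        new_1, new_2 = (c1, d2), (c2, d1)
        # Reject swaps that would duplicate an existing edge (this also
        # rejects pairs of edges that share a compound or a disease).
        if new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new_1, new_2}
        edges[i], edges[j] = new_1, new_2
    return edges


def prior_from_permutations(edges, n_perms=200, swaps_per_edge=10, seed=0):
    """Fraction of permuted networks in which each pair is an edge."""
    rng = random.Random(seed)
    counts = Counter()
    current = list(edges)
    for _ in range(n_perms):
        current = edge_swap(current, swaps_per_edge * len(current), rng)
        counts.update(current)
    return {pair: n / n_perms for pair, n in counts.items()}


# Toy treatment edges (hypothetical compounds A-C and diseases x-z).
toy_treatments = [("A", "x"), ("A", "y"), ("B", "x"), ("C", "z")]
prior = prior_from_permutations(toy_treatments)
for pair, probability in sorted(prior.items()):
    print(pair, round(probability, 3))
```

Because every swap keeps each compound's and each disease's number of treatment edges fixed, pairs of high-degree nodes tend to receive a higher estimated prior than pairs of low-degree nodes, which is the kind of degree effect a prior of this sort is meant to capture.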

For a comprehensive description of Project Rephetio, see:

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
eLife (2017-09-22) DOI: 10.7554/eLife.26726

The predictions from this repository are browsable at het.io/repurpose.

Execution and directories

The computations in this repository are performed by a series of Jupyter notebooks, using approximately the conda environment specified here. config.ini provides version information for external data dependencies.
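
A notebook could read version information from an INI file with Python's standard `configparser` module, roughly as sketched below. The section name `example-dependency` and the keys `repository` and `commit` are hypothetical placeholders; the actual layout of `config.ini` in this repository may differ.

```python
import configparser

# Hypothetical INI content, for illustration only. The real config.ini
# may use different sections and keys.
EXAMPLE_CONFIG = """
[example-dependency]
repository = example/repo
commit = 0123abc
"""


def load_versions(text):
    """Parse INI text into a plain {section: {key: value}} dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


versions = load_versions(EXAMPLE_CONFIG)
print(versions["example-dependency"]["commit"])
```

To read a file on disk instead of a string, `parser.read("config.ini")` can replace the `read_string` call.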

This repository is operated in the following order:

- summary: extract connected compounds and connected diseases as well as the gold standard of disease-modifying indications to be used throughout this repository.
- optimize: analyses for benchmarking and optimizing our Cypher queries.
- prior: compute the prior probability of treatment between each compound and disease.
- all-features: extract and transform features for all 1,206 metapaths on Hetionet v1.0 and the 5 permuted derivatives. For efficiency, only a subset of compound–disease pairs are analyzed. Assess the performance of each feature separately.
- validate: create indication sets to systematically evaluate the performance of predictions.
- prediction: extract and transform features for all 209,168 compound–disease pairs on pre-selected features. Predict the probability of treatment for each compound–disease pair. Export metapath and path contribution information.
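
The all-features step assesses each feature on its own. One way to picture that is to score every feature column separately against the treatment labels. The sketch below uses the area under the ROC curve (AUROC) as the metric, which is an assumption here since this README does not name one; the feature values and labels are invented toy data.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: one label per compound-disease pair (1 = treatment) and one
# invented value per pair for each hypothetical feature.
labels = [1, 1, 0, 0, 0]
features = {
    "feature_a": [0.9, 0.7, 0.4, 0.2, 0.1],
    "feature_b": [0.3, 0.1, 0.5, 0.3, 0.2],
}

performance = {name: auroc(values, labels) for name, values in features.items()}
for name, value in sorted(performance.items()):
    print(name, round(value, 3))
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation would be preferable for the number of compound–disease pairs handled in practice.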

Note, however, that because the repository structure evolved over time, it may not be possible to rerun all notebooks sequentially. Some notebooks may assume a previous version of the repository and therefore require reverting to a past commit. Furthermore, large datasets were not tracked until recently and may have to be regenerated. Git LFS is now used to track large files.

Hetnet & Neo4j server nomenclature

The hetnet nomenclature of this repository predates the naming and versioning system of Hetionet v1.0. Confusingly, the hetnet referred to in this repository as rephetio-v2.0 is more accurately hetionet-v1.0. Accordingly, rephetio-v2.0_perm-1 is hetionet-v1.0_perm-1. This repository queries Hetionet v1.0 through Neo4j servers residing in a local clone of dhimmel/integrate. Archives of these Neo4j database stores are available for the unpermuted and permuted hetnets, using the newer hetionet nomenclature.
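
The renaming described above amounts to swapping a prefix. A small helper along these lines could translate the names used in this repository into the newer hetionet nomenclature; the function name `to_hetionet_name` is hypothetical and is not part of this repository.

```python
OLD_PREFIX = "rephetio-v2.0"
NEW_PREFIX = "hetionet-v1.0"


def to_hetionet_name(name):
    """Map a rephetio-v2.0 hetnet name to its hetionet-v1.0 equivalent.

    Names that do not use the old prefix are returned unchanged.
    """
    if name == OLD_PREFIX or name.startswith(OLD_PREFIX + "_"):
        return NEW_PREFIX + name[len(OLD_PREFIX):]
    return name


print(to_hetionet_name("rephetio-v2.0"))
print(to_hetionet_name("rephetio-v2.0_perm-1"))
```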

Questions & Feedback

For questions or feedback related to the code or data in this repository, please use GitHub Issues. For other questions related to Project Rephetio, please comment on the relevant discussion or report section on Thinklab. Both venues support markdown formatting. To keep things organized, please create new issues/discussions unless the subject is directly related to previous content in an existing thread.

License

All original content in this repository is released under CC0 1.0. The repository includes Disease Ontology and DrugBank identifiers, which may impose additional reuse restrictions.
