dhimmel / learn

Machine learning and feature extraction for the Rephetio project
https://doi.org/10.15363/thinklab.d210
drug-repurposing edge-prediction logistic-regression machine-learning rephetio

Machine learning for Project Rephetio

Systematic predictions of whether a compound treats a disease using hetnet data integration.

This is the machine learning repository for Project Rephetio.

For a comprehensive description of Project Rephetio, see:

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
eLife (2017-09-22) DOI: 10.7554/eLife.26726

The predictions from this repository are browsable at het.io/repurpose.

Execution and directories

The computations in this repository are performed by a series of Jupyter notebooks, using approximately the conda environment specified here. config.ini provides version information for external data dependencies.
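As a minimal sketch of how version information in an INI file such as config.ini can be inspected from a notebook, the snippet below uses Python's standard `configparser` module. The section and key names in `EXAMPLE_CONFIG` are hypothetical placeholders, not the actual contents of this repository's config.ini.

```python
import configparser

# Hypothetical stand-in for config.ini: the section and key names below are
# invented for illustration and are not copied from the repository.
EXAMPLE_CONFIG = """
[hetnet]
version = example-version

[indications]
commit = example-commit
"""


def read_versions(text):
    """Return {section: {key: value}} for every section in an INI document."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


versions = read_versions(EXAMPLE_CONFIG)
for section, entries in versions.items():
    for key, value in entries.items():
        print(f"{section}.{key} = {value}")
```

To read a file on disk instead of a string, `parser.read("config.ini")` could replace `parser.read_string(text)`.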

The directories in this repository are run in the following order:

  1. summary: extract connected compounds and connected diseases as well as the gold standard of disease-modifying indications to be used throughout this repository.
  2. optimize: analyses for benchmarking and optimizing our Cypher queries.
  3. prior: compute the prior probability of treatment between each compound and disease.
  4. all-features: extract and transform features for all 1,206 metapaths on Hetionet v1.0 and the 5 permuted derivatives. For efficiency, only a subset of compound–disease pairs are analyzed. Assess the performance of each feature separately.
  5. validate: create indication sets to systematically evaluate the performance of predictions.
  6. prediction: extract and transform features for all 209,168 compound–disease pairs on pre-selected features. Predict the probability of treatment for each compound–disease pair. Export metapath and path contribution information.
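The 209,168 pairs in the prediction step are consistent with a full cross of 1,538 connected compounds and 136 connected diseases (1,538 × 136 = 209,168). Those two counts are taken from the eLife paper cited above rather than from this README, so treat them as an assumption here. The sketch below shows the enumeration on a toy scale; the identifiers are hypothetical placeholders.

```python
from itertools import product

# Hypothetical identifiers, for illustration only.
compounds = ["compound_a", "compound_b", "compound_c"]
diseases = ["disease_x", "disease_y"]


def all_pairs(compounds, diseases):
    """Return every (compound, disease) combination as a list of tuples."""
    return list(product(compounds, diseases))


pairs = all_pairs(compounds, diseases)
print(len(pairs), "pairs from", len(compounds), "compounds and", len(diseases), "diseases")

# Assumed counts (from the eLife paper, not this README) reproduce the
# number of pairs quoted for the prediction step.
assert 1538 * 136 == 209168
```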

Note, however, that because the repository structure evolved over time, it may not be possible to rerun all notebooks sequentially. Some notebooks may assume a previous version of the repository and therefore require reverting to a past commit. Furthermore, large datasets were not tracked until recently and may have to be regenerated. Git LFS is now used to track large files.
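Before attempting a rerun, it can help to list the notebooks in the directory order given above. The sketch below only lists files and does not execute anything. It assumes that notebooks sit directly inside each directory with an `.ipynb` extension and that alphabetical order within a directory is meaningful; neither is confirmed by this README, and `notebooks_in_order` is a hypothetical helper.

```python
from pathlib import Path

# Directory order as described in this README.
STAGES = ["summary", "optimize", "prior", "all-features", "validate", "prediction"]


def notebooks_in_order(repo_root):
    """List .ipynb files stage by stage, sorted by name within each stage."""
    root = Path(repo_root)
    ordered = []
    for stage in STAGES:
        stage_dir = root / stage
        if not stage_dir.is_dir():
            continue  # tolerate checkouts where a directory is missing
        ordered.extend(sorted(stage_dir.glob("*.ipynb")))
    return ordered


# Run from the root of a local clone to print the candidate order.
for path in notebooks_in_order("."):
    print(path)
```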

Hetnet & Neo4j server nomenclature

The hetnet nomenclature of this repository predates the naming and versioning system of Hetionet v1.0. Confusingly, the hetnet referred to in this repository as rephetio-v2.0 is more accurately hetionet-v1.0. Accordingly, rephetio-v2.0_perm-1 is hetionet-v1.0_perm-1. This repository queries Hetionet v1.0 through Neo4j servers residing in a local clone of dhimmel/integrate. Archives of these Neo4j database stores are available for the unpermuted and permuted hetnets, using the newer hetionet nomenclature.
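The renaming described above can be expressed as a small helper, which may be handy when matching file names from this repository against the archived database stores. This is a sketch: `to_hetionet_name` is a hypothetical function, and it assumes every permuted hetnet follows the same `_perm-N` suffix pattern as the `_perm-1` example.

```python
OLD_PREFIX = "rephetio-v2.0"  # name used inside this repository
NEW_PREFIX = "hetionet-v1.0"  # name under the newer nomenclature


def to_hetionet_name(name):
    """Translate a rephetio-v2.0 style name, keeping any suffix such as _perm-1."""
    if name == OLD_PREFIX or name.startswith(OLD_PREFIX + "_"):
        return NEW_PREFIX + name[len(OLD_PREFIX):]
    return name  # leave unrelated names untouched


print(to_hetionet_name("rephetio-v2.0"))         # hetionet-v1.0
print(to_hetionet_name("rephetio-v2.0_perm-1"))  # hetionet-v1.0_perm-1
```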

Questions & Feedback

For questions or feedback related to the code or data in this repository, please use GitHub Issues. For other questions related to Project Rephetio, please comment on the relevant discussion or report section on Thinklab. Both venues support markdown formatting. To keep things organized, please create new issues/discussions unless the subject is directly related to previous content in an existing thread.

License

All original content in this repository is released under CC0 1.0. The repository includes Disease Ontology and DrugBank identifiers, which may impose additional reuse restrictions.