MaastrichtU-IDS / predict-drug-target

Using ESM2 protein embeddings and MolecularTransformer drug embeddings to train a linear classifier to predict potential drug-target interactions
https://predict-drug-target.137.120.31.160.nip.io
MIT License

💊🎯 Predict drug target interactions

This project uses ESM2 protein embeddings and MolecularTransformer drug embeddings to train a linear classifier that predicts potential drug-target interactions, where targets are proteins.
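To make the idea concrete, here is a minimal, dependency-free sketch of a linear classifier trained on concatenated drug and target vectors. It is an illustration only, not the project's training code: the 2-dimensional toy vectors stand in for real ESM2 and MolecularTransformer embeddings, and every name (`concat_pair`, `train_logistic`, `predict_score`, the entity IDs) is hypothetical.

```python
import math


def concat_pair(drug_vec, target_vec):
    """Build one feature vector for a drug-target pair by concatenation."""
    return list(drug_vec) + list(target_vec)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_score(weights, bias, x):
    """Linear score squashed to (0, 1); higher means 'more likely to interact'."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(features, labels, epochs=500, lr=0.5):
    """Fit a logistic regression with plain batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_score(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy stand-ins for real embeddings (assumption: real vectors are much larger).
drugs = {"drug_a": [1.0, 0.0], "drug_b": [0.0, 1.0]}
targets = {"target_x": [1.0, 0.0], "target_y": [0.0, 1.0]}
known_pairs = {("drug_a", "target_x"), ("drug_a", "target_y"), ("drug_b", "target_x")}

# Every drug-target combination becomes one training row; known pairs are positives.
pairs = [(d, t) for d in drugs for t in targets]
features = [concat_pair(drugs[d], targets[t]) for d, t in pairs]
labels = [1.0 if pair in known_pairs else 0.0 for pair in pairs]

weights, bias = train_logistic(features, labels)
scores = {pair: predict_score(weights, bias, x) for pair, x in zip(pairs, features)}
```

Concatenation keeps the model linear in both embeddings, which is cheap to train but cannot capture interactions that depend on a specific combination of drug and target features.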

Services deployed:

📥 Install

If you are not in a docker container, you might want to create and activate a local environment before installing the module:

python -m venv .venv
source .venv/bin/activate

Install the module:

pip install -e .

🍳 Prepare data and embeddings

Run the complete pipeline to download the data and generate embeddings for drugs and targets:

./prepare.sh
Or follow the pipeline step by step:

Query the Bio2RDF endpoint to get drugs and their SMILES, targets and their protein sequences, and the set of known drug-target pairs:

```bash
./get_bio2rdf_data.sh
```

Process the Bio2RDF data to generate the inputs needed for the two embedding methods:

```bash
python src/predict_drug_target/prepare.py
```

Install the ESM library:

```bash
pip install git+https://github.com/facebookresearch/esm.git
```

Generate the protein embeddings:

```bash
esm-extract esm2_t33_650M_UR50D data/download/drugbank_targets.fasta data/vectors/drugbank_targets_esm2_l33_mean --repr_layers 33 --include mean
```

Install the [Molecular Transformer Embeddings](https://github.com/mpcrlab/MolecularTransformerEmbeddings):

```bash
git clone https://github.com/mpcrlab/MolecularTransformerEmbeddings.git
cd MolecularTransformerEmbeddings
chmod +x download.sh
./download.sh
```

If running the download script fails with the error `bash: ./download.sh: /bin/bash^M: bad interpreter: No such file or directory`, run dos2unix on the script and try again.

Generate the drug embeddings:

```bash
python embed.py --data_path=../data/download/drugbank_smiles.txt
mv embeddings/drugbank_smiles.npz ../data/vectors/
cd ..
```
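The `--include mean` option in the protein embedding step requests one vector per protein, obtained by averaging the per-residue representations. The sketch below shows what that averaging means in plain Python. It is a simplified illustration with a hypothetical `mean_pool` helper and made-up numbers, not the ESM implementation.

```python
def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-size vector for the protein."""
    if not residue_vectors:
        raise ValueError("cannot pool an empty sequence")
    length = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[i] for vec in residue_vectors) / length for i in range(dim)]


# Three residues, each with a made-up 2-dimensional representation.
per_residue = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
protein_vector = mean_pool(per_residue)
```

Mean pooling gives every protein a vector of the same size regardless of sequence length, which is what a downstream classifier with fixed-size inputs needs.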

🏋️ Run training

To force the use of a specific GPU, set the environment variable CUDA_VISIBLE_DEVICES. Device indices start from 0, so with 3 GPUs you can choose between 0, 1 and 2:

export CUDA_VISIBLE_DEVICES=1
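The same selection can be made from Python, as long as the variable is set before any CUDA-using library initializes the GPU. The helper below is a hypothetical sketch (`select_gpu` is not part of this project) that also checks the index against the number of GPUs you say you have.

```python
import os


def select_gpu(index, gpu_count):
    """Expose a single GPU to CUDA; call this before importing a CUDA-using library."""
    if not 0 <= index < gpu_count:
        raise ValueError(f"GPU index {index} is outside 0..{gpu_count - 1}")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# With 3 GPUs the valid indices are 0, 1 and 2.
selected = select_gpu(1, gpu_count=3)
```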

Train the model:

python src/predict_drug_target/train.py

Results are written to the results/ folder, and the pickled model is saved to the models/ folder.
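For readers unfamiliar with pickled models, the following sketch shows the save and reload round trip using the standard `pickle` module. The dictionary is a stand-in for a trained classifier and the file name is hypothetical, since this README does not state the actual model type or file name.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained classifier (assumption: the real object differs).
model = {"weights": [0.5, -0.25], "bias": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name inside a models/ folder.
    model_path = Path(tmp) / "models" / "example_model.pkl"
    model_path.parent.mkdir(parents=True)

    # Save the model after training.
    with model_path.open("wb") as f:
        pickle.dump(model, f)

    # Reload it later, for example before running predictions.
    with model_path.open("rb") as f:
        restored = pickle.load(f)
```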

🔮 Get predictions

Run the prediction workflow for 2 entities:

python src/predict_drug_target/predict.py

Users provide drugs and targets using their CHEMBL or Ensembl IDs. The script tests every provided drug against every provided target and returns a prediction score for each drug-target pair, indicating how confident the model is that the drug interacts with the target.
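The all-against-all behaviour described above can be sketched as a Cartesian product over the two ID lists. This is an illustration under stated assumptions: `score_pair` is a placeholder that returns a constant, `predict_all_pairs` is a hypothetical name, and the IDs are dummy strings rather than real identifiers.

```python
from itertools import product


def score_pair(drug_id, target_id):
    """Placeholder scorer; a real one would embed both entities and query the model."""
    return 0.5


def predict_all_pairs(drug_ids, target_ids, scorer=score_pair):
    """Score every drug against every target, one result per pair."""
    return [
        {"drug": drug, "target": target, "score": scorer(drug, target)}
        for drug, target in product(drug_ids, target_ids)
    ]


# Dummy identifiers for illustration only.
drug_ids = ["DRUG_ID_1", "DRUG_ID_2"]
target_ids = ["TARGET_ID_1", "TARGET_ID_2", "TARGET_ID_3"]
results = predict_all_pairs(drug_ids, target_ids)
```

The number of results grows as the product of the two list lengths, so 2 drugs and 3 targets yield 6 scored pairs.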

✅ Run tests

Run the code formatting (black and ruff):

hatch run fmt

Run the tests (this requires running the training first to generate the model):

pytest
# Or
hatch run test

Compile the requirements.txt file with pinned versions:

hatch run requirements

🐳 Deployment

Deployment uses docker compose. Run the training first to generate the model.

Deploy the API

Deploy the TRAPI endpoint on a GPU server:

docker compose up -d --build --force-recreate

Deploy the vector db

The vectordb is used to store embeddings for the entities and make querying faster. It is currently hosted on a server.

To run the vector database locally, edit the host in the src/predict.py script and use the docker-compose.yml and config files from the vectordb folder (make changes as needed):

cd vectordb
docker compose up -d
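The reason storing embeddings speeds up querying is that an embedding only has to be computed once per entity. The in-memory sketch below illustrates that lookup-before-compute pattern. It is not the project's vector database client: `EmbeddingCache` and `fake_embed` are hypothetical, and a real deployment would persist vectors in the database instead of a Python dictionary.

```python
class EmbeddingCache:
    """Return a stored embedding when available, compute and store it otherwise."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._store = {}
        self.misses = 0  # counts how often the expensive computation ran

    def get(self, entity_id):
        if entity_id not in self._store:
            self.misses += 1
            self._store[entity_id] = self._compute_fn(entity_id)
        return self._store[entity_id]


def fake_embed(entity_id):
    """Stand-in for an expensive embedding model call."""
    return [float(len(entity_id)), 1.0]


cache = EmbeddingCache(fake_embed)
first = cache.get("ENTITY_1")   # computed and stored
second = cache.get("ENTITY_1")  # served from the store, no recomputation
```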

Which vector db?

Vector databases are the new hot thing in the database world: databases for "modern" AI, built to store and query embeddings.

There are a few solutions, more or less mature. Here are the runners-up:


☑️ TODO