I am a data scientist eager to explore the world of open source. I'm from Nigeria. I'd never worked on an open-source project before; I found it confusing and complicated. My friend, Love, encouraged me to give it a try. She introduced me to Outreachy and told me she learned a lot from participating in Outreachy last year. That's why I joined Outreachy. The more I learn about open source, the more enthusiastic I get. I want to contribute my skills and learn from the experience.
I am proficient in Python, R, SQL, Git, Scikit-Learn, TensorFlow, Keras, etc. I’m currently learning MLOps. I want to learn more about model deployment and monitoring.
Why Ersilia?
One of my favourite applications of AI is in the Healthcare industry. I’m interested in using AI to save lives. I love the work Ersilia is doing, using AI to discover drugs against neglected infectious diseases, and making the process easy for scientists to use these models in their research. I believe this work will be useful to a lot of people, especially Africans because we suffer the impact of diseases a lot. Like in the case of Ebola and COVID-19, our government couldn't handle it well. This is what inspires me to apply AI in the healthcare industry.
I have some experience in doing cancer research with genome data. In my internship as a machine learning engineer at Rayca Precision, I got to work on using machine learning to fight cancer using patients’ gene expression matrix data. Rayca Precision is a startup company dedicated to accelerating drug discovery, reshaping precision oncology, and elevating the understanding of intricate biological systems.
I worked on different projects including classifying the cancer histological types, detecting keratinizing squamous cell carcinoma, predicting lymphovascular invasion, etc. I also got to read research papers and try to implement state-of-the-art models from the research papers. At first, it was challenging because I didn't have any background in oncology. But I put in the work to get the job done. I was able to help the company set up and run an open-source model. I found this work exciting and fulfilling. I would love to contribute to a team dedicated to impacting biomedical research.
As an undergrad, my lecturers and professors gave me an opportunity to collaborate with the Computer Engineering Master's and PhD students at the University of Uyo, to teach and help them implement machine learning research projects. This involved reading a lot of machine learning research papers to figure out how to improve existing models. I got to apply machine learning to different fields. I enjoy seeing how different people use machine learning in research. This will be a great opportunity for me to review biomedical-related research papers.
In my MLOps journey, one of the skills I want to learn is Docker. I see that it is a major tool that Ersilia uses. Working with Ersilia will not only help me learn docker but also apply it in real-world projects.
This one is a bit personal. My friend and my mom depend on drugs for their health. I know there are millions of people around the world who need drugs to survive. Ersilia is helping scientists make the drug discovery process faster and easier. I would love to help out any way I can.
Participating in this internship will definitely help advance my career.
After this internship, I plan to continue contributing to Open-Source projects with the skills and experience I gain here.
I am grateful for the opportunity to participate in Outreachy. I will do my best to make impactful contributions.
Thank you for the updates @Inyrkz you can get started with week 2 tasks.
Alright, I'll get started.
I picked the first model, NCATS Rat Liver Microsomal Stability.
When I was a child, I heard that scientists usually test drugs on animals first, to check if they are safe, before testing them on humans. When I saw Rat in the project title, I was curious.
After reading the research paper by the authors, Vishal et al. (here), I learned that they were applying machine learning to predict compounds' stability in rat liver microsomes. It's a classification problem where the model predicts whether a compound belongs to the Stable or Unstable class.
They used the Scikit-Learn library in Python and the 5-fold cross-validation technique for evaluation. They used a random forest classifier, artificial neural networks, a graph convolutional neural network (I've never used this before; it piqued my interest), and a recurrent neural network for training.
I forked their repo and cloned it using the command git clone --recursive https://github.com/ncats/ncats-adme.git
They mentioned that the `--recursive` flag should be used when cloning the repo.
Setting up the project requires anaconda or miniconda. I have anaconda set up on my system.
I opened my Terminal on Ubuntu. The instructions said to navigate to the folder ADME_RLM and then into the server directory, but I didn't find an ADME_RLM directory, only the server directory.
cd ncats-adme
cd server
I created a virtual environment and installed the packages listed in the environment.yml file.
conda env create --prefix ./env -f environment.yml
I activated the virtual environment with the code below.
conda activate ./env
I ran the application with the code
python app.py
Then I opened it in Google Chrome by browsing to http://127.0.0.1:5000/
I clicked on Predict and uploaded the eml_canonical.csv test file.
For the options to choose the models for prediction, I selected only RLM stability and HLC stability, which represent Rat Liver Microsomal Stability and Human Liver Cytosolic Stability. Then I clicked on the Process file button.
It displayed the predictions of the Rat Liver Microsomal Stability and Human Liver Cytosolic Stability for all the records in the file.
Images showing prediction results
The prediction result has three columns: the molecule (showing the molecule diagram), the predicted class (0 or 1, with the confidence of the prediction), and the prediction.
Class 0 represents stable: the molecule is stable. Class 1 represents unstable: the molecule is unstable. From the first image, the first prediction shows 0 (0.95), meaning the model is 95% confident that the molecule is stable.
I ran predictions on the Human Liver Cytosolic Stability too because I was curious. I wanted to see if the compounds that are stable in rats are also stable in humans. It turns out the second compound was stable in rats but unstable in humans. Also, the model's confidence for the Human Liver Cytosolic Stability is much lower than that of the Rat Liver Microsomal Stability.
I navigated to the Ersilia Model Hub and clicked on the Microsomal stability tab. It narrowed the search down to Human Liver Microsomal Stability and Rat liver microsomal stability. I clicked on Rat liver microsomal stability and was redirected to the model's Ersilia GitHub repo. I read the README.md in the repo and clicked on the DockerHub link.
From the GitHub repo, I found the Ersilia identifier of the model: eos5505.
I opened a terminal and ran the code below to fetch the model.
ersilia -v fetch eos5505
I got this error.
Traceback (most recent call last):
File "/home/affiah/anaconda3/lib/python3.11/site-packages/docker/api/client.py", line 214, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/affiah/anaconda3/lib/python3.11/site-packages/docker/api/daemon.py", line 181, in version
return self._result(self._get(url), json=True)
^^^^^^^^^^^^^^
...
File "/home/affiah/anaconda3/lib/python3.11/site-packages/docker/api/client.py", line 221, in _retrieve_server_version
raise DockerException(
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
Based on the discussion in the Slack channel, I decided to use a VPN. I used the ProtonVPN, and the model fetching was successful.
Output:
23:49:41 | DEBUG | Schema: {'input': {'key': {'type': 'string'}, 'input': {'type': 'string'}, 'text': {'type': 'string'}}, 'output': {'outcome': {'type': 'numeric_array', 'shape': (1,), 'meta': ['rlm_proba1']}}}
23:49:41 | DEBUG | Done with the schema!
23:49:41 | DEBUG | This is the schema {'input': {'key': {'type': 'string'}, 'input': {'type': 'string'}, 'text': {'type': 'string'}}, 'output': {'outcome': {'type': 'numeric_array', 'shape': (1,), 'meta': ['rlm_proba1']}}}
23:49:41 | DEBUG | API schema saved at /home/affiah/eos/dest/eos5505/api_schema.json
23:49:53 | DEBUG | Fetching eos5505 done in time: 0:19:30.772344s
23:49:53 | INFO | Fetching eos5505 done successfully: 0:19:30.772344
👍 Model eos5505 fetched successfully!
I served the model with the code.
ersilia serve eos5505
Output:
🚀 Serving model eos5505: ncats-rlm
URL: http://127.0.0.1:59227
PID: 39560
SRV: conda
👉 To run model:
- run
These APIs are also valid:
- predict
💁 Information:
- info
I ran the model using the code below.
ersilia api run -i ~/Desktop/eml_canonical.csv -o output.csv
I got this error:
File "/home/affiah/Downloads/ersilia/ersilia/core/model.py", line 353, in api
return self.api_task(
^^^^^^^^^^^^^^
File "/home/affiah/Downloads/ersilia/ersilia/core/model.py", line 368, in api_task
for r in result:
File "/home/affiah/Downloads/ersilia/ersilia/core/model.py", line 195, in _api_runner_iter
for result in api.post(input=input, output=output, batch_size=batch_size):
File "/home/affiah/Downloads/ersilia/ersilia/serve/api.py", line 319, in post
for res in self.post_unique_input(
File "/home/affiah/Downloads/ersilia/ersilia/serve/api.py", line 290, in post_unique_input
or not schema.is_h5_serializable(api_name=self.api_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/affiah/Downloads/ersilia/ersilia/serve/schema.py", line 91, in is_h5_serializable
schema = self.get_output_by_api(api_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/affiah/Downloads/ersilia/ersilia/serve/schema.py", line 88, in get_output_by_api
return self.schema[api_name]["output"]
~~~~~~~~~~~^^^^^^^^^^
KeyError: 'run'
I decided to just test on one of the observations with the code below.
ersilia run -i "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
Output:
{
"input": {
"key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
"input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
"text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
},
"output": {
"outcome": [
0.595
]
}
}
It works!
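The JSON output above can also be consumed programmatically. A minimal sketch using only the standard library (the variable names are mine; the values are copied from the response above):

```python
import json

# JSON response as printed by `ersilia run` above
response = """
{
    "input": {
        "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
        "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
        "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
    },
    "output": {
        "outcome": [0.595]
    }
}
"""

result = json.loads(response)
proba_unstable = result["output"]["outcome"][0]  # probability of class 1 (unstable)
print(proba_unstable)  # 0.595
```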
I just need to figure out how to make the CSV file run.
I tried the command ersilia run -i ~/Desktop/eml_canonical.csv -o output.csv and it worked!
Here's the data in the output.csv file:
key | input | rlm_proba1 |
---|---|---|
MCGSCOLBFJQGHM-SCZZXKLOSA-N | Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 | 0.049 |
GZOSMCIZMLWJML-VJLLXTKPSA-N | C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | 0.595 |
BZKPWHYZMXOIDC-UHFFFAOYSA-N | CC(=O)Nc1sc(nn1)S(=O)=O | 0 |
QTBSBXVTEAMEQO-UHFFFAOYSA-N | CC(O)=O | 0 |
PWKSKIMOESPYIA-BYPYZUCNSA-N | CC(=O)NC@@HC(O)=O | 0 |
BSYNRYMUTXBXSQ-UHFFFAOYSA-N | CC(=O)Oc1ccccc1C(O)=O | 0.13 |
MKUXAQIIEYXACX-UHFFFAOYSA-N | NC1=NC(=O)c2ncn(COCCO)c2N1 | 0.001 |
ASMXXROZKSBQIH-VITNCHFBSA-N | OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | 0.983 |
ULXXDDBFHOBEHA-CWDCEQMOSA-N | CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 | 0.305 |
HXHWSAZORRCQMX-UHFFFAOYSA-N | CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 | 0.542 |
OFCNXPDARWKPPY-UHFFFAOYSA-N | O=C1N=CN=C2NNC=C12 | 0 |
YVPYQUNUQOZFHG-UHFFFAOYSA-N | CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(O)=O)c1I | 0.003 |
LKCWBDHBTVXHDL-RMDFUYIESA-N | NCCC@HC(=O)N[C@@H]1CC@HC@@HC@H[C@H]1O[C@H]3OC@HC@@HC@H[C@H]3O | 0.003 |
XSDQTOBWRPYKKA-UHFFFAOYSA-N | NC(N)=NC(=O)c1nc(Cl)c(N)nc1N | 0.013 |
IYIKLHRQXLHMJQ-UHFFFAOYSA-N | CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3 | 0.954 |
Observations

- 0.049 is less than 0.5, the threshold. This matches the prediction of the original model.
- 0.595 is greater than 0.5. This also matches the prediction of the original model.

Ersilia's model predictions match the original model predictions.
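The mapping from Ersilia's rlm_proba1 column back to the class (confidence) format shown by the NCATS app can be sketched like this (the 0.5 threshold and class meanings are taken from the notes above; the helper name is mine):

```python
def to_class_and_confidence(rlm_proba1, threshold=0.5):
    """Turn the probability of the unstable class (1) into a
    (predicted class, confidence) pair like the app's `0 (0.95)`."""
    predicted = 1 if rlm_proba1 >= threshold else 0
    confidence = rlm_proba1 if predicted == 1 else 1 - rlm_proba1
    return predicted, round(confidence, 3)

# Values from the output.csv table above
print(to_class_and_confidence(0.049))  # (0, 0.951) -- stable, ~95% confidence
print(to_class_and_confidence(0.595))  # (1, 0.595) -- unstable
```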
I installed Docker when I set up the Ersilia Model Hub in the week 1 task. I installed Docker Desktop for Ubuntu from here. I created an account on Docker Hub using this link.
hi @Inyrkz I ran the application with python app.py but it's been loading various models for different predictions for a long time now. Hope I'm still on the right track.
Hi @julietowah, you are on the right track. It will take a while to download and load the models. I think the models are about 5GB in size. Just keep an eye on it. If it stops running without showing you the 127.0.0.1 link to navigate to, run the python app.py command again.
Once it's done, you'll see this link: http://127.0.0.1:5000/. Navigate to it in your browser and you'll be good to go.
thank you so much
You are welcome
Hi @Inyrkz thank you for the detailed updates, and all the debugging efforts. You can continue your week 3 tasks in this issue.
You're welcome. I'm always happy to help. I'll add my week 3 task.
Link to paper: here
Link to GitHub repo: here
Here’s a short story. My friend, Samuel, was sick. His doctor recommended some drugs. We went to find the drugs but most of the pharmacies around the area didn’t have them. We finally found the drugs at a pharmacy. The pharmacist, Idara, looked at the drug list and from the expression on her face, I knew these drugs were a big deal. They were drugs she couldn’t give out without a prescription. She had to fill out some forms first before she could sell the drugs.
She had a cool sense of humour. She didn’t want my friend, Samuel, to feel weird about buying those drugs. I remember at some point she called the drugs poison. This part got my attention. I wondered why a drug designed to help someone could be a poison. I did a little research about drugs and found out about the toxic properties of drugs. Some drugs are toxic which is why they have to be taken in the recommended dosage. Before a drug candidate can be approved it must be screened to make sure it is safe for consumption.
Task: Regression
Tag: Antimicrobial activity, Antiviral activity, Toxicity
Mode: Pretrained
Input Shape: Single
Output Shape: List
Input: Compound
Output: Score
Output Type: Float
People have good intentions when they make drugs, but sometimes the drugs they make can be toxic to people. This is why I’m also interested in this project about toxicity. It is about predicting the toxicity of substances. When scientists are researching drugs to make, they can use AI to check the toxicity of the substances they are using to make the drugs.
I noticed Ersilia has similar models for toxicity prediction, including Toxicity prediction across the Tox21 panel with semi-supervised learning, Toxicity and synthetic accessibility prediction, ToxCast toxicity panel, Toxicity at clinical trial stage, HepG2 Toxicity - MMV, and S2DV HepG2 toxicity. Adding this model to the collection would expand Ersilia's Hub and give researchers more options.
The authors call their model QuantitativeTox (that is a cool name). They trained it on four datasets: LD50, IGC50, LC50, and LC50-DM.
These datasets contain information about how toxic substances are. They trained their model using an ensemble of five different deep-learning models. In simpler terms, they combined five deep learning models into one.
It's like the concept of "two heads are better than one". This makes the model more robust and powerful. To verify the model's performance, they compared it with the best existing model, TopTox (another cool name). QuantitativeTox performed better than TopTox on three of the four datasets, showing that their technique beats the existing best model.
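The ensembling idea can be illustrated with a toy averaging sketch (the paper's actual ensemble of five deep-learning models may combine predictions differently; this just shows the concept, and the names are mine):

```python
def ensemble_predict(models, x):
    """Average the predictions of several models for one input."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)

# Five toy "models", each a simple function standing in for a network.
# Their individual biases cancel out in the average.
models = [lambda x, b=b: x + b for b in (-2, -1, 0, 1, 2)]
print(ensemble_predict(models, 2.0))  # 2.0
```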
Researchers can use this model for four kinds of prediction: LD50, IGC50, LC50, and LC50-DM. I did some research to find the meaning of these terms.
LD50 means lethal dose for 50% of the population. The lower the LD50 value, the more toxic the substance is considered.
From the research paper, I learned that the LD50 dataset gives information about the amount of chemical substance needed to cause death in 50% of a group of rats when the chemical is given to them orally (like swallowing it). Usually, when a chemical is taken orally, it causes less harm compared to when it's injected directly into the bloodstream. The LD50 dataset helps us understand how toxic a substance is when ingested orally.
IGC50 means Inhibitory Growth Concentration for 50%. That is the concentration of an antimicrobial agent, e.g. an antibiotic, required to inhibit the growth of a culture by 50%; it measures toxicity to microorganisms. The lower the IGC50 value, the more potent the substance. From the research paper, the IGC50 dataset shows the concentration of a chemical compound needed to arrest the growth of Tetrahymena pyriformis after 40 hours of exposure.
LC50 means Lethal Concentration for 50% of the population. It is often used for gases or airborne particles: the concentration of a substance in the air that, when inhaled by test subjects, results in the death of 50% of the subjects. Here, the dataset gives the toxicity of a given compound to the fathead minnow, a species of temperate freshwater fish, after 96 hours of exposure.
LC50-DM is similar to LC50. The dataset gives the concentration of a compound in water, in milligrams per litre, causing 50% of a Daphnia magna population to die after 48 hours.
This model isn’t just limited to predicting the toxicity of drug candidates. It also helps predict the toxicity of substances to microbes like bacteria.
From the units of the toxicity measure, I can tell that it is a regression problem. The original datasets are found in Ecotox and chemidheavy.
The models were evaluated using three metrics: R squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
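The three metrics are easy to compute from scratch; a standard-library sketch (function names and the toy values are mine, not from the paper):

```python
import math

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy true/predicted toxicity values
y_true = [2.7, 1.3, 5.0, 2.4]
y_pred = [2.5, 1.5, 4.8, 2.6]
print(round(mae(y_true, y_pred), 3), round(rmse(y_true, y_pred), 3))
```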
I love how the model usage is documented in their GitHub repo here.
I would start by cloning the repo.
git clone https://github.com/Abdulk084/QuantitativeTox
Navigate into the repository.
cd QuantitativeTox
The model was tested on Ubuntu 20.04 with Python 3.7.7. I use the Ubuntu 22.04 operating system, so this won’t be an issue. They also use conda.
The next step is to restore the environment. The environment.yml file is available for this.
conda env create -f environment.yml
I activate the virtual environment using the command below.
conda activate qtox
Install PyBioMed.
cd PyBioMed
python setup.py install
cd ..
After installing QuantitativeTox, it can be tested on the four tasks: LD50, IGC50, LC50, and LC50-DM.
To test on the LD50 task, run the commands below.
cd LD50
python LD50_test.py
The output is a CSV file named LD50_test_results.csv.
To test on the IGC50 task, run the commands below.
cd ..
cd IGC50
python IGC50_test.py
The output is a CSV file named IGC50_test_results.csv.
To test on the LC50 task, run the commands below.
cd ..
cd LC50
python LC50_test.py
The output is a CSV file named LC50_test_results.csv.
To test on the LC50DM task, run the commands below.
cd ..
cd LC50DM
python LC50DM_test.py
The output is a CSV file named LC50DM_test_results.csv.
Here’s a sample of what the output file looks like.
pred_test_ext_stack_load_IGC50 | test_ext_IGC50_meta_r2 | test_ext_IGC50_meta_mae | test_ext_IGC50_meta_rmse |
---|---|---|---|
2.7041178 | 0.8611876765738696 | 0.26909787510227223 | 0.3659300114316411 |
1.3349031 | | | |
4.9750233 | | | |
2.391333 | | | |
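Based on the sample above, the results file seems to hold per-molecule predictions in the first column and dataset-level metrics only on the first row. A sketch of how such a file could be read (the layout interpretation is mine; the column names and values come from the sample):

```python
import csv
import io

# Inline copy of the sample IGC50 results shown above
raw = """pred_test_ext_stack_load_IGC50,test_ext_IGC50_meta_r2,test_ext_IGC50_meta_mae,test_ext_IGC50_meta_rmse
2.7041178,0.8611876765738696,0.26909787510227223,0.3659300114316411
1.3349031,,,
4.9750233,,,
2.391333,,,
"""

rows = list(csv.DictReader(io.StringIO(raw)))
predictions = [float(r["pred_test_ext_stack_load_IGC50"]) for r in rows]
r2 = float(rows[0]["test_ext_IGC50_meta_r2"])  # metrics live on the first row only
print(len(predictions), round(r2, 3))  # 4 0.861
```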
I didn't see any information on the structure of the input file, so I dug a little deeper. In the research paper, I saw that they used the preprocessed train and test sets from TopTox, which are pairs of SMILES strings and toxicity measures. When I opened the LD50_test.py Python script, I found out they used TensorFlow for their implementation. TensorFlow is my favourite deep learning framework. The input file is also a CSV file, named external_test.csv.
The model checkpoints are provided.
The research paper can be found here. The GitHub repo is here
I've always wondered how a drug works. How is it that a drug you take for a headache, like paracetamol, travels from your stomach to your head to cure the headache? In my biology class, we were taught about the digestive system and how the body digests food. But no one mentioned anything about drugs :eyes: .
I keep learning new things as I research.
Fun fact: Drug discovery is an expensive process. It costs around 1 billion dollars to make a single drug. It can take up to 10 years of development and testing before a drug can be FDA-approved.
Task: Regression
Tag: Drug-likeness, Molecular weight, Permeability, Similarity, Synthetic accessibility
Mode: Pretrained (the model can also be retrained)
Input Shape: Pair of lists
Output Shape: List
Input: Compound
Output: Descriptor
Output Type: Float
Knowing how a drug molecule interacts with a specific protein is a big challenge in drug discovery. This paper introduces EquiBind, a method to predict the binding of molecules to their target proteins, including the location and orientation of the binding. It also focuses on the speed of predicting drug binding structure, as fast models help with fast virtual screening or drug engineering. EquiBind is really fast and it is better than traditional standards in terms of quality.
A major problem this paper addresses is understanding how drug-like molecules (ligands) interact and form complexes (structures formed from combining molecules) with target proteins (receptors) – drug binding – which is a requirement for virtual screening. Solving this problem will go a long way in drug discovery.
Figure 1. High-level overview of the structural drug binding problem tackled by EQUIBIND
Source: Research paper
From the image above, the process begins with a molecular representation (ligand) in the form of a graph and a 3D shape of a random molecule, generated by the program RDKit/ETKDG, when it's not attached to anything. This work only models the flexibility of the ligand and assumes that the protein is rigid.
After checking the Ersilia Model Hub, I found a limited number of models relating to the prediction of a drug-binding structure. I saw 3 models on drug-likeness and 7 models on similarity. This drug-binding prediction is useful in drug discovery. It will help scientists quickly identify potential drug candidates and how they interact with specific proteins.
Having more models relating to drug binding on Ersilia would go a long way in speeding up how long it takes to discover new drugs.
The authors used a new time-based dataset split and preprocessing pipeline for this project. They used the protein-ligand complexes from PDBBind. PDBBind is a subset of the Protein Data Bank (PDB) that provides 3D structures of individual proteins and complexes. The latest version, PDBBind v2020, contains 19,443 protein-ligand complexes with 3,890 unique receptors and 15,193 unique ligands.
Their test set contained complexes that were discovered in 2019 or after. The training set and validation set only used older complexes. For the data preprocessing, they dropped all complexes that couldn’t be processed by the RDKit library. The data was reduced from 19,443 protein-ligand complexes to 19,119 complexes. Each ligand and receptor was processed with OpenBabel.
A deep neural network algorithm was used to build EquiBind. The authors optimized the model using the Adam optimizer. They employed early stopping with a patience of 150 epochs based on the percentage of predicted validation set complexes with an RMSD better than 2A. The hidden dimension is (32, 64, 100). They used the following activation functions: Leaky-RELU, ReLU, SeLU. Dropout was applied (0, 0.05, 0.1, 0.2). They applied the following normalization technique: BatchNorm, LayerNorm, and GraphNorm.
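The early-stopping rule described above (stop once the validation score has not improved for 150 epochs) can be sketched generically; the training step is stubbed out, and the function names are mine:

```python
def train_with_early_stopping(epoch_metric, max_epochs=1000, patience=150):
    """epoch_metric(epoch) -> validation score, higher is better
    (e.g. fraction of complexes with ligand RMSD below 2 angstroms)."""
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        score = epoch_metric(epoch)  # stands in for one epoch of training + validation
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score

# Toy metric that improves until epoch 10 and then plateaus
best_epoch, best_score = train_with_early_stopping(
    lambda e: min(e, 10) / 10, max_epochs=500, patience=150)
print(best_epoch, best_score)  # 10 1.0
```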
git clone https://github.com/HannesStark/EquiBind
The processed dataset for the project is available on Zenodo. To use it, I can download it, unzip it, and put it in the data directory of the repo.

Inputs: The ligand files can be of the formats .mol2, .sdf, .pdbqt or .pdb, with names containing the string "ligand" (the ligand files should contain all hydrogens). The receptor files are of the format .pdb, with names containing the string "protein". For each complex we want to predict, we need a directory containing the ligand and receptor files, like this:
my_data_folder
└───name1
│ name1_protein.pdb
│ name1_ligand.sdf
└───name2
│ name2_protein.pdb
│ name2_ligand.mol2
I'd create a new environment with all required packages using environment.yml. Since I'll be using a CPU, I'll run conda env create -f environment_cpuonly.yml.
Then activate the virtual environment: conda activate equibind.
Packages Required: These are the model requirements.
python=3.7
pytorch 1.10
torchvision
cudatoolkit=10.2
torchaudio
dgl-cuda10.2
rdkit
openbabel
biopython
rdkit
biopandas
pot
dgllife
joblib
pyaml
icecream
matplotlib
tensorboard
In the config file configs_clean/inference.yml, I'd set the path to our input data folder: inference_path: path_to/my_data_folder. Then run:
python inference.py --config=configs_clean/inference.yml
Our results would be saved as .sdf files in the directory specified in the config file under output_directory: 'data/results/output', and as tensors at runs/flexible_self_docking/predictions_RDKitFalse.pt
We can also run inference for multiple ligands in the same .sdf file and a single receptor with the code.
python multiligand_inference.py -o path/to/output_directory -r path/to/receptor.pdb -l path/to/ligands.sdf
This runs EquiBind on every ligand in ligands.sdf against the protein in receptor.pdb. The outputs are 3 files in output_directory with the following names and contents:
- failed.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference failed in a way that was caught and handled.
- success.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference succeeded.
- output.sdf - contains the conformers produced by EquiBind in .sdf format.
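A small sketch of how those listing files could be post-processed, assuming each line holds an index and a name as described (the file contents below are made up for illustration):

```python
# Hypothetical contents of success.txt and failed.txt
success_txt = """0 ligand_a
1 ligand_b
3 ligand_d
"""
failed_txt = "2 ligand_c\n"

def parse_listing(text):
    """Parse 'index name' lines into (int, str) pairs."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            idx, name = line.split(maxsplit=1)
            entries.append((int(idx), name))
    return entries

succeeded = parse_listing(success_txt)
failed = parse_listing(failed_txt)
print(f"{len(succeeded)} succeeded, {len(failed)} failed")  # 3 succeeded, 1 failed
```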
The research on EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction
led me to this project. The authors of EquiBind also came up with DiffDock.
The research paper can be found here. The GitHub repo is here.
Task: Generative
Tag: Drug-likeness, ADME, Permeability, Similarity, Synthetic accessibility, Microsomal stability
Mode: Pretrained (the model can also be retrained)
Input Shape: Single
Output Shape: List
Input: Compound
Output: Descriptor
Output Type: Float
At first sight, the project topic looked weird. Then I did some research on molecular docking. I learned that it involves the simulation of how two or more molecules, typically a small ligand (such as a drug candidate) and a receptor (such as a protein), interact at the molecular level. It is used to predict and study the binding interactions between these molecules.
Image from Wikipedia
DiffDock is a state-of-the-art model for molecular docking. New deep learning methods that treat docking as a regression problem, when compared to traditional search-based methods, have a fast runtime, but not an improved accuracy. The creators of DiffDock frame molecular docking as a generative modelling problem. It is a diffusion generative model. DiffDock runs fast and gives confidence estimates with high accuracy.
I didn't see many models related to drug binding on the Ersilia Model Hub. I did find Estate Molecular Descriptors, Ersilia Compound Embeddings, Chemical Checker Signaturizer, Human Plasma Protein Binding (PPB) of Compounds, and Avalon fingerprint. This would be a good addition, plus it's state-of-the-art.
The DiffDock model can help scientists and researchers run simulations and learn how two or more molecules, such as a drug candidate and a receptor (protein) interact at the molecular level. (At this point, I already feel like a scientist)
Ersilia also focuses on Microsomal Stability (like the Rat liver microsomal stability model I worked on :wink:). Molecular docking can be useful in predicting the binding of a drug candidate to specific enzymes in the liver.
The authors used the molecular complexes in PDBBind that were extracted from the Protein Data Bank (PDB). They used the time-split of PDBBind with 17k complexes from 2018 or earlier for training/validation and 363 test structures from 2019 with no ligand overlap with the training complexes. The dataset is found on zenodo. The files were preprocessed with Open Babel. Then they used the reduce library to add potentially missing hydrogens, correct hydrogens, and correctly flip histidines.
They used convolutional networks. The architecture is broken into the embedding layer, the interaction layer, and the output layer.
git clone https://github.com/gcorso/DiffDock.git
conda create --name diffdock python=3.9
conda activate diffdock
conda install pytorch==1.11.0 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cu117.html
python -m pip install PyYAML scipy "networkx[default]" biopython rdkit-pypi e3nn spyrmsd pandas biopandas
Since I’m using a CPU, I’ll create a Conda environment.
conda create --name diffdock python=3.9
conda activate diffdock
Install PyTorch without CUDA (for CPU support):
conda install pytorch==1.11.0 cpuonly -c pytorch
Install the remaining Python packages with pip:
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cpu.html
python -m pip install PyYAML scipy "networkx[default]" biopython rdkit-pypi e3nn spyrmsd pandas biopandas
I’ll install ESM for both protein sequence embeddings and for the protein structure prediction (in case you only have the sequence of your target).
The OpenFold (and so ESMFold) requires a GPU. Since I don't have a GPU, I can still use DiffDock with existing protein structures. Another option for me would be to use Google Colaboratory or a Linux EC2 instance on AWS with a GPU.
DiffDock supports multiple input formats depending on whether you want to make predictions for a single molecule complex or for many at once.
The protein inputs need to be .pdb files or sequences that will be folded with ESMFold. The ligand input can either be a SMILES string or a file type that RDKit can read, like .sdf or .mol2.
- For a single complex: specify the protein with --protein_path protein.pdb or --protein_sequence GIQSYCTPPYSVLQDPPQPVV, and the ligand with --ligand ligand.sdf or --ligand "COc(cc1)ccc1C#N".
- For many complexes: create a CSV file with paths to protein and ligand files or SMILES. It contains as columns complex_name (name used to save predictions, can be left empty), protein_path (path to the .pdb file; if empty, the sequence is used), ligand_description (SMILES or file path), and protein_sequence (to fold with ESMFold in case protein_path is empty). An example .csv is at data/protein_ligand_example_csv.csv, and you would use it with --protein_ligand_csv protein_ligand_example_csv.csv.
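The many-complexes CSV described above can be generated with Python's csv module. The column names come from the documentation quoted above, but the complex names, paths, and sequences below are invented for illustration:

```python
import csv
import io

fieldnames = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]
rows = [
    # Complex given by a protein file and a SMILES ligand
    {"complex_name": "example1", "protein_path": "data/example1_protein.pdb",
     "ligand_description": "COc(cc1)ccc1C#N", "protein_sequence": ""},
    # Complex given by a sequence (to be folded with ESMFold) and a ligand file
    {"complex_name": "example2", "protein_path": "",
     "ligand_description": "data/example2_ligand.sdf",
     "protein_sequence": "GIQSYCTPPYSVLQDPPQPVV"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # the header row
```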
And you are ready to run inference:
python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
The DiffDock model can also be retrained.
@DhanshreeA I've added my week 3 tasks.
I came across another model while researching. I enjoyed reading the research paper. It was interesting. I learned about antimicrobial peptides.
The research paper is here.
The GitHub repo is here.
Antimicrobial peptides are molecules in our bodies that protect us from harmful microbes (bacteria, viruses, etc.). They are effective in fighting antibiotic-resistant bacteria. While these molecules can protect us from microorganisms, some of them are actually harmful to us.
This research work was published in November 2021. The authors used an updated dataset to train a machine learning model to predict the toxicity of antimicrobial peptides, applying feature selection to extract the key features behind that toxicity. After feature selection, the trained hybrid model achieved a recall of 87.6% and an F1-score of 84.9%.
Tag: 'cytotoxicity', 'antibacterial activity'
Task: Classification
Mode: Pretrained
Input: Compound
Input Shape: List
Output: List
Output Shape: List
Output Type: Boolean
The DBAASP dataset was used for this research project. The dataset gives access to the latest experimental data of antimicrobial peptides, antimicrobial activity and toxicity. The toxicity types included in the data are HC50, CC50, and MIC.
The authors' goal wasn't just to classify antimicrobial peptides as toxic or non-toxic. They also wanted to know the properties responsible for the toxicity, based on either the amino acid sequence of the peptides or their physico-chemical nature.
The Propy package was used to extract 1,541 features from the peptide sequence. They used two methods for feature selection: L1-SVM and tree-based feature selection with 5-fold cross-validation. They were able to reduce the features from 1,276 to 90.
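As a rough illustration of the L1-SVM half of that pipeline, here is a minimal sketch with scikit-learn on synthetic data (not the DBAASP peptide descriptors): an L1-penalized linear SVM drives the coefficients of uninformative features to exactly zero, and SelectFromModel keeps only the non-zero ones.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # 200 "peptides", 50 toy descriptors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy toxic / non-toxic labels

# L1 penalty requires dual=False for LinearSVC; a small C makes
# the solution sparser (more coefficients pushed to zero).
svm = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000).fit(X, y)
selector = SelectFromModel(svm, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

The paper combines this with tree-based selection under 5-fold cross-validation; this sketch shows only the L1-SVM step.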
The SVC (RBF), LinearSVC, Random Forest, KNN, and hybrid models were trained and optimized on the training data. GridSearchCV (10-fold) was used to select the best hyperparameters for each model. The models were evaluated using precision, recall, F1-score, AUC, and Hamming distance.
Researchers are designing antimicrobial peptides to make them more harmful to microorganisms and less harmful to human cells. This model can help them do that. It would make a good addition to Ersilia’s model catalog.
The following packages are required to run the model:
Requirements Version
scikit-learn 0.22
numpy 1.17.4
jupyter 1.0.0
jupyter-client 5.3.4
jupyter-console 6.0.0
jupyter-core 4.6.1
ipc 1.0
pandas 0.25.3
propy3 1.0.0a2
I'd create a new virtual environment with Conda and install the packages mentioned above.
From the repo, the model checkpoints are already available.
The ReadMe.txt file gives instructions on how to run the model.
1- Run the RunMe.ipynb Jupyter notebook file.
2- Put your peptide sequences as elements of the b list.
The input sample is shown below:
b=["GFVDFLKKVAGTIAN","FLGGLIKIAMICAVTKKC","AGCSGVAHTRFGSSACNPFGWK","KKGLAKKWAGLKLAGLA
3- Run the next cell. It contains this code:
%run -i toxicityCalculator
Random Forest Classifier
1 ['non-toxic']
2 ['toxic']
3 ['non-toxic']
4 ['non-toxic']
-------------------------
Support Vector Classifier
1 ['non-toxic']
2 ['toxic']
3 ['non-toxic']
4 ['non-toxic']
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application
Week 1 - Get to know the community
Install the Ersilia Model Hub and test the simplest model
I followed the instructions in this link to install the Ersilia Model Hub.
Since I'm using the Ubuntu operating system, I didn't have to use WSL.
Install Python and Conda
I already have Python and Conda installed on my system.
You can use this link to download Anaconda for Ubuntu. Installing Anaconda comes with a lot of packages and software, including Python and Conda. After downloading Anaconda, you can install it by opening a terminal in your downloads folder (where the file is) and using the code below.
Install GitHub CLI
I had an issue installing GitHub CLI with the command conda install gh -c conda-forge. There were conflicts with some of the packages in Anaconda. The issue was similar to this:
I was stuck here for a while. I checked online for solutions, but nothing helped. I realized it was because my version of Anaconda was outdated. Since I couldn't update it from the command line, I uninstalled it, then downloaded and installed the latest version, Anaconda3-2023.09-0-Linux-x86_64.
After installing the latest version of Anaconda, I tried installing GitHub CLI again. It was successful this time.
Install Docker
I followed the instructions here to install docker on my system.
Note: In step 3, replace the placeholders <version> and <arch> with the actual version and architecture to avoid errors. For example:
Install Ersilia
I ran this code to test that Ersilia works well.
This is the output I got, which shows it works well.
I've successfully installed the Ersilia Model Hub.