ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Iniabasi Affiah #841

Closed Inyrkz closed 8 months ago

Inyrkz commented 9 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

Week 1 - Get to know the community

Install the Ersilia Model Hub and test the simplest model

I followed the instructions in this link to install the Ersilia Model Hub.

Since I'm using the Ubuntu operating system, I didn't have to use WSL.

Install Python and Conda

I already have Python and Conda installed on my system.

You can use this link to download Anaconda for Ubuntu. Installing Anaconda comes with a lot of packages and software, including Python and Conda.

After downloading Anaconda, you can install it by opening a terminal in your Downloads folder (where the file is) and running the command below.

bash Anaconda3-2023.09-0-Linux-x86_64.sh

Install GitHub CLI

I had an issue installing GitHub CLI with the code conda install gh -c conda-forge.

There were conflicts with some of the packages in Anaconda. The issue was similar to this:

The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/osx-64::holoviews==1.15.0=py39hecd8cb5_0
  - defaults/noarch::babel==2.9.1=pyhd3eb1b0_0
  - defaults/osx-64::anaconda-project==0.11.1=py39hecd8cb5_0
  - defaults/osx-64::jupyterlab==3.4.4=py39hecd8cb5_0
  - defaults/osx-64::datashader==0.14.1=py39hecd8cb5_0
  - defaults/osx-64::anaconda==2022.10=py39_0
  - defaults/osx-64::hvplot==0.8.0=py39hecd8cb5_0
  - defaults/osx-64::bkcharts==0.2=py39hecd8cb5_1
  - defaults/osx-64::conda-build==3.22.0=py39hecd8cb5_0
  - defaults/osx-64::anaconda-navigator==2.4.0=py39hecd8cb5_0
  - defaults/osx-64::sphinx==5.0.2=py39hecd8cb5_0
  - defaults/osx-64::_ipyw_jlab_nb_ext_conf==0.1.0=py39hecd8cb5_1
  - defaults/osx-64::statsmodels==0.13.2=py39hca72f7f_0
  - defaults/osx-64::dask==2022.7.0=py39hecd8cb5_0
  - defaults/osx-64::anaconda-client==1.11.0=py39hecd8cb5_0
  - defaults/noarch::jupyterlab_server==2.10.3=pyhd3eb1b0_1
  - defaults/osx-64::numpydoc==1.4.0=py39hecd8cb5_0
  - defaults/noarch::intake==0.6.5=pyhd3eb1b0_0
  - defaults/osx-64::pandas==1.4.4=py39he9d5cce_0
  - defaults/osx-64::jupyter==1.0.0=py39hecd8cb5_8
  - defaults/osx-64::conda-repo-cli==1.0.20=py39hecd8cb5_0
  - defaults/noarch::seaborn==0.11.2=pyhd3eb1b0_0
  - defaults/osx-64::spyder==5.3.3=py39hecd8cb5_0
  - defaults/noarch::xarray==0.20.1=pyhd3eb1b0_1

I was stuck here for a while. I checked online for solutions, but nothing helped. I realized my version of Anaconda was outdated. Since I couldn't update it from the command line, I uninstalled it, then downloaded and installed the latest version, Anaconda3-2023.09-0-Linux-x86_64.

After installing the latest version of Anaconda, I tried installing GitHub CLI again. It was successful this time.

github cli login successful

Install Docker

I followed the instructions here to install Docker on my system.

  1. Set up Docker's package repository.
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
  2. Download the latest Docker Desktop DEB package.
  3. Install the package with apt.
sudo apt-get update
sudo apt-get install ./docker-desktop-<version>-<arch>.deb

Note: In step 3, replace the placeholders <version> and <arch> with the actual version and architecture to avoid errors. For example:

sudo apt-get update
sudo apt-get install ./docker-desktop-4.24.0-amd64.deb

Install Ersilia

I ran this code to test that Ersilia works well.

ersilia -v fetch eos3b5e
ersilia serve eos3b5e
ersilia -v api run -i "CCCC"

This is the output I got, which shows it works well.

Successfully Install the Ersilia Model Hub

I've successfully installed the Ersilia Model Hub.

Inyrkz commented 9 months ago

Motivational Statement

I am a data scientist eager to explore the world of open source. I'm from Nigeria. I’ve never worked on an open-source project before; I found it confusing and complicated. My friend, Love, encouraged me to give it a try. She introduced me to Outreachy and told me she learned a lot from participating in Outreachy last year. That’s why I joined Outreachy. The more I learn about open source, the more enthusiastic I get. I want to contribute my skills and learn from the experience.

I am proficient in Python, R, SQL, Git, Scikit-Learn, TensorFlow, Keras, etc. I’m currently learning MLOps. I want to learn more about model deployment and monitoring.

Why Ersilia?

  1. One of my favourite applications of AI is in the Healthcare industry. I’m interested in using AI to save lives. I love the work Ersilia is doing, using AI to discover drugs against neglected infectious diseases, and making the process easy for scientists to use these models in their research. I believe this work will be useful to a lot of people, especially Africans because we suffer the impact of diseases a lot. Like in the case of Ebola and COVID-19, our government couldn't handle it well. This is what inspires me to apply AI in the healthcare industry.

  2. I have some experience doing cancer research with genome data. During my internship as a machine learning engineer at Rayca Precision, I used machine learning to fight cancer with patients’ gene-expression data. Rayca Precision is a startup dedicated to accelerating drug discovery, reshaping precision oncology, and elevating the understanding of intricate biological systems.

I worked on different projects including classifying the cancer histological types, detecting keratinizing squamous cell carcinoma, predicting lymphovascular invasion, etc. I also got to read research papers and try to implement state-of-the-art models from the research papers. At first, it was challenging because I didn't have any background in oncology. But I put in the work to get the job done. I was able to help the company set up and run an open-source model. I found this work exciting and fulfilling. I would love to contribute to a team dedicated to impacting biomedical research.

  3. As an undergrad, my lecturers and professors gave me an opportunity to collaborate with the Computer Engineering Master's and PhD students at the University of Uyo, to teach and help them implement machine learning research projects. This involved reading a lot of machine learning research papers to figure out how to improve existing models. I got to apply machine learning to different fields. I enjoy seeing how different people use machine learning in research. This will be a great opportunity for me to review biomedical-related research papers.

  4. In my MLOps journey, one of the skills I want to learn is Docker. I see that it is a major tool that Ersilia uses. Working with Ersilia will not only help me learn Docker but also apply it in real-world projects.

  5. This one is a bit personal. My friend and my mom depend on drugs for their health. I know there are millions of people around the world who need drugs to survive. Ersilia is helping scientists make the drug discovery process faster and easier. I would love to help out any way I can.

Participating in this internship will definitely help advance my career.

After this internship, I plan to continue contributing to Open-Source projects with the skills and experience I gain here.

I am grateful for the opportunity to participate in Outreachy. I will do my best to make impactful contributions.

DhanshreeA commented 9 months ago

Thank you for the updates @Inyrkz you can get started with week 2 tasks.

Inyrkz commented 9 months ago

Alright, I'll get started.

Inyrkz commented 9 months ago

Week 2 - Install and run an ML model

Select a model from the suggested list

I picked the first model, NCATS Rat Liver Microsomal Stability. When I was a child, I heard how scientists usually test drugs on animals first, to check that they are safe, before testing them on humans. When I saw Rat in the project title, I was curious.

After reading the research paper by Vishal et al. (linked here), I learned that they were applying machine learning to predict compounds' stability in rat liver microsomes. It’s a classification problem: the model predicts whether a compound belongs to the Stable or Unstable class.

They used the Scikit-Learn library in Python and the 5-fold cross-validation technique for evaluation. They trained a random forest classifier, artificial neural networks, a graph convolutional neural network (I’d never used this before; it piqued my interest), and a recurrent neural network.

Install the model in your system

I forked their repo and cloned it using the command git clone --recursive https://github.com/ncats/ncats-adme.git

They mentioned that the --recursive flag should be used when cloning the repo.

Setting up the project required anaconda or miniconda. I have anaconda set up on my system.

I opened my terminal on Ubuntu. The instructions said to navigate to the ADME_RLM folder and then into the server directory, but I didn’t find an ADME_RLM directory, only the server directory.

cd ncats-adme
cd server

I created a virtual environment and installed the required packages in the environment.yml file.

conda env create --prefix ./env -f environment.yml

I activated the virtual environment with the code below.

conda activate ./env

Run predictions for the EML

I ran the application with the code

python app.py

Then I opened it in Google Chrome by browsing to http://127.0.0.1:5000/.

Testing RLM Stability model

It displayed the predictions of the Rat Liver Microsomal Stability and Human Liver Cytosolic Stability for all the records in the file.

RLM prediction

HLC prediction

Images showing prediction results

The prediction result has three columns: the molecule (showing the molecule diagram), the predicted class (showing the class, 0 or 1, and the confidence of the prediction), and the prediction. Class 0 means the molecule is stable; class 1 means it is unstable. In the first image, the first prediction shows 0 (0.95), meaning the model is 95% confident that the molecule is stable.
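As a sketch of the convention just described, here is a hypothetical helper (not part of the ADME web app) showing how a cell like "0 (0.95)" in the predicted-class column can be read back into words.

```python
# Hypothetical helper: turn a (class, confidence) pair from the results table
# into a human-readable sentence. Class 0 = stable, class 1 = unstable.
def interpret_rlm_prediction(label: int, confidence: float) -> str:
    name = "stable" if label == 0 else "unstable"
    return f"{confidence:.0%} confident the molecule is {name}"

print(interpret_rlm_prediction(0, 0.95))  # 95% confident the molecule is stable
```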

I ran predictions on the Human Liver Cytosolic Stability too because I was curious. I wanted to see if the compounds that are stable in rats are also stable in humans. It turns out the second compound was stable in rats but unstable in humans. Also, the model's confidence for the Human Liver Cytosolic Stability is much lower than that of the Rat Liver Microsomal Stability.

Compare results with the Ersilia Model Hub implementation!

I navigated to the Ersilia Model Hub. I clicked on the tab Microsomal stability. It narrowed down the search to Human Liver Microsomal Stability and Rat liver microsomal stability. I clicked on the Rat liver microsomal stability and was redirected to the Ersilia GitHub repo of the model. I read the README.md in the repo and clicked on the DockerHub link.

Ersilia RLM GitHub

I found the Ersilia identifier of the model, eos5505, in the GitHub repo.

I opened a terminal and ran the code below to fetch the model.

ersilia -v fetch eos5505

I got this error.

Traceback (most recent call last):
  File "/home/affiah/anaconda3/lib/python3.11/site-packages/docker/api/client.py", line 214, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/affiah/anaconda3/lib/python3.11/site-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
                        ^^^^^^^^^^^^^^
...

  File "/home/affiah/anaconda3/lib/python3.11/site-packages/docker/api/client.py", line 221, in _retrieve_server_version
    raise DockerException(
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

Based on the discussion in the Slack channel, I decided to use a VPN. I used ProtonVPN, and the model fetch was successful.

Output:

23:49:41 | DEBUG    | Schema: {'input': {'key': {'type': 'string'}, 'input': {'type': 'string'}, 'text': {'type': 'string'}}, 'output': {'outcome': {'type': 'numeric_array', 'shape': (1,), 'meta': ['rlm_proba1']}}}
23:49:41 | DEBUG    | Done with the schema!
23:49:41 | DEBUG    | This is the schema {'input': {'key': {'type': 'string'}, 'input': {'type': 'string'}, 'text': {'type': 'string'}}, 'output': {'outcome': {'type': 'numeric_array', 'shape': (1,), 'meta': ['rlm_proba1']}}}
23:49:41 | DEBUG    | API schema saved at /home/affiah/eos/dest/eos5505/api_schema.json
23:49:53 | DEBUG    | Fetching eos5505 done in time: 0:19:30.772344s
23:49:53 | INFO     | Fetching eos5505 done successfully: 0:19:30.772344
👍 Model eos5505 fetched successfully!

I served the model with the command below.

ersilia serve eos5505

Output:

🚀 Serving model eos5505: ncats-rlm

   URL: http://127.0.0.1:59227
   PID: 39560
   SRV: conda

👉 To run model:
   - run

   These APIs are also valid:
   - predict

💁 Information:
   - info

I ran the model using the code below.

ersilia api run -i ~/Desktop/eml_canonical.csv -o output.csv

I got this error:

  File "/home/affiah/Downloads/ersilia/ersilia/core/model.py", line 353, in api
    return self.api_task(
           ^^^^^^^^^^^^^^
  File "/home/affiah/Downloads/ersilia/ersilia/core/model.py", line 368, in api_task
    for r in result:
  File "/home/affiah/Downloads/ersilia/ersilia/core/model.py", line 195, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
  File "/home/affiah/Downloads/ersilia/ersilia/serve/api.py", line 319, in post
    for res in self.post_unique_input(
  File "/home/affiah/Downloads/ersilia/ersilia/serve/api.py", line 290, in post_unique_input
    or not schema.is_h5_serializable(api_name=self.api_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/affiah/Downloads/ersilia/ersilia/serve/schema.py", line 91, in is_h5_serializable
    schema = self.get_output_by_api(api_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/affiah/Downloads/ersilia/ersilia/serve/schema.py", line 88, in get_output_by_api
    return self.schema[api_name]["output"]
           ~~~~~~~~~~~^^^^^^^^^^
KeyError: 'run'

I decided to just test on one of the observations with the code below.

ersilia run -i "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"

Output:

{
    "input": {
        "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
        "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
        "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
    },
    "output": {
        "outcome": [
            0.595
        ]
    }
}

It works!
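Since the single-molecule result above is plain JSON, the outcome can also be pulled out programmatically. A minimal sketch; the SMILES is truncated here for brevity, and the key names are copied from the output shown above.

```python
import json

# Parse the JSON result printed by `ersilia run` and extract the outcome value.
raw = """
{
  "input": {"key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
            "input": "C[C@]12CC...",
            "text": "C[C@]12CC..."},
  "output": {"outcome": [0.595]}
}
"""
result = json.loads(raw)
outcome = result["output"]["outcome"][0]
print(outcome)  # 0.595
```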

I just need to figure out how to make the CSV file run.

I tried the command ersilia run -i ~/Desktop/eml_canonical.csv -o output.csv and it worked!

Here's the data in the output.csv file:

| key | input | rlm_proba1 |
| --- | --- | --- |
| MCGSCOLBFJQGHM-SCZZXKLOSA-N | Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 | 0.049 |
| GZOSMCIZMLWJML-VJLLXTKPSA-N | C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | 0.595 |
| BZKPWHYZMXOIDC-UHFFFAOYSA-N | CC(=O)Nc1sc(nn1)S(=O)=O | 0 |
| QTBSBXVTEAMEQO-UHFFFAOYSA-N | CC(O)=O | 0 |
| PWKSKIMOESPYIA-BYPYZUCNSA-N | CC(=O)NC@@HC(O)=O | 0 |
| BSYNRYMUTXBXSQ-UHFFFAOYSA-N | CC(=O)Oc1ccccc1C(O)=O | 0.13 |
| MKUXAQIIEYXACX-UHFFFAOYSA-N | NC1=NC(=O)c2ncn(COCCO)c2N1 | 0.001 |
| ASMXXROZKSBQIH-VITNCHFBSA-N | OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | 0.983 |
| ULXXDDBFHOBEHA-CWDCEQMOSA-N | CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 | 0.305 |
| HXHWSAZORRCQMX-UHFFFAOYSA-N | CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 | 0.542 |
| OFCNXPDARWKPPY-UHFFFAOYSA-N | O=C1N=CN=C2NNC=C12 | 0 |
| YVPYQUNUQOZFHG-UHFFFAOYSA-N | CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(O)=O)c1I | 0.003 |
| LKCWBDHBTVXHDL-RMDFUYIESA-N | NCCC@HC(=O)N[C@@H]1CC@HC@@HC@H[C@H]1O[C@H]3OC@HC@@HC@H[C@H]3O | 0.003 |
| XSDQTOBWRPYKKA-UHFFFAOYSA-N | NC(N)=NC(=O)c1nc(Cl)c(N)nc1N | 0.013 |
| IYIKLHRQXLHMJQ-UHFFFAOYSA-N | CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3 | 0.954 |
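A sketch of how output.csv could be post-processed with the standard library, bucketing molecules by the rlm_proba1 column. The two sample rows and the 0.5 cutoff are illustrative assumptions; rlm_proba1 reads as the probability of class 1 (the unstable class) per the class convention described earlier.

```python
import csv
import io

# Illustrative subset of output.csv (SMILES truncated in the first row).
sample = """key,input,rlm_proba1
GZOSMCIZMLWJML-VJLLXTKPSA-N,CC...,0.595
BSYNRYMUTXBXSQ-UHFFFAOYSA-N,CC(=O)Oc1ccccc1C(O)=O,0.13
"""
rows = list(csv.DictReader(io.StringIO(sample)))
# Keep the keys of molecules the model leans toward calling unstable.
likely_unstable = [r["key"] for r in rows if float(r["rlm_proba1"]) >= 0.5]
print(likely_unstable)  # ['GZOSMCIZMLWJML-VJLLXTKPSA-N']
```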

Observations

Ersilia's model predictions match the original model predictions.

Install and run Docker!

I installed Docker when I set up the Ersilia Model Hub in the week 1 task. I installed Docker Desktop for Ubuntu from here. I created an account on Docker Hub using this link.

julietowah commented 9 months ago

hi @Inyrkz, I ran the application with python app.py, but it's been loading various models for different predictions for a long time now. Hope I'm still on the right track.

Inyrkz commented 9 months ago

Hi @julietowah, you are on the right track. It will take a while to download and load the models. I think the models are about 5GB in size. Just keep an eye on it. If it stops running without showing you the 127.0.0.1 link to navigate to, then run the python app.py command again.

Once it's done, you'll see this link http://127.0.0.1:5000/. Navigate to it on your browser and you'll be good to go.

julietowah commented 9 months ago

thank you so much

Inyrkz commented 9 months ago

You are welcome

DhanshreeA commented 8 months ago

Hi @Inyrkz thank you for the detailed updates, and all the debugging efforts. You can continue your week 3 tasks in this issue.

Inyrkz commented 8 months ago

You're welcome. I'm always happy to help. I'll add my week 3 task.

Inyrkz commented 8 months ago

Week 3 - Propose new models

Proposed Model 1: Quantitative Toxicity Prediction via Meta Ensembling of Multitask Deep Learning Models

Link to paper: here
Link to GitHub repo: here

Short Story

Here’s a short story. My friend, Samuel, was sick. His doctor recommended some drugs. We went to find the drugs but most of the pharmacies around the area didn’t have them. We finally found the drugs at a pharmacy. The pharmacist, Idara, looked at the drug list and from the expression on her face, I knew these drugs were a big deal. They were drugs she couldn’t give out without a prescription. She had to fill out some forms first before she could sell the drugs.

She had a cool sense of humour. She didn’t want my friend, Samuel, to feel weird about buying those drugs. I remember at some point she called the drugs poison. This part got my attention. I wondered why a drug designed to help someone could be a poison. I did a little research about drugs and found out about the toxic properties of drugs. Some drugs are toxic which is why they have to be taken in the recommended dosage. Before a drug candidate can be approved it must be screened to make sure it is safe for consumption.

Task: Regression
Tag: Antimicrobial activity, Antiviral activity, Toxicity
Mode: Pretrained
Input Shape: Single
Output Shape: List
Input: Compound
Output: Score
Output Type: Float

Relevance to Ersilia

People have good intentions when they make drugs, but sometimes the drugs they make can be toxic to people. This is why I’m also interested in this project about toxicity. It is about predicting the toxicity of substances. When scientists are researching drugs to make, they can use AI to check the toxicity of the substances they are using to make the drugs.

I noticed Ersilia already has several toxicity-prediction models, including: Toxicity prediction across the Tox21 panel with semi-supervised learning, Toxicity and synthetic accessibility prediction, ToxCast toxicity panel, Toxicity at clinical trial stage, HepG2 Toxicity - MMV, and S2DV HepG2 toxicity. Adding this model would expand the collection and give researchers more options.

The authors call their model QuantitativeTox (That is a cool name). They trained it on four datasets: LD50, IGC50, LC50, and LC50-DM. These datasets contain information about how toxic substances are. They trained their model using an ensemble of five different deep-learning models. In simpler terms, they combined five deep learning models into one.

It’s like the concept of two heads are better than one.

This ensemble approach is more robust and powerful. To verify the model’s performance, they compared it with the best existing model, TopTox (another cool name). QuantitativeTox outperformed TopTox on three of the four datasets.

Researchers can use this model for four kinds of prediction: LD50, IGC50, LC50, and LC50-DM. I did some research to find out what these terms mean. LD50 means lethal dose for 50% of the population; the lower the LD50 value, the more toxic the substance is considered.

From the research paper, I learned that the LD50 dataset gives information about the amount of chemical substance needed to cause death in 50% of a group of rats when the chemical is given to them orally (like swallowing it). Usually, when a chemical is taken orally, it causes less harm compared to when it's injected directly into the bloodstream. The LD50 dataset helps us understand how toxic a substance is when ingested orally.

IGC50 means Inhibitory Growth Concentration for 50%. That is the concentration of an antimicrobial agent, e.g. an antibiotic, required to inhibit the growth of a culture by 50%, a measure of toxicity to microorganisms. The lower the IGC50 value, the more potent the substance. From the research paper, the IGC50 dataset gives the concentration of a chemical compound needed to arrest the growth of Tetrahymena pyriformis after 40 hours of exposure.

LC50 means Lethal Concentration for 50% of the Population. It is used for gases or airborne particles. It represents the concentration of a substance in the air that, when inhaled by test subjects, results in the death of 50% of the subjects. The dataset gives information on the toxicity of a given compound on fathead minnow, a species of temperate freshwater fish after 96 hours of exposure.

LC50-DM means Lethal Concentration for 50% of a Daphnia magna population. It is similar to LC50. The dataset gives the concentration of a compound in water, in milligrams per litre, that kills 50% of a Daphnia magna population after 48 hours.

This model isn’t just limited to predicting the toxicity of drug candidates. It also helps predict the toxicity of substances to microbes like bacteria.
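The four endpoints above can be condensed into a small lookup table (descriptions paraphrased from the paper; this is just a summary for reference, not part of the model's interface).

```python
# Summary of the four toxicity endpoints covered by QuantitativeTox.
ENDPOINTS = {
    "LD50": "oral dose lethal to 50% of rats",
    "IGC50": "concentration arresting Tetrahymena pyriformis growth by 50% (40 h)",
    "LC50": "concentration lethal to 50% of fathead minnow (96 h)",
    "LC50-DM": "concentration lethal to 50% of Daphnia magna (48 h)",
}
for name, desc in ENDPOINTS.items():
    print(f"{name}: {desc}")
```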

More Details

From the units of the toxicity measures, I can tell that this is a regression problem. The original datasets are found in Ecotox and chemidheavy. The models were evaluated using three metrics: R-squared, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

Implementation of the Model

I love how the model usage is documented in their GitHub repo here.

I would start by cloning the repo.

git clone https://github.com/Abdulk084/QuantitativeTox

Navigate into the repository.

cd QuantitativeTox

The model was tested on Ubuntu 20.04 with Python 3.7.7. I use the Ubuntu 22.04 operating system, so this won’t be an issue. They also use conda. The next step is to restore the environment. The environment.yml file is available for this.

conda env create -f environment.yml

I activate the virtual environment using the command below.

conda activate qtox

Install PyBioMed.

cd PyBioMed
python setup.py install
cd ..

After installing QuantitativeTox, it can be tested on the four tasks: LD50, IGC50, LC50, and LC50-DM.

To test on the LD50 task, run the command.

cd LD50
python LD50_test.py

The output is a CSV file with the name LD50_test_results.csv.

To test on the IGC50 task, run the command.

cd ..
cd IGC50
python IGC50_test.py

The output is a CSV file with the name IGC50_test_results.csv.

To test on the LC50 task, run the command.

cd ..
cd LC50
python LC50_test.py

The output is a CSV file with the name LC50_test_results.csv.

To test on the LC50DM task, run the command.

cd ..
cd LC50DM
python LC50DM_test.py

The output is a CSV file with the name LC50DM_test_results.csv.
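The four test runs above all follow the same pattern, so the commands can be generated in a loop. A sketch; the directory and script names are taken from the repo layout described above, and the commands are only printed here, not executed.

```python
# Build the per-task test commands for QuantitativeTox.
tasks = ["LD50", "IGC50", "LC50", "LC50DM"]
commands = [f"cd {t} && python {t}_test.py && cd .." for t in tasks]
for cmd in commands:
    print(cmd)
# Each run writes <task>_test_results.csv in its own task directory.
```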

Here’s a sample of what the output file looks like.

| pred_test_ext_stack_load_IGC50 | test_ext_IGC50_meta_r2 | test_ext_IGC50_meta_mae | test_ext_IGC50_meta_rmse |
| --- | --- | --- | --- |
| 2.7041178 | 0.8611876765738696 | 0.26909787510227223 | 0.3659300114316411 |
| 1.3349031 | | | |
| 4.9750233 | | | |
| 2.391333 | | | |
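The results file mixes per-molecule predictions with dataset-level metrics that appear only on the first data row. A sketch of separating the two with the standard library; column names and values are copied from the sample above.

```python
import csv
import io

# Inline copy of the sample results file shown above.
sample = """pred_test_ext_stack_load_IGC50,test_ext_IGC50_meta_r2,test_ext_IGC50_meta_mae,test_ext_IGC50_meta_rmse
2.7041178,0.8611876765738696,0.26909787510227223,0.3659300114316411
1.3349031,,,
4.9750233,,,
2.391333,,,
"""
rows = list(csv.DictReader(io.StringIO(sample)))
# One prediction per row; the metrics live only on the first row.
predictions = [float(r["pred_test_ext_stack_load_IGC50"]) for r in rows]
r2 = float(rows[0]["test_ext_IGC50_meta_r2"])
print(len(predictions), r2)
```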

I didn’t see any information on the structure of the input file. So I dug a little deeper. In the research paper, I saw that they used the preprocessed train and test sets, which are pairs of SMILES strings and toxicity measures, from TopTox. When I opened the LD50_test.py Python script, I found out they used TensorFlow for their implementation. TensorFlow is my favourite deep learning framework. The input file is also a CSV file named external_test.csv.

The model checkpoints are provided.

Inyrkz commented 8 months ago

Proposed Model 2: EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

The research paper can be found here. The GitHub repo is here.

Short Story

I've always wondered how a drug works. How does a drug you take for headaches, like paracetamol, find its way from your stomach to your head to cure the headache? In my biology class, we were taught about the digestive system and how the body digests food. But no one mentioned anything about drugs :eyes: .

I keep learning new things as I research.

Fun fact: Drug discovery is an expensive process. It costs around 1 billion dollars to make a single drug. It can take up to 10 years of development and testing before a drug can be FDA-approved.

Task: Regression
Tag: Drug-likeness, Molecular weight, Permeability, Similarity, Synthetic accessibility
Mode: Pretrained (the model can also be retrained)
Input Shape: Pair of lists
Output Shape: List
Input: Compound
Output: Descriptor
Output Type: Float

Why?

Knowing how a drug molecule interacts with a specific protein is a big challenge in drug discovery. This paper introduces EquiBind, a method to predict how molecules bind to their target proteins, including the location and orientation of the binding. It also focuses on speed, since fast models enable fast virtual screening and drug engineering. EquiBind is very fast and outperforms traditional baselines in quality.

A major problem this paper addresses is understanding how drug-like molecules (ligands) interact and form complexes (structures formed from combining molecules) with target proteins (receptors) – drug binding – which is a requirement for virtual screening. Solving this problem will go a long way in drug discovery.

Figure 1. High-level overview of the structural drug binding problem tackled by EquiBind. Source: research paper

From the image above, the process begins with a graph representation of the ligand and a random 3D conformer of it, generated by RDKit/ETKDG, before it is bound to anything. This work only models the flexibility of the ligand and assumes that the protein is rigid.

Relevance to Ersilia

After checking the Ersilia Model Hub, I found a limited number of models relating to the prediction of a drug-binding structure. I saw 3 models on drug-likeness and 7 models on similarity. This drug-binding prediction is useful in drug discovery. It will help scientists quickly identify potential drug candidates and how they interact with specific proteins.

Having more models relating to drug binding on Ersilia would go a long way in speeding up how long it takes to discover new drugs.

Dataset

The authors used a new time-based dataset split and preprocessing pipeline for this project. They used the protein-ligand complexes from PDBBind. PDBBind is a subset of the Protein Data Bank (PDB) that provides 3D structures of individual proteins and complexes. The latest version, PDBBind v2020, contains 19,443 protein-ligand complexes with 3,890 unique receptors and 15,193 unique ligands.

Their test set contained complexes that were discovered in 2019 or after. The training set and validation set only used older complexes. For the data preprocessing, they dropped all complexes that couldn’t be processed by the RDKit library. The data was reduced from 19,443 protein-ligand complexes to 19,119 complexes. Each ligand and receptor was processed with OpenBabel.

Model Implementation

A deep neural network architecture was used to build EquiBind. The authors optimized the model using the Adam optimizer. They employed early stopping with a patience of 150 epochs, based on the percentage of predicted validation-set complexes with an RMSD below 2 Å. The hidden dimensions explored were (32, 64, 100). They used the following activation functions: Leaky-ReLU, ReLU, and SELU. Dropout was applied (0, 0.05, 0.1, 0.2). They applied the following normalization techniques: BatchNorm, LayerNorm, and GraphNorm.
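The early-stopping rule described above can be sketched as follows: training stops once the tracked metric (here, the percentage of validation complexes with RMSD below 2 Å) has not improved for `patience` epochs. This is illustrative only, not the authors' training code.

```python
# Illustrative early stopping: return the epoch at which training would stop,
# or None if the metric keeps improving within the patience window.
def early_stop_epoch(metric_history, patience=150):
    best, best_epoch = float("-inf"), 0
    for epoch, metric in enumerate(metric_history):
        if metric > best:
            best, best_epoch = metric, epoch  # new best: reset the counter
        elif epoch - best_epoch >= patience:
            return epoch                      # no improvement for `patience` epochs
    return None

print(early_stop_epoch([0.10, 0.30, 0.20, 0.20, 0.20], patience=2))  # 3
```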

Model Usage

Setting Up the Environment

  1. Clone the GitHub repo: git clone https://github.com/HannesStark/EquiBind. The processed dataset for the project is available on Zenodo. To use it, I can download it, unzip it, and put it in the data directory of the repo.

Inputs: The ligand files can be in the formats .mol2, .sdf, .pdbqt or .pdb, and their names must contain the string ligand (the ligand files should contain all hydrogens). The receptor files are in the format .pdb, and their names must contain the string protein. For each complex we want to predict, we need a directory containing the ligand and receptor files, like this:

my_data_folder
└───name1
    │   name1_protein.pdb
    │   name1_ligand.sdf
└───name2
    │   name2_protein.pdb
    │   name2_ligand.mol2
  2. I’d create a new environment with all required packages using environment.yml. Since I’ll be using a CPU, I’ll run conda env create -f environment_cpuonly.yml.

  3. Activate the virtual environment: conda activate equibind. These are the required packages:

python=3.7
pytorch 1.10
torchvision
cudatoolkit=10.2
torchaudio
dgl-cuda10.2
rdkit
openbabel
biopython
biopandas
pot
dgllife
joblib
pyaml
icecream
matplotlib
tensorboard
  4. Predict binding structures. In the config file configs_clean/inference.yml, I’d set the path to our input data folder, inference_path: path_to/my_data_folder. Then run:
python inference.py --config=configs_clean/inference.yml

Our results would be saved as .sdf files in the directory specified in the config file under output_directory: 'data/results/output', and as tensors at runs/flexible_self_docking/predictions_RDKitFalse.pt.

We can also run inference for multiple ligands in the same .sdf file against a single receptor with the command below.

python multiligand_inference.py -o path/to/output_directory -r path/to/receptor.pdb -l path/to/ligands.sdf

This runs EquiBind on every ligand in ligands.sdf against the protein in receptor.pdb. The outputs are 3 files in output_directory with the following names and contents:

- failed.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference failed in a way that was caught and handled.
- success.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference succeeded.
- output.sdf - contains the conformers produced by EquiBind in .sdf format.
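The per-complex input layout described earlier (one file with "protein" in its name and one with "ligand" in its name per directory) can be sanity-checked before running inference. A hypothetical helper, not part of the EquiBind repo:

```python
import tempfile
from pathlib import Path

# Check that a complex directory matches the layout EquiBind expects:
# one receptor file named *protein* and one ligand file named *ligand*.
def check_complex_dir(d: Path) -> bool:
    names = [p.name for p in d.iterdir()]
    return any("protein" in n for n in names) and any("ligand" in n for n in names)

# Demo with a throwaway directory mimicking my_data_folder/name1.
with tempfile.TemporaryDirectory() as tmp:
    c = Path(tmp) / "name1"
    c.mkdir()
    (c / "name1_protein.pdb").touch()
    (c / "name1_ligand.sdf").touch()
    print(check_complex_dir(c))  # True
```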

Inyrkz commented 8 months ago

Proposed Model 3: DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

The research on EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction led me to this project. The authors of EquiBind also came up with DiffDock.

The research paper can be found here. The GitHub repo is here.

Task: Generative Tag: Drug-likeness, ADME, Permeability, Similarity, Synthetic accessibility, Microsomal stability Mode: Pretrained (the model can also be retrained) Input Shape: Single Output Shape: List Input: Compound Output: Descriptor Output Type: Float

About Model

At first sight, the project topic looked weird. Then I did some research on molecular docking. I learned that it involves the simulation of how two or more molecules, typically a small ligand (such as a drug candidate) and a receptor (such as a protein), interact at the molecular level. It is used to predict and study the binding interactions between these molecules.

[Image of the docking process, from Wikipedia]

DiffDock is a state-of-the-art model for molecular docking. Newer deep learning methods that treat docking as a regression problem run faster than traditional search-based methods, but without improved accuracy. The creators of DiffDock instead frame molecular docking as a generative modelling problem: DiffDock is a diffusion generative model. It runs fast and gives high-accuracy predictions with confidence estimates.

Relevance to Ersilia

I didn’t see many models related to drug binding on the Ersilia Model Hub. I did find EState Molecular Descriptors, Ersilia Compound Embeddings, Chemical Checker Signaturizer, Human Plasma Protein Binding (PPB) of Compounds, and Avalon Fingerprint. This model would be a good addition. Plus, it’s state-of-the-art.

The DiffDock model can help scientists and researchers run simulations and learn how two or more molecules, such as a drug candidate and a receptor (protein) interact at the molecular level. (At this point, I already feel like a scientist)

Ersilia also focuses on Microsomal Stability (like the Rat liver microsomal stability model I worked on :wink:). Molecular docking can be useful in predicting the binding of a drug candidate to specific enzymes in the liver.

Dataset

The authors used the molecular complexes in PDBBind, which were extracted from the Protein Data Bank (PDB). They used the time-split of PDBBind, with 17k complexes from 2018 or earlier for training/validation and 363 test structures from 2019 with no ligand overlap with the training complexes. The dataset is available on Zenodo. The files were preprocessed with Open Babel, and the reduce library was then used to add potentially missing hydrogens, correct existing hydrogens, and correctly flip histidines.
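The time-split they describe can be sketched in a few lines (the complexes list and its fields are made-up stand-ins for the PDBBind metadata, not the real dataset):

```python
def time_split(complexes, cutoff_year=2018):
    """Train/val on complexes up to cutoff_year; test on later ones,
    dropping any test complex whose ligand also appears in training."""
    train = [c for c in complexes if c["year"] <= cutoff_year]
    train_ligands = {c["ligand"] for c in train}
    test = [c for c in complexes
            if c["year"] > cutoff_year and c["ligand"] not in train_ligands]
    return train, test

complexes = [
    {"name": "1abc", "year": 2016, "ligand": "LIG1"},
    {"name": "2def", "year": 2018, "ligand": "LIG2"},
    {"name": "3ghi", "year": 2019, "ligand": "LIG1"},  # ligand overlap -> excluded
    {"name": "4jkl", "year": 2019, "ligand": "LIG3"},
]
train, test = time_split(complexes)
print(len(train), len(test))  # 2 1
```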

Model Implementation

They used convolutional networks. The architecture is broken into the embedding layer, the interaction layer, and the output layer.

Model Usage

Setting Up the Environment

  1. The environment setup requires Anaconda. The repo can be cloned using the command below:
git clone https://github.com/gcorso/DiffDock.git
  2. Set up a working Conda environment to run the code. This is the example they gave for setting it up. They advise using the correct pytorch, pytorch-geometric, and cuda versions, or the CPU-only versions.
conda create --name diffdock python=3.9
conda activate diffdock
conda install pytorch==1.11.0 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cu117.html
python -m pip install PyYAML scipy "networkx[default]" biopython rdkit-pypi e3nn spyrmsd pandas biopandas

Since I’m using a CPU, I’ll adapt the steps:

  1. Create the Conda environment:

conda create --name diffdock python=3.9

  2. Activate the Conda environment:

conda activate diffdock

  3. Install PyTorch without CUDA (for CPU support):

conda install pytorch==1.11.0 cpuonly -c pytorch

  4. Install the remaining Python packages with pip:

    pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cpu.html
    python -m pip install PyYAML scipy "networkx[default]" biopython rdkit-pypi e3nn spyrmsd pandas biopandas

  5. I’ll install ESM for both protein sequence embeddings and for protein structure prediction (in case you only have the sequence of your target).

OpenFold (and therefore ESMFold) requires a GPU. Since I don't have a GPU, I can still use DiffDock with existing protein structures. Another option would be to use Google Colaboratory or a Linux EC2 instance on AWS with a GPU.

Running DiffDock on your own complexes

DiffDock supports multiple input formats depending on whether you want to make predictions for a single molecule complex or for many at once.

The protein inputs need to be .pdb files or sequences that will be folded with ESMFold. The ligand input can either be a SMILES string or a file type that RDKit can read like .sdf or .mol2.

- For a single complex: specify the protein with --protein_path protein.pdb or --protein_sequence GIQSYCTPPYSVLQDPPQPVV and the ligand with --ligand ligand.sdf or --ligand "COc(cc1)ccc1C#N"

- For many complexes: create a CSV file with paths to protein and ligand files or SMILES strings. Its columns are complex_name (name used to save predictions; can be left empty), protein_path (path to a .pdb file; if empty, the sequence is used), ligand_description (SMILES string or file path) and protein_sequence (folded with ESMFold when protein_path is empty). An example .csv is at data/protein_ligand_example_csv.csv and you would use it with --protein_ligand_csv protein_ligand_example_csv.csv.
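Building that CSV programmatically is straightforward (a sketch using the stdlib csv module; the file paths, names, and output filename are placeholders, not files from the DiffDock repo):

```python
import csv

# Columns expected by DiffDock's --protein_ligand_csv input
FIELDS = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]

rows = [
    {"complex_name": "complex1",
     "protein_path": "data/protein1.pdb",
     "ligand_description": "COc(cc1)ccc1C#N",   # ligand as a SMILES string
     "protein_sequence": ""},
    {"complex_name": "complex2",
     "protein_path": "",                         # empty: fold the sequence with ESMFold
     "ligand_description": "data/ligand2.sdf",   # ligand as a file path
     "protein_sequence": "GIQSYCTPPYSVLQDPPQPVV"},
]

with open("my_protein_ligand.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

You would then point DiffDock at the file with --protein_ligand_csv my_protein_ligand.csv.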

And you are ready to run inference:

python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

The DiffDock model can also be retrained.

Inyrkz commented 8 months ago

@DhanshreeA I've added my week 3 tasks.

Inyrkz commented 8 months ago

Extra

I came across another model while researching. I enjoyed reading the research paper. It was interesting. I learned about antimicrobial peptides.

Proposed Model 4: Prediction of antimicrobial peptides toxicity based on their physicochemical properties using machine learning techniques

The research paper is here.

The GitHub repo is here.

Antimicrobial peptides are molecules in our bodies that protect us from harmful microbes (bacteria, viruses, etc.). They are effective in fighting antibiotic resistance of bacteria. While these molecules can protect us from microorganisms, some of them are actually harmful to us.

This research work was published in November 2021. They used an updated dataset to train a machine learning model to predict the toxicity of antimicrobial peptides. They applied feature selection to extract the key features behind the toxicity of antimicrobial peptides. After using this feature selection technique, the trained hybrid model had a performance with a recall of 87.6% and an F1-score of 84.9%.

Tag: 'cytotoxicity', 'antibacterial activity' Task: Classification Mode: Pretrained Input Shape: List Output Shape: List Input: Compound Output: List Output Type: Boolean

The DBAASP dataset was used for this research project. The dataset gives access to the latest experimental data of antimicrobial peptides, antimicrobial activity and toxicity. The toxicity types included in the data are HC50, CC50, and MIC.

The authors' goal wasn't just to classify antimicrobial peptides as toxic or non-toxic. They also wanted to know the properties responsible for the toxicity. The properties could be based either on the amino acid sequence of the peptides or on their physicochemical nature.

The Propy package was used to extract 1541 features from the peptide sequences. They used two methods for feature selection: L1-SVM and tree-based feature selection with cross-validation (5-fold). They were able to reduce the features from 1276 to 90.

The SVC (rbf), LinearSVC, Random Forest, KNN, and hybrid models were trained and optimized on the training data. GridSearchCV (10 fold) was used to optimize the models by selecting the best hyperparameters for each model. The models were evaluated using precision, recall, f1-score, AUC and hamming distance.
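The L1-SVM selection step can be illustrated with scikit-learn's SelectFromModel (a toy sketch on random data; the feature matrix and labels here are synthetic, not the authors' Propy features or DBAASP labels):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.rand(200, 50)                     # 200 "peptides", 50 made-up features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # toy toxic / non-toxic labels

# The L1 penalty drives most coefficients to zero;
# SelectFromModel keeps only the features with non-zero weight
svc = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000).fit(X, y)
selector = SelectFromModel(svc, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

The same pattern with an ExtraTreesClassifier in place of LinearSVC would illustrate their tree-based alternative.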

Why it would be relevant to Ersilia?

Researchers are designing antimicrobial peptides to make them more harmful to microorganisms and less harmful to human cells. This model can help them do that. It would make a good addition to Ersilia’s model catalog.

How would you implement it (look at its code and whether it is ready to be used, e.g. are the model checkpoints provided, is the underlying data available?)

The following packages are required to run the model:

Requirements        Version
scikit-learn         0.22
numpy                1.17.4
jupyter              1.0.0
jupyter-client       5.3.4
jupyter-console      6.0.0
jupyter-core         4.6.1
ipc                  1.0
pandas               0.25.3
propy3               1.0.0a2

I'd create a new virtual environment with Conda and install the packages mentioned above.

From the repo, the model checkpoints are already available.

The ReadMe.txt file gives instructions on how to run the model:

1- Run the RunMe.ipynb Jupyter notebook file.
2- Put your peptide sequences in the b list as elements. An input sample is shown below:

b=["GFVDFLKKVAGTIAN","FLGGLIKIAMICAVTKKC","AGCSGVAHTRFGSSACNPFGWK","KKGLAKKWAGLKLAGLA"]

3- Run the next cell. It contains this code:

%run -i toxicityCalculator
4- The output shows the results of the Random Forest and SVM models:
    Random Forest Classifier
    1 ['non-toxic']
    2 ['toxic']
    3 ['non-toxic']
    4 ['non-toxic']
    -------------------------
    Support Vector Classifier
    1 ['non-toxic']
    2 ['toxic']
    3 ['non-toxic']
    4 ['non-toxic']
GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!