ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Pradnya #628

Closed. Pradnya2203 closed this issue 1 year ago.

Pradnya2203 commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

Pradnya2203 commented 1 year ago

Motivation Statement:

I first heard about Outreachy from a friend and was truly pleased by its idea of supporting diversity and encouraging under-represented groups from all around the world. I am a sophomore at IIT Roorkee and am also a part of various student technical clubs related to software development and data science.

I was quite excited to learn that my application was approved, and while going through the projects I came across Ersilia, which seemed very appealing for multiple reasons. Firstly, the cause: providing medical resources to under-developed countries. I have always wanted to help people using my skills and would be delighted to contribute to such a cause. Secondly, the tech stack suits me and would help my future goal of pursuing a career in data science.

I have worked with various languages like Python, JavaScript, C++, PHP, and MATLAB, and would like to get a strong hold on Python during this internship period.

Ersilia will be a great opportunity to improve my skills as well as work for the betterment of society. I am really looking forward to contributing to this project and learning a lot in the process.

GemmaTuron commented 1 year ago

Hi @Pradnya2203

Thanks for your interest and welcome to Ersilia! Please, if you have successfully installed Ersilia and run a test model, report it here and also let us know which system you are using. Thanks!

Pradnya2203 commented 1 year ago

Hey, I am using Ubuntu 22.04 and did run the sample model; the steps I followed are sketched below. We are supposed to fork the repository and then start contributing, right?
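A sketch of the setup, assuming installation from source as described in the Ersilia book (eos3b5e is the sample model used in the documentation; exact commands may differ slightly per version):

    # install Ersilia in a fresh conda environment
    conda create -n ersilia python=3.7 -y
    conda activate ersilia
    git clone https://github.com/ersilia-os/ersilia.git
    cd ersilia
    pip install -e .

    # fetch, serve and run the sample model
    ersilia fetch eos3b5e
    ersilia serve eos3b5e
    ersilia api -i "CCC"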

GemmaTuron commented 1 year ago

Hi @Pradnya2203 !

Please read the guidelines for the contribution period. This time around, in order to better support all applicants, we have set up a set of defined tasks to be completed each week: https://ersilia.gitbook.io/ersilia-book/contributors/internships/outreachy-summer-2023

In addition, we will be handing out specific tasks to interns as soon as we know everyone is set up

GemmaTuron commented 1 year ago

Hi @Pradnya2203

As you will see in issue #343, this model seems to present some issues at fetch time. Can you please test it using both the CLI and the Google Colab template (use the template provided in /notebooks), report whether it works in either system, and attach the log files? When fetching the model, collect the log files and try to identify the source of any errors.
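For the CLI test, redirecting the verbose fetch output to a file should capture the full log (a sketch):

    # fetch in verbose mode and save everything for inspection
    ersilia -v fetch eos3ae > eos3ae_fetch.log 2>&1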

Thanks!

Pradnya2203 commented 1 year ago

eos3ae_error.log

I don't know exactly why I am getting the error "ModuleNotFoundError: No module named 'yaml'". I tried installing pyyaml, but that didn't change anything; I'll keep trying to solve it. I tested using the Google Colab template as well, but the model still doesn't work.

AhmedYusuff commented 1 year ago

Hi @Pradnya2203. From your error log: ('Connection aborted.', OSError(0, 'Error')). This looks like your connection was aborted by the host, probably due to a system error on your end.

I also tried fetching the model on Ubuntu 22.04, but I had to terminate the process because it was taking too long.

neww.log

Pradnya2203 commented 1 year ago

Hey @AhmedYusuff, I was not actually facing that error; I was able to get around that one, but I uploaded the old log file by mistake. I have now uploaded the new log file. Thanks a lot :)

AhmedYusuff commented 1 year ago

You are welcome @Pradnya2203.

In your log file I can see that your model failed when it tried to import yaml: ModuleNotFoundError: No module named 'yaml'

You can use pip show pyyaml to see if you have yaml installed on your system.

Pradnya2203 commented 1 year ago

Yes I have tried that as well @AhmedYusuff

GemmaTuron commented 1 year ago

Hi @Pradnya2203

Important: did you activate the conda environment of the model before installing yaml? You should first run conda activate eos3ae and then pip show pyyaml.
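In full, the check looks like this (a sketch):

    conda activate eos3ae   # the model's own environment
    pip show pyyaml         # verify the package is visible here
    pip install pyyaml      # only if the previous command finds nothing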

Pradnya2203 commented 1 year ago

Hey @GemmaTuron, I installed the module after activating the model's conda environment and checked it using pip show pyyaml, but I'm still getting the same error when I run the model, and when I check again I see no pyyaml in the model's conda environment. I'll try to fix it.

GemmaTuron commented 1 year ago

> Hey @GemmaTuron, I installed the module after activating the model's conda environment and checked it using pip show pyyaml, but I'm still getting the same error when I run the model, and when I check again I see no pyyaml in the model's conda environment. I'll try to fix it.

Hi @Pradnya2203! Thanks, I'd suggest first focusing on the week 2 tasks, and if those are completed on time, then we'll tackle the extra tasks assigned to you :)

Pradnya2203 commented 1 year ago

The model I chose for week 2 was Smiles To IUPAC Translator. This model was particularly interesting to me as it converts a simplified representation of a molecule (SMILES) into a standardized format for naming chemical compounds (IUPAC). This type of translator would be extremely useful in the field of drug discovery, where understanding the chemical structure of molecules is crucial for developing new drugs. By being able to accurately translate SMILES into IUPAC, researchers can obtain important information about a molecule's properties. This information is essential for identifying potential drug targets, predicting how a molecule will interact with other compounds in the body, and designing new drug molecules that can better target specific diseases.

Pradnya2203 commented 1 year ago

I was able to fetch and serve it from the Ersilia Model Hub and got the following output:

    "input": {
        "key": "POLCUAVZOMRGSN-UHFFFAOYSA-N",
        "input": "CCCOCCC",
        "text": "CCCOCCC"
    },
    "output": {
        "outcome": [
            "1-propoxypropane"
        ]
    }
}
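For reference, the commands were along these lines (a sketch; smiles2iupac is the model identifier I also use in the script further below):

    ersilia fetch smiles2iupac
    ersilia serve smiles2iupac
    ersilia api -i "CCCOCCC"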
Pradnya2203 commented 1 year ago

I then tried to actually install and run the original open-source model, which is https://github.com/Kohulan/Smiles-TO-iUpac-Translator#simple-usage. To run the model I created a new file app.py with the following code:


from STOUT import translate_forward, translate_reverse

# SMILES to IUPAC name translation

SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)

# IUPAC name to SMILES translation

IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)

I edited this file to take "1-propoxypropane" as the input and got the following result:

SMILES of 1-propoxypropane is: CCCOCCC.CCCOCCC

I ran into certain issues: initially I couldn't figure out how to actually run it, and when I did, I got the error "[Errno 0] JVM DLL not found". I solved this error using sudo apt install default-jre.

Pradnya2203 commented 1 year ago

After running the model I used the given dataset to get the output. To use the dataset I first filtered out the IUPAC names of the molecules, created an array of strings, and used a for loop to iterate and run the model on all the IUPAC names; the loop is sketched below. I got the following output: translate_reverse.txt
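The loop looked roughly like this (a sketch; the column name "iupac" is an assumption about how I stored the filtered names):

    import pandas as pd
    from STOUT import translate_reverse

    # read the dataset and collect the IUPAC names
    df = pd.read_csv("eml_canonical.csv")
    names = df["iupac"].dropna().tolist()

    # translate each name back to SMILES and save the results
    with open("translate_reverse.txt", "w") as out:
        for name in names:
            smiles = translate_reverse(name)
            out.write(f"SMILES of {name} is: {smiles}\n")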

Pradnya2203 commented 1 year ago

The STOUT model has two functionalities: translate_forward and translate_reverse. translate_forward converts SMILES to IUPAC names and, conversely, translate_reverse converts IUPAC names to SMILES. In the comment above, translate_reverse was used. Now we use translate_forward on the can_smiles column of the given dataset and get the following output: translate_forward.txt

GemmaTuron commented 1 year ago

Hi @Pradnya2203

Great, thanks for this work! Can I ask you, as an extra task, to install the NCATS models (use the development branch of the repo) and test the Human Cytosolic Stability model? @pauline-banye did a lot of work in the previous internship to implement the different NCATS models and I want to make sure those are all working :)

Many thanks!

Pradnya2203 commented 1 year ago

The last step was to run the model from the Ersilia Model Hub on the dataset. For that I fetched and served the model "STOUT: SMILES to IUPAC name translator". To make the model iterate over the entire dataset (https://raw.githubusercontent.com/ersilia-os/ersilia/master/notebooks/eml_canonical.csv), I first processed the data and chose the can_smiles column as my input. I wrote a bash script, ran it from my CLI, and got the following output file: ersilia_output.txt. The bash script was:

#!/bin/bash
# s holds the can_smiles column as an array of SMILES strings
s=()
ersilia serve smiles2iupac
for n in "${s[@]}"
do
    ersilia api -i "$n"
done

Here s contained the whole array of can_smiles strings; one way to populate it is sketched below.
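A sketch of filling s from the dataset (the field number 3 is an assumption about where can_smiles sits in the CSV):

    # skip the header row and extract the can_smiles column
    s=($(tail -n +2 eml_canonical.csv | cut -d',' -f3))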

Pradnya2203 commented 1 year ago

The two runs of the SMILES to IUPAC Translator, using the original source code and the Ersilia Model Hub, give the results posted above. Comparing the two, we see for example that for the input Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1 the original source code outputs: IUPAC name of Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1 is: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol and

    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1",
        "text": "Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1"
    },
    "output": {
        "outcome": [
            "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
        ]
    }
}

for the Ersilia Model Hub implementation.

We can see that the two outputs are nearly identical, differing only in stereochemistry and locant details. Similarly, we can check other inputs using the files posted above.

Pradnya2203 commented 1 year ago

Problems I ran into while running the model, both from the original source code and via the Ersilia Model Hub:

Pradnya2203 commented 1 year ago

Hey @GemmaTuron, I have completed the week 2 tasks using the Smiles To IUPAC Translator model. I have documented all the issues I faced while completing the tasks and have posted the results as well. Apart from this model, I also tried to run the NCATS model, but I was unable to set up its conda environment: it took a very long time to set up and I got an error related to pip and the HTTP connection. I got the same error even after retrying and making sure the network connection was strong enough. I will try to set it up again and continue the task as per your instructions. Also, do I need to make any changes to my task 2 submission? Thank you

GemmaTuron commented 1 year ago

Hi @Pradnya2203

The tasks are fine. You can reach out to Masroor or Zakia, who have also been working on the NCATS model. If you are having issues, what I can suggest is to follow the environment.yml file manually: instead of running conda env create --prefix ./env -f environment.yml, open the .yml file and install the dependencies manually one by one. This will tell you which ones are giving issues (go in order, and create a conda env with the right Python version first); a sketch is below.
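Something along these lines (a sketch; Python 3.8 matches the version visible in your traceback, and the packages shown are just examples, follow the order in environment.yml):

    # create the environment with the pinned python version first
    conda create --prefix ./env python=3.8 -y
    conda activate ./env

    # then install the environment.yml entries one by one, e.g.
    pip install chemprop
    pip install flask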

Pradnya2203 commented 1 year ago

Update: I was able to create the conda environment. The mistake I made before was not setting up chemprop. But app.py is giving errors.

Loading RLM graph convolutional neural network model
Traceback (most recent call last):
  File "app.py", line 20, in <module>
    from predictors.rlm.rlm_predictor import RLMPredictior
  File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 177, in <module>
    rlm_gcnn_scaler, rlm_gcnn_model, rlm_gcnn_model_version = load_gcnn_model()
  File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 148, in load_gcnn_model
    rlm_gcnn_scaler, _ = load_scalers(rlm_gcnn_scaler_path)
  File "/home/pradnya/ncats-adme/server/./predictors/chemprop/chemprop/utils.py", line 132, in load_scalers
    state = torch.load(path, map_location=lambda storage, loc: storage)
  File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 755, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

After searching a bit about the error, I realized that it's an error with the model, so I tried solving it by making sure that chemprop runs well. It took some time, as the packages were not compatible with each other and there were errors installing certain modules, but I was able to fix them all and made sure that chemprop was running. However, I am still facing the same error with app.py. I will try to fix it soon.

Pradnya2203 commented 1 year ago

I think the issue is with accessing the models from the NCATS servers. Clicking any of the models redirects me to an error page, and the site mentioned there appears to be down (screenshots attached).

GemmaTuron commented 1 year ago

Hi @Pradnya2203 !

For the local implementation, you need to make sure you download the right model and place it in the folder manually, since the models cannot be accessed from the server (they stopped maintenance, apparently). Use the links provided in the development branch.

emmakodes commented 1 year ago

> Update: I was able to create the conda environment. The mistake I made before was not setting up chemprop. But app.py is giving errors.
>
> Loading RLM graph convolutional neural network model
> Traceback (most recent call last):
>   File "app.py", line 20, in <module>
>     from predictors.rlm.rlm_predictor import RLMPredictior
>   File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 177, in <module>
>     rlm_gcnn_scaler, rlm_gcnn_model, rlm_gcnn_model_version = load_gcnn_model()
>   File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 148, in load_gcnn_model
>     rlm_gcnn_scaler, _ = load_scalers(rlm_gcnn_scaler_path)
>   File "/home/pradnya/ncats-adme/server/./predictors/chemprop/chemprop/utils.py", line 132, in load_scalers
>     state = torch.load(path, map_location=lambda storage, loc: storage)
>   File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 585, in load
>     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
>   File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 755, in _legacy_load
>     magic_number = pickle_module.load(f, **pickle_load_args)
> _pickle.UnpicklingError: invalid load key, '<'.
>
> After searching a bit about the error, I realized that it's an error with the model, so I tried solving it by making sure that chemprop runs well. But I am still facing the same error with app.py. I will try to fix it soon.

Hello @Pradnya2203, I found a fix for this. Download the model files manually from here:

  1. RLM - https://opendata.ncats.nih.gov/public/adme/models/archived/rlm/gcnn_model-20230201.pt
  2. PAMPA 7.4 - https://opendata.ncats.nih.gov/public/adme/models/archived/pampa/gcnn_model-20230201.pt
  3. SOL - https://opendata.ncats.nih.gov/public/adme/models/archived/solubility/gcnn_model-20230201.pt

and place them in their respective directories inside the models directory, like this: ..\ncats-adme\server\models\rlm and ..\ncats-adme\server\models\pampa

then run: python app.py
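In shell terms, roughly (a sketch; paths are relative to the repo root, and the solubility folder name is an assumption):

    # download each checkpoint into its model folder, keeping the name
    mkdir -p server/models/rlm server/models/pampa server/models/solubility
    wget -P server/models/rlm https://opendata.ncats.nih.gov/public/adme/models/archived/rlm/gcnn_model-20230201.pt
    wget -P server/models/pampa https://opendata.ncats.nih.gov/public/adme/models/archived/pampa/gcnn_model-20230201.pt
    wget -P server/models/solubility https://opendata.ncats.nih.gov/public/adme/models/archived/solubility/gcnn_model-20230201.pt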

Pradnya2203 commented 1 year ago

Update: I manually downloaded the model files and placed them in the right folders, and also installed the right version of every single package needed, but I'm still getting the same error.

Pradnya2203 commented 1 year ago

Update: I was finally able to run the ncats-adme model after a lot of struggle. I was repeatedly getting the same error, which is:

Traceback (most recent call last):
  File "app.py", line 20, in <module>
    from predictors.rlm.rlm_predictor import RLMPredictior
  File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 177, in <module>
    rlm_gcnn_scaler, rlm_gcnn_model, rlm_gcnn_model_version = load_gcnn_model()
  File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 148, in load_gcnn_model
    rlm_gcnn_scaler, _ = load_scalers(rlm_gcnn_scaler_path)
  File "/home/pradnya/ncats-adme/server/./predictors/chemprop/chemprop/utils.py", line 132, in load_scalers
    state = torch.load(path, map_location=lambda storage, loc: storage)
  File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 755, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

I tried everything, from manually installing every package to going into the depths of the code to find the source of the error. Finally I realized that there was a really simple solution: somehow an auto-downloaded corrupt file was the root cause of the error, and just deleting it solved it. This might seem like a trivial issue, but I think it causes huge inconvenience, as the file is auto-downloaded and the error says almost nothing about it; you just keep getting UnpicklingError.

Pradnya2203 commented 1 year ago

After removing the corrupt file I was able to run python app.py, but then I realised that my Ubuntu partition did not have sufficient space and I had to borrow some from Windows. Somehow a simple restart led to loss of data on Ubuntu (no idea how), so I had to set up ncats again. Lastly, app.py took a very long time to run, but it's finally working well, and I get the following output after providing input.csv as input (screenshot attached).

Pradnya2203 commented 1 year ago

So this is the result I get for the Human Cytosolic Stability model. Also, thanks a lot @emmakodes and @GemmaTuron for helping me out with this error.

Pradnya2203 commented 1 year ago

The explanation of the output with SMILES as input is:

mol: gives the structure of the molecule.

Tanimoto Similarity: the Tanimoto (or Jaccard) coefficient T is the most popular similarity measure for comparing chemical structures represented by fingerprints. Two structures are usually considered similar if T > 0.85 (for Daylight fingerprints). In the Tanimoto formulation, A and B are the sets of fingerprint "bits" of molecule A and molecule B, and A ∩ B is the set of bits common to both. The resulting coefficient T(A,B) ranges from 0, when the fingerprints have no bits in common, to 1, when the fingerprints are identical. Thus,

T(A,B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)

The chemical similarity problem then becomes: given molecule A, find all molecules that have a Tanimoto coefficient greater than a given threshold. The greater the threshold, the more similar the retrieved molecules are.
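As a concrete illustration, this is how a Tanimoto similarity can be computed with RDKit (a sketch; Morgan fingerprints of radius 2 are my choice here, not necessarily what the NCATS app uses):

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    # two example molecules: propan-1-ol and propan-2-ol
    mol_a = Chem.MolFromSmiles("CCCO")
    mol_b = Chem.MolFromSmiles("CC(C)O")

    # 2048-bit Morgan fingerprints of radius 2
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)

    # T(A,B) = |A n B| / (|A| + |B| - |A n B|)
    print(DataStructs.TanimotoSimilarity(fp_a, fp_b))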

More information on Human Liver Cytosolic Stability: over the last few decades, chemists have become skilled at designing compounds that avoid cytochrome P450 (CYP450)-mediated metabolism. Typical screening assays are performed in liver microsomal fractions, and it is possible to overlook the contribution of cytosolic enzymes until much later in the drug discovery process. Few data exist on cytosolic enzyme-mediated metabolism, and no reliable tools are available to chemists to help design away from such liabilities. ML models have helped to develop in silico classifiers based on human cytosol stability data to facilitate identification of potential substrates during the lead optimization phase.

pauline-banye commented 1 year ago

> Update: I was finally able to run the ncats-adme model after a lot of struggle. I was repeatedly getting the same error, which is:
>
> Traceback (most recent call last):
>   File "app.py", line 20, in <module>
>     from predictors.rlm.rlm_predictor import RLMPredictior
>   File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 177, in <module>
>     rlm_gcnn_scaler, rlm_gcnn_model, rlm_gcnn_model_version = load_gcnn_model()
>   File "/home/pradnya/ncats-adme/server/predictors/rlm/__init__.py", line 148, in load_gcnn_model
>     rlm_gcnn_scaler, _ = load_scalers(rlm_gcnn_scaler_path)
>   File "/home/pradnya/ncats-adme/server/./predictors/chemprop/chemprop/utils.py", line 132, in load_scalers
>     state = torch.load(path, map_location=lambda storage, loc: storage)
>   File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 585, in load
>     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
>   File "/home/pradnya/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 755, in _legacy_load
>     magic_number = pickle_module.load(f, **pickle_load_args)
> _pickle.UnpicklingError: invalid load key, '<'.
>
> I tried everything, from manually installing every package to going into the depths of the code to find the source of the error. Finally I realized that there was a really simple solution: somehow an auto-downloaded corrupt file was the root cause of the error, and just deleting it solved it.

Good job on debugging @Pradnya2203. Can you provide more context regarding the error? This would assist others that encounter a similar error.

GemmaTuron commented 1 year ago

Hi @Pradnya2203

Thanks, it seems the model downloaded when cloning the repo is corrupt? Or is it the model, when you download it manually, that is corrupt? Automatic download when running the model seems to work fine, which is great! We faced a similar issue with the prediction of human liver metabolism with @pauline-banye; could you check if you are able to run that model as well, or whether you get the same unpickling error? If you have time to combine it with the week 3 tasks, ofc.

Thanks :)

Pradnya2203 commented 1 year ago

Hey @GemmaTuron, the manually downloaded one is not corrupt; it's the one that gets downloaded while setting up the repository. I am unable to run the Human Liver Metabolism model in the browser after running app.py. The only error it gives is: There was an error processing your file. Please make sure you have selected a file that contains SMILES, indicate if the file contains a header and the column number containing the SMILES. But the same input file runs on every other model I checked. For example, it gives the following output when run with the PAMPA Permeability (pH 7.4) model: ADME_Predictions_2023-03-15-070819.csv

GemmaTuron commented 1 year ago

Hi @Pradnya2203 !

Thanks. On the PAMPA model, I see it says PAMPA50, so might it be that you are running PAMPA 5.0, not 7.4? For the Human Liver Metabolism model, that is surprising, since it uses the exact same data loader function (from the Gcnn base class). Can you paste the input file here?

Pradnya2203 commented 1 year ago

Hey @pauline-banye @GemmaTuron, going into detail on the error faced, plus a solution:

Issue

On running python app.py, the checkpoints for each model get downloaded into their specific directories (say models/rlm). If there is an interruption (for example due to network issues, as was my case), the downloaded checkpoint file is corrupted.

This causes the UnpicklingError, and the python app.py command exits; the user is stuck unless they navigate to the corrupted file under models, remove it, and rerun the command.

Solution

I believe an issue should be opened on ncats-adme, as we can implement some sort of error handling (while loading checkpoints) to validate the checkpoint file. If it's corrupted, the model checkpoint should be re-downloaded, OR an appropriate "loss of network connection" error should be posted. A sketch of such a check is below.
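A minimal sketch of the kind of check I mean (a hypothetical helper, not code from ncats-adme):

    import os
    import pickle
    import torch

    def load_checkpoint(path):
        """Load a torch checkpoint, removing it if it is corrupt."""
        try:
            # an interrupted download, or an HTML error page saved in
            # place of the file (hence the '<' load key), fails here
            return torch.load(path, map_location="cpu")
        except (pickle.UnpicklingError, EOFError, RuntimeError):
            os.remove(path)  # delete so the next run re-downloads it
            raise RuntimeError(
                f"{path} is corrupt, likely an interrupted download; "
                "it has been deleted, rerun app.py to re-download it."
            )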

Pradnya2203 commented 1 year ago

> Hi @Pradnya2203 !
>
> Thanks. On the PAMPA model, I see it says PAMPA50, so might it be that you are running PAMPA 5.0, not 7.4? For the Human Liver Metabolism model, that is surprising, since it uses the exact same data loader function (from the Gcnn base class). Can you paste the input file here?

I think it was pampa 7.4 only. This is the input file: input.csv

GemmaTuron commented 1 year ago

Hi @Pradnya2203

Just to be sure, can you once more download PAMPA 7.4 and test it? See whether we get PAMPA74 in the model column or still PAMPA50. I'll collect all these issues and write to the authors to clarify these points, thanks! For the Human Cytosolic Model, you were able to run it from the server app but not by running the app.py file, from what I understand from the comments above, right?

To close off this part:

Pradnya2203 commented 1 year ago

Hey @GemmaTuron

This is the input file: input.csv
This is the prediction using the Human Cytosolic Model: ADME_Predictions_2023-03-22-224133.csv
This is the prediction using PAMPA74: ADME_Predictions_2023-03-22-224316.csv
This is the prediction using PAMPA50: ADME_Predictions_2023-03-22-224417.csv
This is the output of the Ersilia Model Hub implementation of the NCATS Human Cytosolic Model: ncats-hlcs.txt

Pradnya2203 commented 1 year ago

I was able to run the app.py file, but I was unable to run the Human Liver Cytosol Stability model from the server app; I receive the following error when using the same input file as above: There was an error processing your file. Please make sure you have selected a file that contains SMILES, indicate if the file contains a header and the column number containing the SMILES.

Pradnya2203 commented 1 year ago

Week 3: Model Proposal one

Model Name:

ADMET_XGBoost

Model Description:

The absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are important in drug discovery as they define efficacy and safety. In this work, we applied an ensemble of features, including fingerprints and descriptors, and a tree-based machine learning model, extreme gradient boosting, for accurate ADMET prediction. The model performs well in the Therapeutics Data Commons ADMET benchmark group. For 22 tasks, the model is ranked first in 18 tasks and top 3 in 21 tasks.

Task:

Accurate ADMET prediction

Package Dependencies:

python=3.7 rdkit deepchem scikit-learn PyTDC xgboost mordred gensim tensorflow~=2.4 PubChemPy

Publication:

https://paperswithcode.com/paper/accurate-admet-prediction-with-xgboost

Supplementary Information

https://arxiv.org/pdf/2204.07532v3.pdf

Source Code:

https://github.com/smu-tao-group/ADMET_XGBoost

License

GNU General Public License v3.0
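To make the approach concrete, here is a minimal sketch of the fingerprints-plus-XGBoost idea (illustrative only, not the authors' code; the toy data and parameters are made up):

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from xgboost import XGBClassifier

    # toy data: SMILES with made-up binary ADMET labels
    smiles = ["CCO", "CCCOCCC", "c1ccccc1", "CC(=O)O"]
    labels = [1, 0, 1, 0]

    # featurize each molecule as a 2048-bit Morgan fingerprint
    def featurize(smi):
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        arr = np.zeros((2048,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        return arr

    X = np.stack([featurize(s) for s in smiles])
    y = np.array(labels)

    # tree-based gradient boosting classifier on the fingerprints
    model = XGBClassifier(n_estimators=100, max_depth=6)
    model.fit(X, y)
    print(model.predict_proba(X)[:, 1])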

Pradnya2203 commented 1 year ago

Week 3: Model Proposal two

Model Name:

AI-Bind

Model Description:

Identifying novel drug-target interactions (DTI) is a critical and rate-limiting step in drug discovery. AI-Bind is a pipeline that combines network-based sampling strategies with unsupervised pre-training, limiting the annotation imbalance and improving binding predictions for novel proteins and ligands. AI-Bind predicted drugs and natural compounds with binding affinity to SARS-CoV-2 viral proteins and the associated human proteins. These predictions were validated via docking simulations and comparison with recent experimental evidence. AI-Bind also helps interpret machine learning predictions of protein-ligand binding by identifying potential active binding sites on the amino acid sequence. Overall, AI-Bind offers a powerful high-throughput approach to identify drug-target combinations, with the potential of becoming a powerful tool in drug discovery.

Package Dependencies:

requirements.txt

Publication:

https://paperswithcode.com/paper/ai-bind-improving-binding-predictions-for

Supplementary Information:

https://arxiv.org/pdf/2112.13168v5.pdf

Source Code:

https://github.com/chatterjeeayan/ai-bind

Data files:

https://zenodo.org/record/7226641

License:

MIT License

Pradnya2203 commented 1 year ago

Week 3: Model Proposal Three

Model Name:

OpenChem

Model Description:

OpenChem is a deep learning toolkit for Computational Chemistry with PyTorch backend. The goal of OpenChem is to make Deep Learning models an easy-to-use tool for Computational Chemistry and Drug Design Researchers.

Main Features:

- Modular design with a unified API; modules can be easily combined with each other.
- Easy to use: new models are built with only a configuration file.
- Fast training with multi-GPU support.
- Utilities for data preprocessing.
- Tensorboard support.

Package Dependencies:

numpy pyyaml scipy ipython mkl scikit-learn six pytest pytest-cov

Tasks:

- Classification (binary or multi-class)
- Regression
- Multi-task (such as N binary classification tasks)
- Generative models

Publication:

https://pubs.acs.org/doi/full/10.1021/acs.jcim.0c00971

Supplementary Information:

https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.0c00971

Source Code:

https://github.com/Mariewelt/OpenChem

License:

MIT License

GemmaTuron commented 1 year ago

Hi @Pradnya2203 !

Similar to OpenMM, which @samuelmaina has pointed to, OpenChem is a framework to develop models, not a model in itself, so we could not directly incorporate it into the Hub; we would use it to train models and then incorporate those in the Hub. I also don't like the fact that NVIDIA GPUs are required to run OpenChem, since most computers do not have them. But thanks for the suggestion, looking forward to your next ones!

GemmaTuron commented 1 year ago

Hi @Pradnya2203 !

Sorry, I missed the above. ADMET_XGBoost: good catch, it looks interesting, but I fail to see the model checkpoints or the data to retrain the models; is any of this available? AI-Bind: I did not know about this tool; they seem to be developing it intensively (5 updates on arXiv so far!). It looks like a promising approach, but at this moment we cannot incorporate it into the Hub because we cannot pass proteins as input, and I see in the requirements that it will need a GPU to run (we try to avoid serving models that require NVIDIA GPUs, because most people won't have access to them). But I'll keep an eye on the tool and see if we can use it!

GemmaTuron commented 1 year ago

@Pradnya2203 ,

As next steps,

Pradnya2203 commented 1 year ago

Hey @GemmaTuron,

I tried to run REDIAL-2020; it was fairly easy to run, and I used their own sample dataset sample_data.csv

and got the following results: 3CL-sample_data-consensus.csv ACE2-sample_data-consensus.csv AlphaLISA-sample_data-consensus.csv CoV1-PPE_cs-sample_data-consensus.csv CoV1-PPE-sample_data-consensus.csv CPE-sample_data-consensus.csv cytotox-sample_data-consensus.csv hCYTOX-sample_data-consensus.csv MERS-PPE_cs-sample_data-consensus.csv MERS-PPE-sample_data-consensus.csv TruHit-sample_data-consensus.csv

REDIAL-2020 is an open-source, open-access machine learning suite for estimating anti-SARS-CoV-2 activities from molecular structure. Leveraging data available from NCATS, eleven categorical machine learning models were developed: CPE, cytotox, AlphaLISA, TruHit, ACE2, 3CL, CoV-PPE, CoV-PPE_cs, MERS-PPE, MERS-PPE_cs and hCYTOX. These models are exposed on the REDIAL-2020 portal, and the output of a similarity search using the input data as a query is provided for every submitted molecule. The ten most similar molecules to the query molecule from existing COVID-19 databases, together with the associated experimental data, are displayed. This allows users to evaluate the confidence of the machine learning predictions.

Pradnya2203 commented 1 year ago

I tried running ADMET_XGBoost as well; it does have the dataset available, but I fail to see any checkpoints. I tried to find them in the documentation as well but was unable to.