ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: emmakodes #635

Closed emmakodes closed 1 year ago

emmakodes commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

emmakodes commented 1 year ago

Hello, I have been able to install the Ersilia Model Hub on my system without any errors by following this instruction. Here is my system specification:

    Distributor ID: Ubuntu
    Description:    Ubuntu 22.04.1 LTS
    Release:        22.04
    Codename:       jammy

I tested a simple model, eos3b5e, using the following commands:

    ersilia -v fetch eos3b5e
    ersilia serve eos3b5e
    ersilia -v api calculate -i "CCCC"

and got the following output printed on the CLI:

    {
        "input": {
            "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
            "input": "CCCC",
            "text": "CCCC"
        },
        "output": {
            "mw": 58.123999999999995
        }
    }
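As a sanity check, the reported "mw" value can be reproduced by hand. The sketch below assumes that "mw" is the average molecular weight and that "CCCC" is butane (C4H10 once implicit hydrogens are counted). The atomic weights are standard values, and the helper name is only illustrative, not part of Ersilia.

```python
# Standard average atomic weights in g/mol (assumed to match what the model uses).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}


def molecular_weight(composition):
    """Sum atomic weights for a composition given as {element: count}."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in composition.items())


# "CCCC" is butane: four carbons and ten hydrogens.
butane = {"C": 4, "H": 10}
mw = molecular_weight(butane)
print(round(mw, 3))
```

The result agrees with the model output of 58.124 to three decimals; the long tail of digits in the CLI output is ordinary floating-point representation.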

emmakodes commented 1 year ago

emmakodes' Motivation Letter for Contributing to Ersilia

I am a Software Engineer with experience in Data science. I am interested in building software and machine learning solutions to impact human lives. I have participated in machine learning competitions on Kaggle, zindi, and other Data Science Platforms where I did my best to build accurate models which led me to earn top solutions in various competitions. I have experience in building models and deploying those models for use by everyone. My skill set includes: Python, Machine learning, Deep learning, conda, Google colab, Django

I have decided to contribute to Ersilia's project because of the MASSIVE IMPACT it will have in Africa, where I am from, and in every low-resourced country. Ersilia's success will mean a lot to Africa (my place), as I have seen my people die of infectious diseases due to a lack of resources to carry out research to discover cures for these diseases. My people are so in need of Ersilia's solution and its success will mean a lot to us. I will definitely continue to contribute to Ersilia whether I get Outreachy or not because Ersilia's success is a must and I will give my best to help make it succeed.

Contributing to Ersilia will help me improve my skill in Machine Learning as applied to Healthcare. I have always wanted to contribute to healthcare data analytics as I understand the impact it has on human lives, especially in low to medium-income countries.

I plan to leverage my skills in machine learning and software engineering during the internship to help Ersilia expand its library of AI/ML models available, by adding models identified in the literature and/or training new models where necessary. I also intend to continue to contribute to Ersilia even after the internship and pursue further studies in building data science tools for infectious and neglected disease research.

GemmaTuron commented 1 year ago

Hi @emmakodes

Welcome to the contribution period!

emmakodes commented 1 year ago

Thanks @GemmaTuron glad to contribute

emmakodes commented 1 year ago

WEEK 2: Run an ML model

Task 1 select a model

Hello @GemmaTuron, I have selected the Plasma Protein Binding model (IDL-PPBopt): https://github.com/Louchaofeng/IDL-PPBopt

For this IDL-PPBopt model, the objective is to predict the plasma protein binding (PPB *) property based on an interpretable deep learning method.

I am interested in its application and motivated to study the PPB values of molecular compounds, to learn whether drugs that enter the body and interact with plasma proteins will remain bound to the plasma proteins or remain free and find their pharmacological target.

emmakodes commented 1 year ago

WEEK 2: Run an ML model

Task 2 install the model

  1. I created a virtual environment 'ppb' and activated it using the following commands:

    conda create -n ppb python=3.7
    conda activate ppb

  2. Then I installed the following packages, since the model requires them to run:

    pip install rdkit==2022.9.4
    pip install scipy==1.7.3
    pip install scikit-learn==1.0.2
    pip install pandas==1.3.5
    pip install matplotlib==3.5.3
    conda install openbabel=2.4.1 -c conda-forge
    conda install pytorch==1.5.0 torchvision==0.6.0 cpuonly -c pytorch

I noticed that some of the required packages don't have a version specified, so I had to pick the versions manually. I also used 'conda install pytorch==1.5.0 torchvision==0.6.0 cpuonly -c pytorch' because I only have a CPU and Ersilia models are mostly run on CPU.

[4 images]

emmakodes commented 1 year ago

Task 3: run predictions i

Then I located 'AttentiveLayers.py' and changed 'torch.cuda.FloatTensor' to 'torch.FloatTensor' in 'FingerPrint' class since my code in 'mainfile.py' is dependent on it.

I ran 'mainfile.py' successfully and it outputted the following prediction(csv): temp.csv

image

GemmaTuron commented 1 year ago

Hi @emmakodes

Please avoid screenshots as they are difficult to read. Your next steps should be to explain what the model is doing and what output you are getting.

emmakodes commented 1 year ago

Hello @GemmaTuron

Thanks for the correction. I am currently working on explaining what the model is doing and the output I am getting

emmakodes commented 1 year ago

Task 3: run predictions ii

I tried running predictions for the Essential Medicines List (EML), but the run kept ending with a 'Killed' message. It seems my PC does not have enough capacity to run predictions for that many rows.

SOLUTION

I split the data in the EML csv file into five different csv files and then successfully got predictions for the molecules. Here are the predictions (outputs): eml_canonical_predictions.csv
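The splitting step can be sketched as follows. This is a minimal example using only the standard library; the function name, the output file naming, and the assumption that the EML file has a single header row are illustrative and not taken from the original workflow.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, parts=5):
    """Split `source` into at most `parts` CSV files, repeating the header in each."""
    source, out_dir = Path(source), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with source.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    chunk_size = -(-len(rows) // parts)  # ceiling division
    written = []
    for index in range(parts):
        chunk = rows[index * chunk_size:(index + 1) * chunk_size]
        if not chunk:
            break  # fewer non-empty chunks than requested parts
        target = out_dir / f"{source.stem}_part{index + 1}.csv"
        with target.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(chunk)
        written.append(target)
    return written
```

Each part can then be passed to the model separately and the prediction files concatenated afterwards, which keeps the number of molecules processed in one run small.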

The IDL-PPBopt model predicts and optimizes the plasma protein binding (PPB) of a compound using an interpretable deep learning method.

The authors trained a deep learning model with the AttentiveFP algorithm, passing canonicalized SMILES as input, and saved the model in the "saved_models" file to predict PPB values.

The predicted values mostly fall in the range 0.0 to 1.0 (although some values fall below or above this range). As percentages, this is 0 to 100%.

So, a drug with a PPB value greater than 80% has a high affinity for plasma proteins and is mostly bound; otherwise it is less bound.

GemmaTuron commented 1 year ago

Hi @emmakodes !

Is the model a classifier or a regression? If the result is 0.7, would this be considered a 70% affinity from your explanation? Thanks :)

emmakodes commented 1 year ago

Hello @GemmaTuron

The model is a regression model. It outputs the PPB fraction. The authors evaluated it with measures like RMSE, which is mostly used for regression tasks.

Yes, a 0.7 result will be considered a 70% affinity
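The conversion discussed here is simple enough to sketch. The helper names are hypothetical, and the 80% cut-off is the one quoted earlier in this thread, not a value taken from the model code.

```python
HIGH_BINDING_THRESHOLD = 80.0  # percent; cut-off quoted in this thread


def ppb_percent(fraction):
    """Convert a predicted PPB fraction (roughly 0.0 to 1.0) to a percentage."""
    return fraction * 100.0


def is_highly_bound(fraction, threshold=HIGH_BINDING_THRESHOLD):
    """True when the predicted binding exceeds the threshold percentage."""
    return ppb_percent(fraction) > threshold


for value in (0.7, 0.97662824):
    print(f"{value} -> {ppb_percent(value):.1f}% bound, highly bound: {is_highly_bound(value)}")
```

Because the model is a regressor, predictions slightly below 0.0 or above 1.0 are possible; the sketch leaves them unclipped so that such cases stay visible.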

GemmaTuron commented 1 year ago

Hi @emmakodes !

Thanks for the explanation, indeed this is a regression outputting the % of PPB. Can you compare the results of the model you installed with what the Ersilia Model Hub is giving, to make sure we implemented it correctly? Next, if you finish before the end of the week, I'd like you to try and install the NCATS models and run predictions with the PAMPA 7.4 model (see the development branch to find the model) -- @pauline-banye found inconsistencies when using this model and I want to make sure if that is the case

thanks!

emmakodes commented 1 year ago

Thanks, @GemmaTuron for the feedback. I will continue working on the given task

emmakodes commented 1 year ago

Hello @GemmaTuron, when I try to fetch the eos22io model (Plasma Protein Binding model (IDL-PPBopt)) from the Ersilia Model Hub to compare it with the predictions of my installed model, I get the following error (basically it complains: No module named 'torch'):

(ersilia) emma@DESKTOP-OI0BCU0:~/code/ersiliamain$ ersilia fetch eos22io
⬇️  Fetching model eos22io: idl-ppbopt
Checking setup: 1.518s
Preparing model: 19.72074866294861s
Getting model: 22.12224292755127s
Packing model: 2030.823763370514s
Checking if model needs to be integrated to a tool: 0.5035126209259033s
Getting model card: 7.391462087631226s
Checking that autoservice works: 17.67511010169983s
 88%|█████████████████████████████████████████████████████████████████████████▌          | 7/8 [34:59<03:58, 238.15s/it]🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

Ersilia exception class:
EmptyOutputError

Detailed error:
Model API eos22io:run did not produce an outputTraceback (most recent call last):
  File "/home/emma/eos/repository/eos22io/20230314180747_74AEEB/eos22io/artifacts/framework/code/main.py", line 8, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

Hints:
- Visit the fetch troubleshooting site

If this error message is not helpful, open an issue at:
 - https://github.com/ersilia-os/ersilia
Or feel free to reach out to us at:
 - hello[at]ersilia.io

If you haven't, try to run your command in verbose mode (-v in the CLI)
 - You will find the console log file in: /home/emma/eos/current.log

emmakodes commented 1 year ago

@GemmaTuron

For the NCATS models, setup takes a very long time on my PC and most of the time my system just hangs. So I installed the needed packages individually. Currently I get the following error, which I am trying to fix.

(/home/emma/code/ncat/ncats-adme/server/env) emma@DESKTOP-OI0BCU0:~/code/ncat/ncats-adme/server$ python app.py
Loading PAMPA graph convolutional neural network model
Model File Exists Locally
Traceback (most recent call last):
  File "app.py", line 22, in <module>
    from predictors.pampa.pampa_predictor import PAMPAPredictior
  File "/home/emma/code/ncat/ncats-adme/server/predictors/pampa/__init__.py", line 22, in <module>
    pampa_gcnn_scaler, pampa_gcnn_model, pampa_gcnn_model_version = load_gcnn_model_with_versioninfo(pampa_model_file_path, pampa_model_file_url)
  File "/home/emma/code/ncat/ncats-adme/server/predictors/utilities/utilities.py", line 87, in load_gcnn_model_with_versioninfo
    gcnn_scaler, _ = load_scalers(model_file_path)
  File "/home/emma/code/ncat/ncats-adme/server/./predictors/chemprop/chemprop/utils.py", line 132, in load_scalers
    state = torch.load(path, map_location=lambda storage, loc: storage)
  File "/home/emma/code/ncat/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/emma/code/ncat/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 755, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

GemmaTuron commented 1 year ago

Hi @emmakodes ,

thanks for the work, a few pointers:

emmakodes commented 1 year ago

@GemmaTuron Yes, I finally found a solution to the NCATS error. The model file was faulty, so I had to download it manually from: https://opendata.ncats.nih.gov/public/adme/models/archived/pampa/gcnn_model-20230201.pt

P.S. The above model is for PAMPA pH 7.4.

Here is the prediction for PAMPA pH 7.4 using the data Ersilia provides to test the models: eml_canonical_PAMPA7.4_ADME_Predictions_2023-03-17-083932.csv
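An "invalid load key, '<'" error from torch.load usually suggests that the file on disk starts with a "<" character, which is typical of an HTML or XML error page saved in place of the real model. The sketch below is one way to check this before loading; the function name and the returned labels are hypothetical, and the byte signatures are assumptions based on common file formats ("PK" for zip archives, 0x80 for pickle streams).

```python
def describe_checkpoint(path):
    """Guess what a supposed PyTorch checkpoint contains from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(64)

    if head.startswith(b"PK"):
        return "zip"  # zip-based format used by newer torch.save
    if head.startswith(b"\x80"):
        return "legacy-pickle"  # pickle stream written by older torch.save
    if head.lstrip().startswith(b"<"):
        return "html-or-xml"  # most likely a failed download
    return "unknown"
```

If the result is "html-or-xml", downloading the file again (as done above) is more likely to help than changing the PyTorch version.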

emmakodes commented 1 year ago

Hi @emmakodes ,

thanks for the work, a few pointers:

  • eos22io model: do you have the log file of the fetching? It seems something was not properly downloaded, maybe an internet issue? Please provide this (use the -v flag at fetch time and collect the log as .txt).
  • NCATS models: I am unsure if you are actually downloading the model and placing it in the right folder? Same as Zakia and Pradnya

Okay I will do this immediately and give feedback

GemmaTuron commented 1 year ago

@emmakodes that's good progress! Can I ask you to run predictions again for the same molecules using PAMPA 7.4 and see if the results are the same? @pauline-banye tagging you here so you can also follow up. In short: Pauline got different results when running PAMPA 7.4 predictions several times.

pauline-banye commented 1 year ago

This is interesting, seems the Pampa 7 pt model has been updated. I'm excited to see your results. Great job

emmakodes commented 1 year ago

Hello @GemmaTuron @pauline-banye

I ran five different predictions and the results are the same. Here are the prediction files:

  • eml_canonical_PAMPA7.4_ADME_Predictions_2023-03-17-083932.csv
  • ADME_Predictions_2023-03-17-090932_no2.csv
  • ADME_Predictions_2023-03-17-091048_no3.csv
  • ADME_Predictions_2023-03-17-091303_no4.csv
  • ADME_Predictions_2023-03-17-091349.csv
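A reproducibility check like this one can also be automated. The sketch below compares the rows of several prediction files exactly; the function names are hypothetical and it assumes every run writes the same columns in the same order.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with Path(path).open(newline="") as handle:
        return list(csv.reader(handle))


def all_runs_identical(paths):
    """Return True when every CSV in `paths` has exactly the same rows as the first."""
    reference = load_rows(paths[0])
    return all(load_rows(path) == reference for path in paths[1:])
```

Calling all_runs_identical with the five file names listed above would return True only if every value matches character for character, so even small numerical differences between runs would be caught.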

emmakodes commented 1 year ago

Hello @GemmaTuron, here is the log file for the eos22io fetch issue: my_eos22io_log.txt

What I have tried recently: I activated the model environment and installed torch manually, but I still got the same error. I also ran the model on Google Colab and got the same error.
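When a manual install does not make a ModuleNotFoundError go away, one thing worth checking is whether the package is visible to the interpreter that actually runs the model. The sketch below is a generic check, not part of Ersilia; it would need to be run with the Python executable of the model's environment, and the helper name is hypothetical.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level modules from `names` that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Shows which interpreter is in use and whether torch is importable from it.
print("interpreter:", sys.executable)
print("missing:", missing_modules(["torch"]))
```

If the printed interpreter path is not inside the model's environment, the package was probably installed somewhere else.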

GemmaTuron commented 1 year ago

mmm thanks @emmakodes ! @carcablop does this model run on colab? I think we tested it right?

For your next steps @emmakodes:

GemmaTuron commented 1 year ago

Actually, indeed there seems to be a version mismatch currently that does not allow the eos22io to work properly! @carcablop see the latest run when I updated the metadata in airtable: https://github.com/ersilia-os/eos22io/actions/runs/4468049751/jobs/7848292761

carcablop commented 1 year ago

Hello @GemmaTuron. Yes, we tested this model both in the CLI and in COLAB and it worked fine. It was tested by Pauline and Dhanshree too. I'll check with the latest update. Thanks Gemma.

carcablop commented 1 year ago

Hi @GemmaTuron I have changed the way I install torch, using pip, instead of conda, as this was causing problems when trying to use the conda channel. Here are the changes. https://github.com/ersilia-os/eos22io/pull/4

emmakodes commented 1 year ago

Model:

HDAC3i-Finder: A Machine Learning-based Computational Tool to Screen for HDAC3 Inhibitors

Model Description

The trained HDAC3i-Finder model helps to screen compounds for HDAC3 inhibitors. Histone deacetylase 3 (HDAC3) is a prospective drug target for the treatment of human diseases such as cancer.

Summary

The authors used machine learning to create a model that identifies HDAC3 inhibitors, using a training set of 1098 compounds. They computed three different sets of molecular features for each compound, and five machine-learning classifiers were trained on each feature set. The best-performing model was based on the Morgan2 fingerprints and achieved a high early ROC enrichment. Further retrospective screening of an annotated chemical library in PubChem identified 8 novel-scaffold HDAC3 inhibitors while assaying only 1% of the compounds. The authors also developed a Python GUI application named HDAC3i-Finder to facilitate prospective screening for HDAC3 inhibitors.

Slug

hdac3i-finder

Publication:

https://onlinelibrary.wiley.com/doi/10.1002/minf.202000105

Github Repository:

https://github.com/jwxia2014/HDAC3i-Finder

License

GPL-3.0 license

emmakodes commented 1 year ago

Model:

S2DV: converting SMILES to a drug vector for predicting the activity of anti-HBV small molecules

Model Description

The machine learning model based on training results can effectively predict the inhibitory effect of compounds on HBV and liver toxicity.

Summary

The model adopts a natural language processing technique to analyze compound simplified molecular input line entry system (SMILES) strings, enabling an accurate representation of the relationship between compounds and their substructures. This technique helps the model predict inhibitory effects on HBV and liver toxicity, which were verified by wet-lab experiments. The model provides a new perspective on predicting compound properties for anti-HBV drugs, with the aim of improving hepatitis B treatment and human health in the future.

Slug:

s2dv

Publication:

https://pubmed.ncbi.nlm.nih.gov/35062019/

Source Code

https://github.com/NTU-MedAI/S2DV

License

Apache-2.0 license

Package Dependencies

rdkit, pickle, numpy, streamlit (optional)

Task

Classification

emmakodes commented 1 year ago

Thanks @GemmaTuron @carcablop I just ran eos22io model on google colab successfully. The No module named torch issue is fixed. Here is the prediction output for Essential Medicines List: eos22io_output.csv

The eos22io prediction output for the Essential Medicines List is mostly the same as that of my own installed model. The only slight difference I see is that the decimals are rounded slightly differently. Example:

    eos22io prediction    My own installed model prediction
    0.477624              0.47762394
    0.97662824            0.9766282
    0.5828323             0.58283216
    0.0710884             0.071088396
    0.6505308             0.6505309
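To put a number on "mostly the same", the pairs from the comparison above can be checked with a tolerance. This is a minimal sketch: the tolerance of 1e-6 is an assumption chosen for illustration, and differences of this size are what one would typically expect from single-precision values being printed with a different number of digits.

```python
import math

# Pairs copied from the comparison above: (eos22io, locally installed model).
PAIRS = [
    (0.477624, 0.47762394),
    (0.97662824, 0.9766282),
    (0.5828323, 0.58283216),
    (0.0710884, 0.071088396),
    (0.6505308, 0.6505309),
]


def max_abs_difference(pairs):
    """Largest absolute difference between the two predictions of any pair."""
    return max(abs(a - b) for a, b in pairs)


def all_close(pairs, abs_tol=1e-6):
    """True when every pair agrees within the absolute tolerance."""
    return all(math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol) for a, b in pairs)


print("largest difference:", max_abs_difference(PAIRS))
print("all within 1e-6:", all_close(PAIRS))
```

For these five pairs the largest difference is below 2e-7, so the two installations agree to about six decimal places.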

GemmaTuron commented 1 year ago

Hi @emmakodes !

About model eos22io, fantastic, good job on fixing the bug @carcablop, and thanks for testing @emmakodes ! For the model suggestions:

As next steps, could I ask you to: (in order of priority)

  1. Add both models to our "model suggestion list" here
  2. Find a third model
  3. If time allows, try to run one of the models, perhaps let's start with the S2DV?

emmakodes commented 1 year ago

Hi @emmakodes !

About model eos22io, fantastic, good job on fixing the bug @carcablop, and thanks for testing @emmakodes ! For the model suggestions:

  • HDAC2 seems very straightforward to implement, I like it. HDAC2 is mainly involved in the development of cancerous lesions, so it is not directly related to our mission, which is the only reason I'd deprioritize it in our list of models to be incorporated
  • S2DV: I like the rationale behind the model, concerned about the little information available on their GitHub repo!

As next steps, could I ask you to: (in order of priority)

  1. Add both models to our "model suggestion list" here
  2. Find a third model
  3. If time allows, try to run one of the models, perhaps let's start with the S2DV?

Hello @GemmaTuron I have added both models to the "model suggestion list". I will work on implementing the S2DV model immediately and I will give feedback here.

emmakodes commented 1 year ago

Model:

REDIAL-2020: models predict Anti-SARS-CoV-2 Activities (Live Virus Infectivity, Viral Entry, Viral Replication, In vitro Infectivity, Human Cell Toxicity) and also perform Similarity Search

Model Description

The Models were trained using data available from NCATS. These models predict anti-SARS-CoV-2 activities from molecular structure. It makes predictions for Live Virus Infectivity, Viral Entry, Viral Replication, In vitro Infectivity, and Human Cell Toxicity. It also performs Similarity Search where it displays the most similar molecules to the input query, as well as shows associated experimental data.

Slug

redial-2020

Publication

https://www.nature.com/articles/s42256-021-00335-w#Sec9

Github Repository: https://github.com/sirimullalab/redial-2020/tree/v1.0

License:

MIT License

GemmaTuron commented 1 year ago

Hi @emmakodes !

Good model, can you please add it to the model suggestion list? Thanks!

emmakodes commented 1 year ago

Thanks, @GemmaTuron I will do that immediately

GemmaTuron commented 1 year ago

Let me know about the S2DV when you have updates, thanks!

emmakodes commented 1 year ago

Okay, I will do that.

I have added REDIAL-2020 to the model suggestion list

emmakodes commented 1 year ago

Hello @GemmaTuron I have run S2DV model successfully on my system and here is the update:

Model:

S2DV

Model input

SMILES

Model Output:

  • The inhibitory effect of the compound on HBV
  • Toxicity on HepG2
  • Whether it can be used as a potential HBV drug

How to execute and test the model locally on system:

  1. Create a conda environment: conda create -n s2dv2 python=3.7

  2. Activate the conda environment: conda activate s2dv2

  3. Install the dependencies:

    pip install scikit-learn==0.24.2
    pip install xgboost==1.4.1
    pip install rdkit==2022.9.4

  4. Clone the repository of the original model: git clone https://github.com/NTU-MedAI/S2DV.git

  5. Open the folder containing the model in Visual Studio Code or any code editor, then open the file S2DV_main.py. Comment out 'import streamlit as st' and the 'web_demo' function, since we don't need a web interface. Also change:

    if __name__ == '__main__':
        web_demo()

    to:

    if __name__ == '__main__':
        main()

The author provided an example SMILES input, 'Nc1cc(OCCOCP(=O)(O)O)nc(N)n1', to test the model.

  6. On the terminal, change the directory to the 'S2DV' folder where the 'S2DV_main.py' file is located: cd S2DV

  7. Run the following on the terminal to predict the inhibitory effect of 'Nc1cc(OCCOCP(=O)(O)O)nc(N)n1' on HBV, its toxicity on HepG2, and whether it can be used as a potential HBV drug: python S2DV_main.py

The model returns the following output:

Entered SMILES: Nc1cc(OCCOCP(=O)(O)O)nc(N)n1
The inhibitory effect on HBV is predicted to be:Low inhibition rate, IC50 higher than 1uM
Toxicity on HepG2 is predicted to be: Low toxicity, CC50 higher than 30uM
Whether it can be used as a potential HBV drug: Does not have the potential to be made into a drug

P.S.: The authors wrote some statements in the file 'S2DV_main.py' in Chinese, so I had to translate them into English.

emmakodes commented 1 year ago

Hello @GemmaTuron this is out of scope. I just wanted to say thank you for your awesome guidance in helping us contribute to Ersilia. You are an awesome mentor. I have always wanted to apply machine learning to drug discovery and right now Ersilia is making that dream possible.

GemmaTuron commented 1 year ago

Hi @emmakodes !

That's great thanks for testing the model. Funny about the Chinese statements! The final model output is very clear but it would make it difficult to incorporate in the Hub (better to have only numbers). Can you identify what is the actual outcome? I'm guessing a probability for the inhibitory HBV and toxicity? If we have one number we could work on incorporating this in the hub!

emmakodes commented 1 year ago

Hello @GemmaTuron

Yes, the actual outcome is a probability. The author only used statements instead of using 0 and 1 to represent inhibitory HBV and toxicity.

The following is how the author maps the predictions to statements in the code.

    if HBV_predict:
        HBV_result = 'High inhibition rate, IC50 lower than 1uM'
    else:
        HBV_result = 'Low inhibition rate, IC50 higher than 1uM'

    if HepG2_predict:
        HepG2_result = 'High toxicity, CC50 below 30uM'
    else:
        HepG2_result = 'Low toxicity, CC50 higher than 30uM'

We can easily rewrite the above as follows to return numbers instead:

    if HBV_predict:
        HBV_result = 1
    else:
        HBV_result = 0

    if HepG2_predict:
        HepG2_result = 1
    else:
        HepG2_result = 0

We can also just stick to predicting for inhibitory HBV or toxicity.

GemmaTuron commented 1 year ago

Thanks @emmakodes ! The best would be to actually get the probability, since it helps rank compounds (for example, two compounds might be a 1, but with 0.5 proba and 0.99 proba, so as a researcher I'd like the 0.99 !) Can you identify the piece of code that is converting the probability to the statement?

emmakodes commented 1 year ago

You are right @GemmaTuron

Here is the code from the author that predicts and returns the actual label(Y_predict) and also returns the probability prediction(Y_predict_p):

def vec_predict(vec,models,ML_model):
    for model_name,model in models:
        if model_name == ML_model:
            print(vec)
            print(len(vec))
            Y_predict = model.predict(vec.reshape(1, -1))
            Y_predict_p = model.predict_proba(vec.reshape(1, -1))[:, 1]

    return Y_predict, Y_predict_p

The author only decided to work with 'Y_predict' which returns the actual prediction to determine the statement.

In our case, we can just return 'Y_predict_p' (the probability) for a compound.

So for Nc1cc(OCCOCP(=O)(O)O)nc(N)n1 compound, the model returned 0.15674622 as the probability.
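To illustrate why the probability is more useful than the label, the sketch below ranks compounds by probability and derives a binary label only as a secondary value. The two extra compounds and their scores are hypothetical placeholders, and the 0.5 threshold is an assumption about how the label is obtained, not something confirmed from the S2DV code.

```python
def rank_by_probability(scores):
    """Sort (compound, probability) pairs from highest to lowest probability."""
    return sorted(scores, key=lambda item: item[1], reverse=True)


def to_label(probability, threshold=0.5):
    """Binary label from a probability; the 0.5 threshold is an assumption."""
    return int(probability > threshold)


scores = [
    ("Nc1cc(OCCOCP(=O)(O)O)nc(N)n1", 0.15674622),  # value reported in this thread
    ("hypothetical_compound_A", 0.55),              # made-up score for illustration
    ("hypothetical_compound_B", 0.99),              # made-up score for illustration
]

ranked = rank_by_probability(scores)
for compound, probability in ranked:
    print(compound, probability, to_label(probability))
```

Both hypothetical compounds would receive the label 1, but only the probability shows that one of them is a much stronger candidate than the other.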

GemmaTuron commented 1 year ago

great @emmakodes !

Would you want to give it a try to incorporate this in the Hub? I suggest incorporating it as two different models, one for HBV prediction and one for HepG2. If you are up to this, for the last week of the contribution period, could you open a Model Request issue?

emmakodes commented 1 year ago

Okay @GemmaTuron I will do that immediately and I will give feedback.

GemmaTuron commented 1 year ago

The former outreachy interns prepared a very detailed guide of what the process looks like: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/contribute-models/example-of-the-model-incorporation-workflow

Make sure to read through to understand the whole process

emmakodes commented 1 year ago

Okay thanks @GemmaTuron I will do that