Closed emmakodes closed 1 year ago
Hello, I have been able to install Ersilia Model Hub on my system without any error by following the installation instructions. Here is my system specification:
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
I tested a simple model, eos3b5e, using the following commands:
ersilia -v fetch eos3b5e
ersilia serve eos3b5e
ersilia -v api calculate -i "CCCC"
and got the following output printed on the CLI:
{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "mw": 58.123999999999995
    }
}
emmakodes' Motivation Letter for Contributing to Ersilia
I am a Software Engineer with experience in Data Science. I am interested in building software and machine learning solutions to impact human lives. I have participated in machine learning competitions on Kaggle, Zindi, and other Data Science platforms, where I did my best to build accurate models, which led me to earn top solutions in various competitions. I have experience in building models and deploying those models for use by everyone. My skill set includes: Python, Machine learning, Deep learning, conda, Google Colab, Django
I have decided to contribute to Ersilia's project because of the MASSIVE IMPACT it will have in Africa, where I am from, and in every low-resourced country. Ersilia's success will mean a lot to Africa (my home), as I have seen my people die of infectious diseases due to a lack of resources to carry out research to discover cures for these diseases. My people are so in need of Ersilia's solution and its success will mean a lot to us. I will definitely continue to contribute to Ersilia whether I get Outreachy or not, because Ersilia's success is a must and I will give my best to help make it succeed.
Contributing to Ersilia will help me improve my skill in Machine Learning as applied to Healthcare. I have always wanted to contribute to healthcare data analytics as I understand the impact it has on human lives, especially in low to medium-income countries.
I plan to leverage my skills in machine learning and software engineering during the internship to help Ersilia expand its library of AI/ML models available, by adding models identified in the literature and/or training new models where necessary. I also intend to continue to contribute to Ersilia even after the internship and pursue further studies in building data science tools for infectious and neglected disease research.
Hi @emmakodes
Welcome to the contribution period!
Thanks @GemmaTuron glad to contribute
Hello @GemmaTuron, I am selecting to work on Plasma Protein Binding model (IDL-PPBopt): https://github.com/Louchaofeng/IDL-PPBopt
For this IDL-PPBopt model, the objective is to predict the plasma protein binding (PPB) property of a compound using an interpretable deep learning method.
I am interested in its application: knowing the PPB value of a compound tells us whether a drug, once it enters the body and interacts with plasma proteins, will remain bound to them or remain free to find its pharmacological target.
I created a virtual environment 'ppb' and activated it using the following commands:
conda create -n ppb python=3.7
conda activate ppb
Then I installed the following packages, since the model requires them to run:
pip install rdkit==2022.9.4
pip install scipy==1.7.3
pip install scikit-learn==1.0.2
pip install pandas==1.3.5
pip install matplotlib==3.5.3
conda install openbabel=2.4.1 -c conda-forge
conda install pytorch==1.5.0 torchvision==0.6.0 cpuonly -c pytorch
I noticed that some packages are not pinned to a particular version, so I had to choose the versions manually. Also, I used 'conda install pytorch==1.5.0 torchvision==0.6.0 cpuonly -c pytorch' since I only have a CPU and Ersilia models are mostly run on CPU.
Task 3: run predictions
To run the prediction, I located the main file to run the code 'IDL-PPBopt.ipynb'
I wanted to run the code from the terminal, so I extracted the code needed to get prediction and pasted it to a new python file 'mainfile.py'.
I commented out the lines of code in 'mainfile.py' that were still dependent on IPython and cairosvg.
Since I have only a cpu, I made the following changes to the code:
I added the following line of code after some import statements to use the CPU:
device = torch.device('cpu')
Changed the default floating point tensor type from:
torch.set_default_tensor_type('torch.cuda.FloatTensor')
to:
torch.set_default_tensor_type('torch.FloatTensor')
Commented out the following line of code:
torch.backends.cudnn.benchmark = True
Changed the following line of code to use the previously configured device:
model.cuda()
to:
model.to(device)
Changed the following line of code:
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-222020'+'54'+'.pt')
to the following line of code to use the CPU:
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-222020'+'54'+'.pt',map_location=torch.device('cpu'))
Changed everything that instantiated CUDA tensors, torch.cuda.LongTensor and torch.cuda.FloatTensor, to torch.LongTensor and torch.FloatTensor respectively.
Then I located 'AttentiveLayers.py' and changed 'torch.cuda.FloatTensor' to 'torch.FloatTensor' in the 'FingerPrint' class, since my code in 'mainfile.py' depends on it.
I ran 'mainfile.py' successfully and it outputted the following prediction(csv): temp.csv
Hi @emmakodes
Please avoid screenshots as they are difficult to read. Your next steps should be to explain what is the model doing and what is the output you are getting
Hello @GemmaTuron
Thanks for the correction. I am currently working on explaining what the model is doing and the output I am getting
I tried running predictions for the Essential Medicines List but the process kept ending with a 'Killed' message. It seems my PC does not have enough capacity to run predictions for that number of rows in one go.
I split the data in the EML csv file into five different csv files and then successfully got predictions for the molecules. Here are the predictions (outputs): eml_canonical_predictions.csv
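The splitting step can be sketched as follows. This is only a minimal example, assuming the input is a CSV with a single header row; the function name `split_csv` and the output file naming are hypothetical and not taken from the thread.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, n_parts=5):
    """Split a CSV with a header row into at most n_parts smaller files."""
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with src.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Ceiling division so that every row lands in exactly one part.
    size = max(1, -(-len(rows) // n_parts))
    paths = []
    for start in range(0, len(rows), size):
        part = out_dir / f"{src.stem}_part{start // size + 1}.csv"
        with part.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # repeat the header in each part
            writer.writerows(rows[start:start + size])
        paths.append(part)
    return paths
```

Each part can then be passed to the prediction script on its own, which keeps peak memory lower than loading the whole list at once.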
The IDL-PPBopt model predicts and optimizes the plasma protein binding (PPB) of a compound using an interpretable deep learning method.
The authors trained a deep learning model with the AttentiveFP algorithm, passing canonicalized SMILES as input, and saved the model in the "saved_models" folder to predict PPB values.
The predictions mostly fall in the range 0.0 to 1.0 (although there are cases with values below or above this range). In percentages, this is 0 to 100%.
So, a drug with a PPB value greater than 80% has a high affinity for plasma proteins and is mostly bound; otherwise, more of it remains free.
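As a small illustration of this interpretation, the sketch below converts a predicted fraction to a percentage and applies the 80% cutoff mentioned above. Clipping out-of-range predictions to [0, 1] is an assumption made here for reporting purposes, and `ppb_summary` is a hypothetical helper, not part of IDL-PPBopt.

```python
def ppb_summary(fraction, high_binding_cutoff=80.0):
    """Convert a predicted PPB fraction to a percentage and flag high binding."""
    # Assumption: regression outputs slightly outside [0, 1] are clipped.
    clipped = min(1.0, max(0.0, fraction))
    percent = clipped * 100.0
    return {"ppb_percent": percent, "high_binding": percent > high_binding_cutoff}
```

For example, `ppb_summary(0.95)` reports a highly bound compound, while `ppb_summary(0.7)` does not cross the cutoff.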
Hi @emmakodes !
Is the model a classifier or a regression? If the result is 0.7, would this be considered a 70% affinity from your explanation? Thanks :)
Hello @GemmaTuron
The model is a regression task. It outputs the PPB fraction. They used measures like RMSE to evaluate the model which is mostly used for regression tasks.
Yes, a 0.7 result will be considered a 70% affinity
Hi @emmakodes !
Thanks for the explanation, indeed this is a regression outputting the % of PPB. Can you compare the results of the model you installed with what the Ersilia Model Hub is giving, to make sure we implemented it correctly? Next, if you finish before the end of the week, I'd like you to try and install the NCATS models and run predictions with the PAMPA 7.4 model (see the development branch to find the model) -- @pauline-banye found inconsistencies when using this model and I want to make sure if that is the case
thanks!
Thanks, @GemmaTuron for the feedback. I will continue working on the given task
Hello @GemmaTuron, when I try to fetch the eos22io model (Plasma Protein Binding model (IDL-PPBopt)) from the Ersilia Model Hub so as to compare it with my installed model's predictions, I get the following error (basically it's complaining of "No module named 'torch'"):
(ersilia) emma@DESKTOP-OI0BCU0:~/code/ersiliamain$ ersilia fetch eos22io
⬇️ Fetching model eos22io: idl-ppbopt
Checking setup: 1.518s
Preparing model: 19.72074866294861s
Getting model: 22.12224292755127s
Packing model: 2030.823763370514s
Checking if model needs to be integrated to a tool: 0.5035126209259033s
Getting model card: 7.391462087631226s
Checking that autoservice works: 17.67511010169983s
88%|█████████████████████████████████████████████████████████████████████████▌ | 7/8 [34:59<03:58, 238.15s/it]🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨
Error message:
Ersilia exception class:
EmptyOutputError
Detailed error:
Model API eos22io:run did not produce an outputTraceback (most recent call last):
File "/home/emma/eos/repository/eos22io/20230314180747_74AEEB/eos22io/artifacts/framework/code/main.py", line 8, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
Hints:
- Visit the fetch troubleshooting site
If this error message is not helpful, open an issue at:
- https://github.com/ersilia-os/ersilia
Or feel free to reach out to us at:
- hello[at]ersilia.io
If you haven't, try to run your command in verbose mode (-v in the CLI)
- You will find the console log file in: /home/emma/eos/current.log
@GemmaTuron
For NCATS models, It takes a large amount of time to set up on my pc and most times my system just hangs up. So, I installed the needed packages individually. Currently, I have the following error which I'm looking to find a fix.
(/home/emma/code/ncat/ncats-adme/server/env) emma@DESKTOP-OI0BCU0:~/code/ncat/ncats-adme/server$ python app.py
Loading PAMPA graph convolutional neural network model
Model File Exists Locally
Traceback (most recent call last):
File "app.py", line 22, in <module>
from predictors.pampa.pampa_predictor import PAMPAPredictior
File "/home/emma/code/ncat/ncats-adme/server/predictors/pampa/__init__.py", line 22, in <module>
pampa_gcnn_scaler, pampa_gcnn_model, pampa_gcnn_model_version = load_gcnn_model_with_versioninfo(pampa_model_file_path, pampa_model_file_url)
File "/home/emma/code/ncat/ncats-adme/server/predictors/utilities/utilities.py", line 87, in load_gcnn_model_with_versioninfo
gcnn_scaler, _ = load_scalers(model_file_path)
File "/home/emma/code/ncat/ncats-adme/server/./predictors/chemprop/chemprop/utils.py", line 132, in load_scalers
state = torch.load(path, map_location=lambda storage, loc: storage)
File "/home/emma/code/ncat/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 585, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/emma/code/ncat/ncats-adme/server/env/lib/python3.8/site-packages/torch/serialization.py", line 755, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
@GemmaTuron Yes, I finally found a solution to this. The model file was faulty, so I had to download it manually from: https://opendata.ncats.nih.gov/public/adme/models/archived/pampa/gcnn_model-20230201.pt P.S Above model is for PAMPA pH 7.4
Here is the prediction for PAMPA pH 7.4 using data Ersilia provides to test the models: eml_canonical_PAMPA7.4_ADME_Predictions_2023-03-17-083932.csv
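An `invalid load key, '<'` error means the first byte the unpickler read was `<`. A common cause is that the download saved an HTML page (for example an error or redirect page) in place of the binary model file, although that is an assumption about this particular case. A minimal check, using a hypothetical helper `looks_like_html`, could look like this:

```python
def looks_like_html(path, probe_bytes=64):
    """Return True if the file starts with '<', as an HTML or XML page would."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    # Serialized model files are binary and should not begin with '<'.
    return head.lstrip().startswith(b"<")
```

If the check returns True for the cached model file, re-downloading the file manually, as done above, is a reasonable next step.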
Hi @emmakodes ,
thanks for the work, a few pointers:
- eos22io model: do you have the log file of the fetching? It seems something was not properly downloaded, maybe an internet issue? Please provide this (use the -v flag at fetch time and collect the log as .txt).
- NCATS models: I am unsure if you are actually downloading the model and placing it in the right folder? Same as Zakia and Pradnya
Okay I will do this immediately and give feedback
@emmakodes that's good progress! Can I ask you to run predictions again for the same molecules using PAMPA 7.4 and see if the results are the same? @pauline-banye tagging you here so you can also follow up. In short: Pauline experienced different results when running PAMPA 7.4 predictions several times.
This is interesting, seems the Pampa 7 pt model has been updated. I'm excited to see your results. Great job
Hello @GemmaTuron @pauline-banye
I ran five different predictions and the results are the same. Here are the prediction files:
eml_canonical_PAMPA7.4_ADME_Predictions_2023-03-17-083932.csv
ADME_Predictions_2023-03-17-090932_no2.csv
ADME_Predictions_2023-03-17-091048_no3.csv
ADME_Predictions_2023-03-17-091303_no4.csv
ADME_Predictions_2023-03-17-091349.csv
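A reproducibility check like this one can be automated. The sketch below is only an illustration, assuming each run is saved as a CSV with the same column layout; `read_rows` and `all_runs_identical` are hypothetical helper names.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def all_runs_identical(paths):
    """Return True if every CSV in paths has exactly the same rows as the first."""
    reference = read_rows(paths[0])
    return all(read_rows(p) == reference for p in paths[1:])
```

Comparing rows as strings is strict: any difference in the printed values, even in the last decimal, makes the check fail.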
Hello @GemmaTuron, here is the log file for the eos22io fetch error (No module named 'torch'): my_eos22io_log.txt
What I have tried recently: I activated the model environment and installed torch manually, but I still got the same error. I also ran the model on Google Colab but got the same error.
mmm thanks @emmakodes ! @carcablop does this model run on colab? I think we tested it right?
For your next steps @emmakodes:
Actually, indeed there seems to be a version mismatch currently that does not allow the eos22io to work properly! @carcablop see the latest run when I updated the metadata in airtable: https://github.com/ersilia-os/eos22io/actions/runs/4468049751/jobs/7848292761
Hello @GemmaTuron. Yes, we tested this model both in the CLI and in COLAB and it worked fine. It was tested by Pauline and Dhanshree too. I'll check with the latest update. Thanks Gemma.
Hi @GemmaTuron I have changed the way I install torch, using pip, instead of conda, as this was causing problems when trying to use the conda channel. Here are the changes. https://github.com/ersilia-os/eos22io/pull/4
HDAC3i-Finder: A Machine Learning-based Computational Tool to Screen for HDAC3 Inhibitors
The HDAC3i-Finder trained model helps to screen compounds for HDAC3 inhibitors. Histone deacetylase 3 (HDAC3) is a prospective drug target for the treatment of human diseases such as cancer.
The authors used machine learning to create a model to identify HDAC3 inhibitors from a training set of 1098 compounds. They used three different sets of molecular features for each compound, and five machine-learning classifiers were trained on each feature set. The best-performing model was based on the Morgan2 fingerprints and achieved a high early ROC enrichment. Further retrospective screening of an annotated chemical library in PubChem identified 8 novel-scaffold HDAC3 inhibitors while assaying only 1% of the compounds. The authors also developed a Python GUI application named HDAC3i-Finder to facilitate prospective screening for HDAC3 inhibitors.
hdac3i-finder
https://onlinelibrary.wiley.com/doi/10.1002/minf.202000105
https://github.com/jwxia2014/HDAC3i-Finder
S2DV: converting SMILES to a drug vector for predicting the activity of anti-HBV small molecules
The machine learning model based on training results can effectively predict the inhibitory effect of compounds on HBV and liver toxicity.
The model adopts a natural language processing technique to analyze compound simplified molecular input line entry system(SMILES) strings, enabling an accurate representation of the relationship between compounds and their substructures. This technique helps the model predict inhibitory effects on HBV and liver toxicity, verified by wet-lab experiments. This model provides a new perspective on the prediction of compound properties for anti-HBV drugs to improve hepatitis B diagnosis and human health in the future.
s2dv
https://pubmed.ncbi.nlm.nih.gov/35062019/
https://github.com/NTU-MedAI/S2DV
rdkit
pickle
numpy
streamlit (optional)
Classification
Thanks @GemmaTuron @carcablop, I just ran the eos22io model on Google Colab successfully. The "No module named 'torch'" issue is fixed.
Here is the prediction output for Essential Medicines List:
eos22io_output.csv
The eos22io prediction output for the Essential Medicines List is mostly the same as that of my own installed model. The only slight difference I see is in how the decimals are rounded. Example:
eos22io prediction    My own installed model prediction
0.477624              0.47762394
0.97662824            0.9766282
0.5828323             0.58283216
0.0710884             0.071088396
0.6505308             0.6505309
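The comparison can be checked numerically with a tolerance rather than by eye. The sketch below uses the five example pairs quoted above; the tolerance of 1e-6 is an assumption chosen for illustration, and `predictions_match` is a hypothetical helper.

```python
import math

# Example values quoted in the comparison above.
eos22io_preds = [0.477624, 0.97662824, 0.5828323, 0.0710884, 0.6505308]
local_preds = [0.47762394, 0.9766282, 0.58283216, 0.071088396, 0.6505309]


def max_abs_diff(a, b):
    """Largest absolute difference between paired predictions."""
    return max(abs(x - y) for x, y in zip(a, b))


def predictions_match(a, b, abs_tol=1e-6):
    """True if both lists have equal length and agree within abs_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )
```

With these values the two sets agree within the chosen tolerance, which is consistent with the differences being a rounding effect rather than a modelling difference.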
Hi @emmakodes !
About model eos22io, fantastic, good job on fixing the bug @carcablop, and thanks for testing @emmakodes ! For the model suggestions:
- HDAC3 seems very straightforward to implement, I like it. HDAC3 is mainly involved in the development of cancerous lesions, so it is not directly related to our mission, which is the only reason I'd deprioritize it in our list of models to be incorporated
- S2DV: I like the rationale behind the model, concerned about the little information available on their GitHub repo!
As next steps, could I ask you to: (in order of priority)
- Add both models to our "model suggestion list" here
- Find a third model
- If time allows, try to run one of the models, perhaps let's start with the S2DV?
Hello @GemmaTuron I have added both models to the "model suggestion list". I will work on implementing the S2DV model immediately and I will give feedback here.
REDIAL-2020: models predict Anti-SARS-CoV-2 Activities (Live Virus Infectivity, Viral Entry, Viral Replication, In vitro Infectivity, Human Cell Toxicity) and also perform Similarity Search
The Models were trained using data available from NCATS. These models predict anti-SARS-CoV-2 activities from molecular structure. It makes predictions for Live Virus Infectivity, Viral Entry, Viral Replication, In vitro Infectivity, and Human Cell Toxicity. It also performs Similarity Search where it displays the most similar molecules to the input query, as well as shows associated experimental data.
redial-2020
https://www.nature.com/articles/s42256-021-00335-w#Sec9
Github Repository: https://github.com/sirimullalab/redial-2020/tree/v1.0
Hi @emmakodes !
Good model, can you please add it to the model suggestion list? Thanks!
Thanks, @GemmaTuron I will do that immediately
Let me know about the S2DV when you have updates, thanks!
Okay, I will do that.
I have added REDIAL-2020 to the model suggestion list
Hello @GemmaTuron I have run S2DV model successfully on my system and here is the update:
Model: S2DV
Input: SMILES
Output: the inhibitory effect of the compound on HBV; toxicity on HepG2; whether it can be used as a potential HBV drug
Create a conda environment:
conda create -n s2dv2 python=3.7
Activate conda environment:
conda activate s2dv2
Install dependencies.
pip install scikit-learn==0.24.2
pip install xgboost==1.4.1
pip install rdkit==2022.9.4
Clone the repository of the original model:
git clone https://github.com/NTU-MedAI/S2DV.git
Open the folder containing the model in Visual Studio Code or any code editor, then open the file S2DV_main.py. Comment out 'import streamlit as st' and the 'web_demo' function, since we don't need a web interface.
Also change:
if __name__ == '__main__':
    web_demo()
to:
if __name__ == '__main__':
    main()
The author provided an example SMILES input: 'Nc1cc(OCCOCP(=O)(O)O)nc(N)n1' to test out the model.
On the terminal, run the following to change the directory to 'S2DV' folder where 'S2DV_main.py' file is located.
cd S2DV
Run the following on the terminal to predict the inhibitory effect of 'Nc1cc(OCCOCP(=O)(O)O)nc(N)n1' on HBV, its toxicity on HepG2, and whether it can be used as a potential HBV drug:
python S2DV_main.py
The model returns the following output:
Entered SMILES: Nc1cc(OCCOCP(=O)(O)O)nc(N)n1
The inhibitory effect on HBV is predicted to be:Low inhibition rate, IC50 higher than 1uM
Toxicity on HepG2 is predicted to be: Low toxicity, CC50 higher than 30uM
Whether it can be used as a potential HBV drug: Does not have the potential to be made into a drug
P.S.: The authors wrote some statements in the file 'S2DV_main.py' in Chinese, so I had to translate them into English.
Hello @GemmaTuron this is out of scope. I just wanted to say thank you for your awesome guidance in helping us contribute to Ersilia. You are an awesome mentor. I have always wanted to apply machine learning to drug discovery and right now Ersilia is making that dream possible.
Hi @emmakodes !
That's great thanks for testing the model. Funny about the Chinese statements! The final model output is very clear but it would make it difficult to incorporate in the Hub (better to have only numbers). Can you identify what is the actual outcome? I'm guessing a probability for the inhibitory HBV and toxicity? If we have one number we could work on incorporating this in the hub!
Hello @GemmaTuron
Yes, the actual outcome is a probability. The author only used statements instead of using 0 and 1 to represent inhibitory HBV and toxicity.
The following is how the author stated the statements in his code.
if HBV_predict:
    HBV_result = 'High inhibition rate, IC50 lower than 1uM'
else:
    HBV_result = 'Low inhibition rate, IC50 higher than 1uM'
if HepG2_predict:
    HepG2_result = 'High toxicity, CC50 below 30uM'
else:
    HepG2_result = 'Low toxicity, CC50 higher than 30uM'
We can easily represent the above statement to be the following to use numbers instead:
if HBV_predict:
    HBV_result = 1
else:
    HBV_result = 0
if HepG2_predict:
    HepG2_result = 1
else:
    HepG2_result = 0
We can also just stick to predicting for inhibitory HBV or toxicity.
Thanks @emmakodes ! The best would be to actually get the probability, since it helps rank compounds (for example, two compounds might be a 1, but with 0.5 proba and 0.99 proba, so as a researcher I'd like the 0.99 !) Can you identify the piece of code that is converting the probability to the statement?
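The ranking argument can be illustrated with a short sketch. The compound names, the 0.5 decision threshold, and the helper `rank_by_probability` are all hypothetical; only the 0.5 and 0.99 probabilities come from the example in the comment above.

```python
def rank_by_probability(predictions, threshold=0.5):
    """Sort (compound, probability) pairs from most to least likely active.

    Returns (compound, probability, label) tuples, where label is 1 when the
    probability reaches the assumed threshold and 0 otherwise.
    """
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)
    return [(name, p, int(p >= threshold)) for name, p in ranked]


# Both compounds receive label 1, but only the probability separates them.
example = [("compound_A", 0.5), ("compound_B", 0.99)]
ranking = rank_by_probability(example)
```

Here `ranking` places compound_B first, which is the ordering a researcher would want and which the binary label alone cannot provide.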
You are right @GemmaTuron
Here is the code from the author that predicts and returns the actual label(Y_predict) and also returns the probability prediction(Y_predict_p):
def vec_predict(vec,models,ML_model):
    for model_name,model in models:
        if model_name == ML_model:
            print(vec)
            print(len(vec))
            Y_predict = model.predict(vec.reshape(1, -1))
            Y_predict_p = model.predict_proba(vec.reshape(1, -1))[:, 1]
            return Y_predict, Y_predict_p
The author only decided to work with 'Y_predict' which returns the actual prediction to determine the statement.
In our case, we can just return 'Y_predict_p' (the probability) for a compound.
So for the compound Nc1cc(OCCOCP(=O)(O)O)nc(N)n1, the model returned 0.15674622 as the probability.
great @emmakodes !
Would you want to give it a try to incorporate this in the Hub? I suggest incorporating it as two different models, one for HBV prediction and one for HepG2. If you are up to this, for the last week of the contribution period, could you open a Model Request issue?
Okay @GemmaTuron I will do that immediately and I will give feedback.
The former outreachy interns prepared a very detailed guide of what the process looks like: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/contribute-models/example-of-the-model-incorporation-workflow
Make sure to read through to understand the whole process
Okay thanks @GemmaTuron I will do that
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application