Closed GemmaTuron closed 10 months ago
/approve
@GemmaTuron ersilia model repository has been successfully created and is available at:
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md file to accurately describe your model
If you have any questions, please feel free to open an issue and get support from the community!
Hi @GemmaTuron Apologies for not updating my progress here.
I had begun working on the model but didn't post an update here. Here is a link to what I have done so far: https://github.com/ersilia-os/eos5xng/pull/1
Hello @Richiio, regarding regression versus classification tasks: here, we have to go with one of them.
In the source code, they provided checkpoints for 16 models, and we can't use all of them in one. @GemmaTuron, I believe we have to choose the best-performing model.
Hi @Richiio and @HellenNamulinda !
[ ]
and leaving the description as a string should work
predictions = [MolWt(Chem.MolFromSmiles(smi)) for smi in smiles_list]
instead of this we should have the predictions using each model
Thanks @GemmaTuron for the clarification.
Hi @HellenNamulinda @GemmaTuron. The publication and source code didn't specify the args they used during training. This results in the following error both when testing the original code and when testing with ersilia:
The original code
(chemprop) root@Richio:~/Prediction-of-ATB-Activity# chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --no_features_scaling --preds_path tox21_preds.csv
Loading training args
Traceback (most recent call last):
File "/root/miniconda3/envs/chemprop/bin/chemprop_predict", line 8, in <module>
sys.exit(chemprop_predict())
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 506, in chemprop_predict
make_predictions(args=PredictArgs().parse_args())
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 591, in wrap
result = func(*args, **kwargs)
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 392, in make_predictions
) = load_model(args, generator=True)
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 31, in load_model
update_prediction_args(predict_args=args, train_args=train_args)
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 833, in update_prediction_args
raise ValueError(
ValueError: Features were used during training so they must be specified again during prediction using the same type of features as before (with either --features_generator or --features_path and using --no_features_scaling if applicable).
From ersilia
(eos5xng) root@Richio:~/eos5xng/model/framework# bash run.sh . ~/Desktop/test.csv ~/Desktop/output.csv
Loading training args
Traceback (most recent call last):
File "./code/main.py", line 41, in <module>
outputs = my_model(smiles_list)
File "./code/main.py", line 30, in my_model
preds = chemprop.train.make_predictions(args=args, smiles=smiles_list_list)
File "/root/eos5xng/model/framework/code/chemprop/utils.py", line 591, in wrap
result = func(*args, **kwargs)
File "/root/eos5xng/model/framework/code/chemprop/train/make_predictions.py", line 392, in make_predictions
) = load_model(args, generator=True)
File "/root/eos5xng/model/framework/code/chemprop/train/make_predictions.py", line 31, in load_model
update_prediction_args(predict_args=args, train_args=train_args)
File "/root/eos5xng/model/framework/code/chemprop/utils.py", line 834, in update_prediction_args
"Features were used during training so they must be specified again during "
ValueError: Features were used during training so they must be specified again during prediction using the same type of features as before (with either --features_generator or --features_path and using --no_features_scaling if applicable).
The error persists in both. I tried going through the publication to get the arguments they used in training but I haven't been able to get the exact arguments. Perhaps I missed a particular argument that generates this error. I've been on this for a while now.
Hi @Richiio
If I am not wrong, the only arguments you need are the directory where the checkpoints are stored and the features used. As described in the text, these are the rdkit descriptors and, from what I understand in Table S2 (linked above), the normalized version:
arguments = [
    '--checkpoint_dir', dir_model,
    '--features_generator', 'rdkit_2d_normalized',
    '--no_features_scaling'
]
I haven't seen mention of scaling anywhere
Hi @GemmaTuron, according to the documentation, when using rdkit_2d_normalized features, --no_features_scaling must be specified, as shown by the log below:
chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --preds_path tox21_preds.csv
Traceback (most recent call last):
File "/root/miniconda3/envs/chemprop/bin/chemprop_predict", line 8, in <module>
sys.exit(chemprop_predict())
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 506, in chemprop_predict
make_predictions(args=PredictArgs().parse_args())
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/tap/tap.py", line 478, in parse_args
self.process_args()
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/args.py", line 923, in process_args
super(PredictArgs, self).process_args()
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/args.py", line 220, in process_args
raise ValueError('When using rdkit_2d_normalized features, --no_features_scaling must be specified.')
ValueError: When using rdkit_2d_normalized features, --no_features_scaling must be specified.
@HellenNamulinda asked me to save the features and run predictions based on the saved features
Hi @Richiio, I don't see that you are passing '--no_features_scaling' as an argument. Please pass it. Also, I am not sure we need the train path and test path for prediction only; see the example I pointed you to: https://github.com/ersilia-os/eos3804/
@HellenNamulinda asked me to save the features and run predictions based on the saved features
This is an option if we don't have a way of specifying the features, but in this case we know them, so I'd avoid re-training the model, which is quite complex. You simply need to add the flag --no_features_scaling, as requested in the error log.
The flag --no_features_scaling means exactly this: no features scaling is used, which is what I think is happening in this case.
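The rule behind the second error message can be sketched as a small pre-flight check on the argument list. This is only an illustration and not chemprop's own code: the function name `check_feature_args` is hypothetical, and it inspects only the two flags discussed in this thread.

```python
from typing import List


def check_feature_args(arguments: List[str]) -> None:
    """Raise ValueError if rdkit_2d_normalized is requested without --no_features_scaling.

    Hypothetical helper for illustration; it mirrors the message seen in the
    log above but is not part of chemprop.
    """
    generators: List[str] = []
    if "--features_generator" in arguments:
        start = arguments.index("--features_generator") + 1
        # Collect the values that follow the flag, up to the next flag.
        for token in arguments[start:]:
            if token.startswith("--"):
                break
            generators.append(token)

    if "rdkit_2d_normalized" in generators and "--no_features_scaling" not in arguments:
        raise ValueError(
            "When using rdkit_2d_normalized features, "
            "--no_features_scaling must be specified."
        )


# The argument list suggested earlier in the thread passes the check.
check_feature_args([
    "--checkpoint_dir", "checkpoints",
    "--features_generator", "rdkit_2d_normalized",
    "--no_features_scaling",
])
```

Running such a check before calling into chemprop would surface a missing flag early, but it does not address the separate "Features were used during training" error.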
This is my main.py file I used for the above
# imports
import os
import csv
import sys
import chemprop
# parse arguments
input_file = sys.argv[1]
output_file = sys.argv[2]
# current file directory
root = os.path.dirname(os.path.abspath(__file__))
dir_model = os.path.abspath(os.path.join(root, "..", "..", "checkpoints", "classification-scaffold"))
# my model
def my_model(smiles_list):
    smiles_list_list = [[smiles] for smiles in smiles_list]
    arguments = [
        '--test_path', '/dev/null',
        '--preds_path', '/dev/null',
        '--checkpoint_dir', dir_model,
        '--features_generator', 'rdkit_2d_normalized',
        '--no_features_scaling'
    ]
    args = chemprop.args.PredictArgs().parse_args(arguments)
    preds = chemprop.train.make_predictions(args=args, smiles=smiles_list_list)
    return preds
# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    smiles_list = [r[0] for r in reader]
# run model
outputs = my_model(smiles_list)
# check input and output have the same length
input_len = len(smiles_list)
output_len = len(outputs)
assert input_len == output_len
# write output in a .csv file
with open(output_file, "w") as f:
    writer = csv.writer(f)
    writer.writerow(["Probability_score"])  # header
    for o in outputs:
        writer.writerow(o)
Hi @Richiio
Sorry I don't fully understand since your error message shows different arguments than the ones you are passing in main.py. This is what you shared just above:
chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --preds_path tox21_preds.csv
I'm sorry for not explaining well. I initially included the --no_features_scaling and got this error:
(chemprop) root@Richio:~/Prediction-of-ATB-Activity# chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --no_features_scaling --preds_path tox21_preds.csv
Loading training args
Traceback (most recent call last):
File "/root/miniconda3/envs/chemprop/bin/chemprop_predict", line 8, in <module>
sys.exit(chemprop_predict())
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 506, in chemprop_predict
make_predictions(args=PredictArgs().parse_args())
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 591, in wrap
result = func(*args, **kwargs)
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 392, in make_predictions
) = load_model(args, generator=True)
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 31, in load_model
update_prediction_args(predict_args=args, train_args=train_args)
File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 833, in update_prediction_args
raise ValueError(
ValueError: Features were used during training so they must be specified again during prediction using the same type of features as before (with either --features_generator or --features_path and using --no_features_scaling if applicable).
I thought you wanted me to remove the --no_features_scaling flag I had passed in. I showed that output to explain that I needed it. I didn't communicate that well.
Hello @Gemma, This model is quite similar to chemprop-antibiotic and also to the chemprop-sars-cov-inhibition that was incorporated by Miquel, where they could not provide the file for features_path directly, since the features depend on the molecules you want to make predictions on.
For this, you can use the save_features.py script provided and make predictions using these two commands
python scripts/save_features.py --data_path data/smiles.csv --save_path data/smiles_features.npz --features_generator rdkit_2d_normalized
python predict.py --test_path data/smiles.csv --checkpoint_dir checkpoint --preds_path test_preds.csv --features_path data/smiles_features.npz
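The two commands above could also be assembled from Python, for example to call them from a single script. The sketch below only builds the argument lists. `build_two_step_commands` and its parameters are hypothetical names, and the script paths and flags are copied from the commands quoted above rather than verified against a particular chemprop version.

```python
from typing import List, Tuple


def build_two_step_commands(
    smiles_csv: str,
    features_npz: str,
    checkpoint_dir: str,
    preds_csv: str,
) -> Tuple[List[str], List[str]]:
    """Return (save_features_cmd, predict_cmd) as argument lists.

    Hypothetical helper: the same features file is written by step 1
    and read back by step 2.
    """
    save_cmd = [
        "python", "scripts/save_features.py",
        "--data_path", smiles_csv,
        "--save_path", features_npz,
        "--features_generator", "rdkit_2d_normalized",
    ]
    predict_cmd = [
        "python", "predict.py",
        "--test_path", smiles_csv,
        "--checkpoint_dir", checkpoint_dir,
        "--preds_path", preds_csv,
        "--features_path", features_npz,
    ]
    return save_cmd, predict_cmd


save_cmd, predict_cmd = build_two_step_commands(
    "data/smiles.csv", "data/smiles_features.npz", "checkpoint", "test_preds.csv"
)
print(" ".join(save_cmd))
print(" ".join(predict_cmd))
```

Each list could then be passed in order to `subprocess.run(cmd, check=True)`, so that a failure in the feature step stops the prediction step.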
Hello @Richiio, I haven't had power the whole day but will provide test results after running. But as discussed on call, we are to run this model in two steps.
"""Loads a trained chemprop model checkpoint and makes predictions on a dataset."""
from chemprop.train import chemprop_predict
if __name__ == "__main__":
    chemprop_predict()
By running the [above](https://github.com/ersilia-os/ersilia/issues/898#issuecomment-1847481171) two commands, you should be able to get predictions.
Thanks @HellenNamulinda !
Sorry I cannot dedicate more time to this today. @Richiio, when you have time, let us know if you are able to follow Hellen's steps. I'll catch up as soon as I can.
@GemmaTuron @HellenNamulinda Thanks for the help. The model works and has been tested.
@Richiio It's great you were able to finally run the model, and I see the tests have passed.
Fantastic @Richiio Please when you are ready merge the code in your fork and let us know so we can have a look!
Hello @GemmaTuron, The Model Test on PR passed.
However, we will go ahead with some refactoring.
These models' run.sh file has two commands, rather than just the single command python $1/code/main.py $2 $3. I will discuss this with Sarima during our meeting.
@GemmaTuron, I have noticed that some chemprop and Grover models require generating features before making predictions and, as such, have two commands. If you permit, I have a list of these and of how they can be modified to have just the one usual command, python $1/code/main.py $2 $3
Hi @HellenNamulinda !
Thanks for checking. Indeed, the run.sh file should not be modified, and the code that is now in run.sh should be part of main.py
It is better to incorporate all the chemprop code in a folder, and then just use the make_predictions function. @Richiio, please have a look and think about how to do it; also look at other Ersilia models that are using chemprop. As I shared in the last meeting, the structure of Ersilia models cannot be changed. Please do not alter the run.sh / main.py framework.
Thank you @HellenNamulinda @GemmaTuron. I'll work on that and look for a possible solution before our meeting tomorrow.
Hi @Richiio and @HellenNamulinda
After looking at the model, this would be the cleanest and easiest solution, with the goal of keeping the service.py file as is. @Richiio, taking the changes you have already added to the model repository, please:
$ Create a temporary folder for the features (inside code itself, for example; you can reuse the $1 argument to specify its location)
$ python $1/code/save_features.py --data_path $2 --save_path TEMP_Feat --checkpoints_dir $1/../checkpoints --features_generator rdkit_2d_normalized
$ python $1/predict.py ...(and all the flags with the corresponding arguments)
$ Delete the temporary features folder
Does this make sense? This way we are not modifying the outer structure of the model
Alright @GemmaTuron I'll work on the necessary changes
Thanks for the help @GemmaTuron. The model has been refactored and tested, and it works. Here is the pull request for that: https://github.com/ersilia-os/eos5xng/pull/4
Hi @Richiio
Great. Before merging the PR, can you confirm where the features are saved temporarily? Where is this folder created? And then, some small clean-up before merging: please remember to delete the mock.txt file. And do we still need to install tensorboardX?
The features are saved in the framework directory. It's not a folder but rather it is saved as TEMP_feat.npz in the framework directory and deleted once predictions are made
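One way to guarantee that a temporary features file such as TEMP_feat.npz is removed even when prediction fails is a small context manager. This is a sketch under assumptions, not the code in the pull request: `temporary_features_path` is a hypothetical name, and the file lives in a throwaway directory from the standard library's `tempfile` module rather than in the framework directory.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_features_path(filename: str = "TEMP_feat.npz") -> Iterator[str]:
    """Yield a path inside a temporary directory that is deleted on exit.

    The file itself is not created here, so a tool that refuses to
    overwrite an existing file can still write to the path.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # The directory and everything in it are removed when the block
        # ends, whether it finishes normally or raises.
        yield os.path.join(tmp_dir, filename)


with temporary_features_path() as features_path:
    # Placeholder for step 1 (save features) and step 2 (predict).
    with open(features_path, "wb") as handle:
        handle.write(b"placeholder")
    file_existed_inside = os.path.exists(features_path)

file_exists_after = os.path.exists(features_path)
print(file_existed_inside, file_exists_after)
```

Using a dedicated temporary directory also avoids two concurrent runs overwriting each other's features file, which a fixed name in the framework directory would not.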
I do not need TensorboardX for the model to work. Proceeding to delete that.
This model is incorporated
Model Name
ESKAPE pathogen inhibition
Model Description
Prediction of antimicrobial potential using a dataset of 29537 compounds screened against the antibiotic-resistant pathogen Burkholderia cenocepacia. The model uses the Chemprop Directed Message Passing Neural Network (D-MPNN) and has an AUC score of 0.823 for the test set. It has been used to virtually screen the FDA-approved drugs as well as a collection of natural products (>200k compounds), with hit rates of 26% and 12% respectively.
Slug
chemprop-eskape
Tag
ESKAPE,Antimicrobial activity
Publication
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9624395/
Source Code
https://github.com/cardonalab/Prediction-of-ATB-Activity
License
None