ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Antimicrobial potential #898

Closed GemmaTuron closed 10 months ago

GemmaTuron commented 11 months ago

Model Name

ESKAPE pathogen inhibition

Model Description

Prediction of antimicrobial potential using a dataset of 29537 compounds screened against the antibiotic-resistant pathogen Burkholderia cenocepacia. The model uses the Chemprop Directed Message Passing Neural Network (D-MPNN) and has an AUC score of 0.823 on the test set. It has been used to virtually screen FDA-approved drugs as well as a collection of natural products (>200k compounds), with hit rates of 26% and 12%, respectively.

Slug

chemprop-eskape

Tag

ESKAPE,Antimicrobial activity

Publication

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9624395/

Source Code

https://github.com/cardonalab/Prediction-of-ATB-Activity

License

None

GemmaTuron commented 11 months ago

/approve

github-actions[bot] commented 11 months ago

New Model Repository Created! 🎉

@GemmaTuron ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos5xng

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!


Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

Richiio commented 11 months ago

Hi @GemmaTuron Apologies for not updating my progress here.

I had begun working on the model but didn't post an update here. Here is a link to what I have done so far: https://github.com/ersilia-os/eos5xng/pull/1

HellenNamulinda commented 11 months ago

Hello @Richiio, regarding regression versus classification: here, we have to go with one.

In the source code, they provided checkpoints for 16 models, and we can't use all of them in a single model. @GemmaTuron, I believe we have to choose the best-performing model.

GemmaTuron commented 11 months ago

Hi @Richiio and @HellenNamulinda !

  1. Quick note on the metadata error: Hellen spotted it right on. We need to check why the description gets converted to a list, but meanwhile just removing the list brackets [ ] and leaving the description as a string should work.
  2. Regarding the actual code: in main.py, you are loading all the models but not using them to make predictions. It seems you are only calculating the molecular weight: predictions = [MolWt(Chem.MolFromSmiles(smi)) for smi in smiles_list]. Instead of this, we should get the predictions from each model.
  3. Reading the original publication, I found the following under the results section: "From the eight different combinations, Model 6 (binary classification, scaffold split trained with RDKit descriptors, S2 Table) achieved the highest area under the curve of the precision-recall curve (PRC-AUC = 0.241), F1 Score (F1 = 0.104), Matthews correlation coefficient (MCC = 0.167) on the test set and was therefore selected as the primary model for our subsequent experiments." So I suggest we use only this model in our implementation.

HellenNamulinda commented 11 months ago

Thanks @GemmaTuron for the clarification.

Richiio commented 11 months ago

Hi @HellenNamulinda @GemmaTuron. The publication and source code don't specify the arguments used during training. This results in the following error, both when testing the original code and when testing with Ersilia:

The original code

(chemprop) root@Richio:~/Prediction-of-ATB-Activity# chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --no_features_scaling --preds_path tox21_preds.csv
Loading training args
Traceback (most recent call last):
  File "/root/miniconda3/envs/chemprop/bin/chemprop_predict", line 8, in <module>
    sys.exit(chemprop_predict())
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 506, in chemprop_predict
    make_predictions(args=PredictArgs().parse_args())
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 591, in wrap
    result = func(*args, **kwargs)
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 392, in make_predictions
    ) = load_model(args, generator=True)
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 31, in load_model
    update_prediction_args(predict_args=args, train_args=train_args)
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 833, in update_prediction_args
    raise ValueError(
ValueError: Features were used during training so they must be specified again during prediction using the same type of features as before (with either --features_generator or --features_path and using --no_features_scaling if applicable).

From ersilia

(eos5xng) root@Richio:~/eos5xng/model/framework# bash run.sh . ~/Desktop/test.csv ~/Desktop/output.csv
Loading training args
Traceback (most recent call last):
  File "./code/main.py", line 41, in <module>
    outputs = my_model(smiles_list)
  File "./code/main.py", line 30, in my_model
    preds = chemprop.train.make_predictions(args=args, smiles=smiles_list_list)
  File "/root/eos5xng/model/framework/code/chemprop/utils.py", line 591, in wrap
    result = func(*args, **kwargs)
  File "/root/eos5xng/model/framework/code/chemprop/train/make_predictions.py", line 392, in make_predictions
    ) = load_model(args, generator=True)
  File "/root/eos5xng/model/framework/code/chemprop/train/make_predictions.py", line 31, in load_model
    update_prediction_args(predict_args=args, train_args=train_args)
  File "/root/eos5xng/model/framework/code/chemprop/utils.py", line 834, in update_prediction_args
    "Features were used during training so they must be specified again during "
ValueError: Features were used during training so they must be specified again during prediction using the same type of features as before (with either --features_generator or --features_path and using --no_features_scaling if applicable).

The error persists in both. I tried going through the publication to get the arguments they used in training but I haven't been able to get the exact arguments. Perhaps I missed a particular argument that generates this error. I've been on this for a while now.

GemmaTuron commented 11 months ago

Hi @Richiio

If I am not wrong, the only arguments you need are the directory where the checkpoints are stored and the features used. As described in the text, these are the RDKit descriptors and, from what I understand in Table S2 (linked above), the normalized version:

    arguments = [
    '--checkpoint_dir', dir_model,
    '--features_generator', 'rdkit_2d_normalized',
    '--no_features_scaling'
    ]

I haven't seen mention of scaling anywhere
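
As a minimal sketch of how the argument list above could be assembled before it is handed to chemprop.args.PredictArgs().parse_args(...), the helper below uses only the standard library. The function name build_predict_arguments is hypothetical, and the /dev/null placeholders for --test_path and --preds_path are borrowed from the main.py shared later in this thread; whether these are the right arguments for this checkpoint is exactly what is being debugged here.

```python
from typing import List


def build_predict_arguments(
    checkpoint_dir: str,
    features_generator: str = "rdkit_2d_normalized",
    no_features_scaling: bool = True,
) -> List[str]:
    """Hypothetical helper: assemble CLI-style arguments for Chemprop prediction."""
    arguments = [
        # Placeholder paths, as in the main.py from this thread; the SMILES
        # are passed in memory rather than read from a file.
        "--test_path", "/dev/null",
        "--preds_path", "/dev/null",
        "--checkpoint_dir", checkpoint_dir,
        "--features_generator", features_generator,
    ]
    if no_features_scaling:
        # Boolean flag: it takes no value.
        arguments.append("--no_features_scaling")
    return arguments
```

Keeping the list in one function makes it easy to compare what main.py passes against what is typed on the command line.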

Richiio commented 11 months ago

Hi @GemmaTuron from the documentation, when using rdkit_2d_normalized features, --no_features_scaling must be specified as seen by the log below

Richiio commented 11 months ago

chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --preds_path tox21_preds.csv
Traceback (most recent call last):
  File "/root/miniconda3/envs/chemprop/bin/chemprop_predict", line 8, in <module>
    sys.exit(chemprop_predict())
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 506, in chemprop_predict
    make_predictions(args=PredictArgs().parse_args())
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/tap/tap.py", line 478, in parse_args
    self.process_args()
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/args.py", line 923, in process_args
    super(PredictArgs, self).process_args()
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/args.py", line 220, in process_args
    raise ValueError('When using rdkit_2d_normalized features, --no_features_scaling must be specified.')
ValueError: When using rdkit_2d_normalized features, --no_features_scaling must be specified.

Richiio commented 11 months ago

@HellenNamulinda asked me to save the features and run predictions based on the saved features

GemmaTuron commented 11 months ago

Hi @Richiio I don't see that you are passing '--no_features_scaling' as an argument. Please pass it. Also, I am not sure we need the train path and test path for prediction only; see the example I pointed you to: https://github.com/ersilia-os/eos3804/

GemmaTuron commented 11 months ago

@Richiio wrote: "@HellenNamulinda asked me to save the features and run predictions based on the saved features"

This is an option if we don't have a way of specifying the features, but in this case we know them, so I'd avoid re-training the model, which is quite complex. You simply need to add the flag --no_features_scaling, as requested in the error log.

GemmaTuron commented 11 months ago

The flag --no_features_scaling means exactly this, no features scaling is used, which is what I think is happening in this case
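
The rule quoted in the second error log above can be checked before Chemprop is invoked at all. The following is a standalone sketch, not Chemprop's own code: the function name check_scaling_flag is hypothetical, and it only mirrors the single rule stated in that error message (normalized RDKit features require --no_features_scaling).

```python
from typing import Sequence

NORMALIZED_GENERATOR = "rdkit_2d_normalized"


def check_scaling_flag(arguments: Sequence[str]) -> None:
    """Raise early if normalized RDKit features are requested without --no_features_scaling."""
    uses_normalized = NORMALIZED_GENERATOR in arguments
    disables_scaling = "--no_features_scaling" in arguments
    if uses_normalized and not disables_scaling:
        # Same wording as the error in the log above.
        raise ValueError(
            "When using rdkit_2d_normalized features, "
            "--no_features_scaling must be specified."
        )
```

This check does not cover the other error in this thread ("Features were used during training..."), which was raised even with both flags present.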

Richiio commented 11 months ago

This is my main.py file I used for the above

# imports
import os
import csv
import sys

import chemprop
# parse arguments
input_file = sys.argv[1]
output_file = sys.argv[2]

# current file directory
root = os.path.dirname(os.path.abspath(__file__))
dir_model = os.path.abspath(os.path.join(root, "..", "..", "checkpoints", "classification-scaffold"))

# my model
def my_model(smiles_list):

    smiles_list_list = [[smiles] for smiles in smiles_list]
    arguments = [
        '--test_path', '/dev/null',
        '--preds_path', '/dev/null',
        '--checkpoint_dir', dir_model,
        '--features_generator', 'rdkit_2d_normalized',
        '--no_features_scaling'
    ]

    args = chemprop.args.PredictArgs().parse_args(arguments)
    preds = chemprop.train.make_predictions(args=args, smiles=smiles_list_list)
    return preds

# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    smiles_list = [r[0] for r in reader]

# run model
outputs = my_model(smiles_list)

# check that input and output have the same length
input_len = len(smiles_list)
output_len = len(outputs)
assert input_len == output_len

# write output in a .csv file
with open(output_file, "w") as f:
    writer = csv.writer(f)
    writer.writerow(["Probability_score"])  # header
    for o in outputs:
        writer.writerow(o)

GemmaTuron commented 11 months ago

Hi @Richiio Sorry I don't fully understand since your error message shows different arguments than the ones you are passing in main.py. This is what you shared just above: chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --preds_path tox21_preds.csv

Richiio commented 11 months ago

I'm sorry for not explaining well. I initially included the --no_features_scaling and got this error:

(chemprop) root@Richio:~/Prediction-of-ATB-Activity# chemprop_predict --test_path Raw_data_used_in_ML/test.csv --checkpoint_path classification-scaffold/model6-classification-scaffold-smiles-rdkit2dnorm/fold_0/model_0/model.pt --features_generator rdkit_2d_normalized --no_features_scaling --preds_path tox21_preds.csv
Loading training args
Traceback (most recent call last):
  File "/root/miniconda3/envs/chemprop/bin/chemprop_predict", line 8, in <module>
    sys.exit(chemprop_predict())
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 506, in chemprop_predict
    make_predictions(args=PredictArgs().parse_args())
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 591, in wrap
    result = func(*args, **kwargs)
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 392, in make_predictions
    ) = load_model(args, generator=True)
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/train/make_predictions.py", line 31, in load_model
    update_prediction_args(predict_args=args, train_args=train_args)
  File "/root/miniconda3/envs/chemprop/lib/python3.8/site-packages/chemprop/utils.py", line 833, in update_prediction_args
    raise ValueError(
ValueError: Features were used during training so they must be specified again during prediction using the same type of features as before (with either --features_generator or --features_path and using --no_features_scaling if applicable).

I thought you wanted me to remove the --no_features_scaling I passed in. I showed that output to explain that I needed it. I didn't communicate that well

HellenNamulinda commented 11 months ago

Hello @Gemma, This model is quite similar to chemprop-antibiotic and also to the chemprop-sars-cov-inhibition that was incorporated by Miquel, where they could not provide the file for features_path directly, since the features depend on the molecules you want to make predictions on.

For this, you can use the save_features.py script provided and make predictions using these two commands

python scripts/save_features.py --data_path data/smiles.csv --save_path data/smiles_features.npz --features_generator rdkit_2d_normalized
python predict.py --test_path data/smiles.csv --checkpoint_dir checkpoint --preds_path test_preds.csv --features_path  data/smiles_features.npz
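
The two commands above can also be built programmatically, which makes it harder for the features file passed to the second step to drift from the one written by the first. This is only a sketch: the function name build_two_step_commands is hypothetical, and the script paths and flags are copied from the two commands above rather than verified against a specific Chemprop release.

```python
from typing import List, Tuple


def build_two_step_commands(
    smiles_csv: str,
    features_npz: str,
    checkpoint_dir: str,
    preds_csv: str,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (save-features command, predict command)."""
    # Step 1: compute the RDKit features for the input molecules and save them.
    save_cmd = [
        "python", "scripts/save_features.py",
        "--data_path", smiles_csv,
        "--save_path", features_npz,
        "--features_generator", "rdkit_2d_normalized",
    ]
    # Step 2: predict, reading the features saved in step 1.
    predict_cmd = [
        "python", "predict.py",
        "--test_path", smiles_csv,
        "--checkpoint_dir", checkpoint_dir,
        "--preds_path", preds_csv,
        "--features_path", features_npz,
    ]
    return save_cmd, predict_cmd
```

Each list could then be passed to subprocess.run(cmd, check=True) from the root of the Chemprop checkout, in order.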

HellenNamulinda commented 11 months ago

Hello @Richiio, I haven't had power the whole day but will provide test results after running. But as discussed on call, we are to run this model in two steps.

  1. Have a script to save features, like save_features.py
  2. And then a script for predicting (predict.py/main.py). Remember, predict.py/main.py contains just:

    """Loads a trained chemprop model checkpoint and makes predictions on a dataset."""

    from chemprop.train import chemprop_predict

    if __name__ == "__main__":
        chemprop_predict()

By running the [above](https://github.com/ersilia-os/ersilia/issues/898#issuecomment-1847481171) two commands, you should be able to get predictions.

GemmaTuron commented 11 months ago

Thanks @HellenNamulinda !

Sorry I cannot dedicate more time to this today. @Richiio, when you have time, let us know if you are able to follow Hellen's steps. I'll catch up as soon as I can.

Richiio commented 11 months ago

@GemmaTuron @HellenNamulinda Thanks for the help. The model works and has been tested.

HellenNamulinda commented 11 months ago

@Richiio It's great you were able to finally run the model, and I see the tests have passed.

GemmaTuron commented 11 months ago

Fantastic @Richiio Please when you are ready merge the code in your fork and let us know so we can have a look!

HellenNamulinda commented 11 months ago

Hello @GemmaTuron, The Model Test on PR passed.

However, we will go ahead with some refactoring. These models' run.sh files have two commands, instead of the usual single command python $1/code/main.py $2 $3. I will discuss this with Sarima during our meeting. @GemmaTuron, I have noticed that some Chemprop and Grover models require generating features before making predictions and, as such, have two commands. If you permit, I have a list of these and of how they can be modified to use just the one usual command python $1/code/main.py $2 $3.

GemmaTuron commented 11 months ago

Hi @HellenNamulinda !

Thanks for checking. Indeed, the run.sh file should not be modified, and the code that is now in run.sh should be part of main.py

It is better to incorporate all the chemprop code in a folder, and then just use the make_predictions function. @Richiio, please have a look and think about how to do it; also look at other Ersilia models that are using chemprop. As I shared in the last meeting, the structure of Ersilia models cannot be changed. Please do not alter the run.sh / main.py framework.

Richiio commented 11 months ago

Thank you @HellenNamulinda @GemmaTuron. I'll work on that and look for a possible solution before our meeting tomorrow.

GemmaTuron commented 11 months ago

Hi @Richiio and @HellenNamulinda

After looking at the model, this would be the cleanest and easiest solution, with the goal of keeping the service.py file as is. @Richiio, building on the changes you have already made to the model repository, please:

  1. Copy the service.py file from the template again, without modifications
  2. Work on the run.sh file so that it only requires the three arguments the service.py file passes ($1: framework directory, $2: input file, $3: output file). This should be something like:
    $Create a temporary folder for the features (inside code itself, for example; you can reuse the $1 argument to specify its location)
    $python $1/code/save_features.py --data_path $2 --save_path TEMP_Feat --checkpoints_dir  $1/../checkpoints --features_generator rdkit_2d_normalized
    $python $1/predict.py ...(and all the flags with the corresponding arguments)
    $Delete the temporary features folder

    Does this make sense? This way we are not modifying the outer structure of the model
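
The plan above (create a temporary location, save features, predict, clean up) can be sketched in Python as follows. This is an illustration under several assumptions, not the code that was merged: the function name run_with_temporary_features, the injectable run_command parameter, the file name features.npz, and the location of both scripts under code/ are hypothetical; the predict flags are taken from the two commands Hellen shared earlier in this thread.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, List, Optional


def run_with_temporary_features(
    framework_dir: str,
    input_csv: str,
    output_csv: str,
    run_command: Optional[Callable[[List[str]], None]] = None,
) -> None:
    """Hypothetical two-step run that never leaves the features file behind."""
    if run_command is None:
        # Default: run the command and raise if it exits with an error.
        def run_command(cmd: List[str]) -> None:
            subprocess.run(cmd, check=True)

    code_dir = os.path.join(framework_dir, "code")  # assumed layout
    checkpoint_dir = os.path.join(framework_dir, "..", "checkpoints")

    # The temporary folder is removed on exit, even if a step fails.
    with tempfile.TemporaryDirectory(dir=framework_dir) as tmp_dir:
        features_path = os.path.join(tmp_dir, "features.npz")
        run_command([
            sys.executable, os.path.join(code_dir, "save_features.py"),
            "--data_path", input_csv,
            "--save_path", features_path,
            "--features_generator", "rdkit_2d_normalized",
        ])
        run_command([
            sys.executable, os.path.join(code_dir, "predict.py"),
            "--test_path", input_csv,
            "--checkpoint_dir", checkpoint_dir,
            "--preds_path", output_csv,
            "--features_path", features_path,
        ])
```

Using a temporary folder rather than a pre-created temporary file means the save path does not exist until the first step writes it, and a context manager keeps the cleanup in one place.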

Richiio commented 11 months ago

Alright @GemmaTuron I'll work on the necessary changes

Richiio commented 11 months ago

Thanks for the help @GemmaTuron. The model has been refactored, tested and it works. Here is the pull request as regards that https://github.com/ersilia-os/eos5xng/pull/4

GemmaTuron commented 11 months ago

Hi @Richiio

Great. Before merging the PR, can you confirm where the features are saved temporarily? Where is this folder created? And then, some small cleanup before merging: please remember to delete the mock.txt file. And do we still need to install tensorboardX?

Richiio commented 11 months ago

The features are saved in the framework directory. They are not stored in a folder; they are saved as a single file, TEMP_feat.npz, in the framework directory, and that file is deleted once predictions are made.

Richiio commented 11 months ago

I do not need TensorboardX for the model to work. Proceeding to delete that.

GemmaTuron commented 10 months ago

This model is incorporated