ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Prediction of drug metabolites using neural machine translation #507

Closed carcablop closed 1 year ago

carcablop commented 1 year ago

Model Name

Prediction of drug metabolites

Model Description

Based on transfer learning, this model predicts drug metabolites. First, a Transformer model (the specifications were based on the Molecular Transformer: https://github.com/pschwllr/MolecularTransformer ) is pre-trained on a set of chemical reactions. The model is then fine-tuned on a data set of human metabolic transformations. Finally, an ensemble of multiple fine-tuned models is created. The output is the union of the predictions of each model (SMILES sequences of the possible metabolites).

Slug

drug-metabolites

Tags

metabolites

Publication

https://pubs.rsc.org/en/content/articlelanding/2020/sc/d0sc02639e#fn1

Code

https://github.com/KavrakiLab/MetaTrans

License

BSD 3-Clause License

miquelduranfrigola commented 1 year ago

/approve

github-actions[bot] commented 1 year ago

New Model Repository Created! 🎉

@carcablop ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos935d

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

carcablop commented 1 year ago

In this model, drug metabolite prediction is treated as a SMILES sequence translation problem. The translation starts from a model pre-trained on chemical reaction data (based on the Molecular Transformer), which is then fine-tuned on a dataset of metabolic transformations to predict the outcome of metabolic reactions in humans. This step relies on the OpenNMT toolkit.

Because multiple and diverse metabolites have to be taken into account, an ensemble is then built from multiple fine-tuned models, in this case six previously fine-tuned models downloaded from this page: pre-trained models. Each model takes a SMILES sequence as input and predicts the SMILES sequences of its metabolites. The final output is the union of the sets of metabolites predicted by each of the models.

Below is an example of the output of metabolite prediction with the Molecular Transformer model. Outputs of each of the models: model1_beam5.txt model2_beam5.txt model3_beam5.txt model4_beam5.txt model5_beam5.txt model6_beam5.txt

Finally, this is the output file of the metabolite prediction obtained by applying the ensemble model (the union of the outputs of the previous models): predict_metabolitos.csv
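The union step can be sketched as follows. This is only an illustration, not the MetaTrans code: it assumes that each model's beam output has already been grouped into a mapping from input SMILES to a list of predicted metabolite SMILES, and the name `union_predictions` is hypothetical. Strings are compared literally, so two different SMILES strings for the same molecule would count as distinct unless they are canonicalised first.

```python
from typing import Dict, List


def union_predictions(per_model: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge per-model predictions into one de-duplicated list per input SMILES."""
    merged: Dict[str, List[str]] = {}
    for predictions in per_model:
        for parent, metabolites in predictions.items():
            seen = merged.setdefault(parent, [])
            for smi in metabolites:
                if smi not in seen:  # keep first-seen order, drop repeats
                    seen.append(smi)
    return merged


# Toy data standing in for two of the six model outputs (made up for illustration).
model_a = {"CCO": ["CC=O", "CC(=O)O"]}
model_b = {"CCO": ["CC(=O)O", "OCCO"]}
print(union_predictions([model_a, model_b]))
```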

carcablop commented 1 year ago

Update: The model has been run successfully on my local machine. It was installed in a conda environment with Python 3.7. The dependency problems that occurred when trying to translate SMILES sequences have been resolved:

* The IPython module was not found. Solved by installing the module inside the virtual environment: pip install ipython
* No module comet_ml. Solved by installing the module: pip install comet_ml
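A quick way to spot this kind of problem before running the model is to ask the interpreter which modules it can locate. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and it is meant for top-level module names only.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# IPython and comet_ml are the two modules reported missing above.
print(missing_modules(["IPython", "comet_ml"]))
```

An empty list means both modules are importable from the environment that runs the script.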

Once these problems were solved, I obtained the predictions of drug metabolites. I forked and cloned the repository and am currently working on the model code. https://github.com/carcablop/eos935d/commit/010579dd483b5b6eeda7d0657008aa44d0043260.

carcablop commented 1 year ago

Hello @GemmaTuron and @miquelduranfrigola I've been working on the model code. I have added the code and the parameters of the model to their respective folders (framework and checkpoints). I have also modified the main.py file, tested the model with my changes, and obtained the output files of the predictions for the 6 models. I tried to make only minimal changes to the original code. When I tried to push these changes to my repository, I got the following error: [image]

DhanshreeA commented 1 year ago

Hi @carcablop I got the same error while trying to push my changes to remote as well. It is because "Git LFS on github.com does not currently support pushing LFS objects to public forks. GitHub Enterprise does support this behavior" (from the most downvoted comment on this issue: https://github.com/git-lfs/git-lfs/issues/1906#issuecomment-276602035 :sweat_smile: )

What I did was disable Git LFS by uninstalling it, and then I pushed. I am not sure that's the best way, though, especially since you may be dealing with multiple large models. @miquelduranfrigola @GemmaTuron what do you suggest?

GemmaTuron commented 1 year ago

Hi @DhanshreeA ,

This is a problem indeed. We cannot deactivate Git LFS because the models are too large and we need to store them there. As a temporary solution, I am adding these repos to our GitHub team, so in principle you will get push access to them. This is not ideal because I need to enable each repo manually, but it can help for the moment. Try again and let me know!
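For context, GitHub rejects pushes of regular (non-LFS) files larger than 100 MiB, which is why large checkpoints have to stay in Git LFS. A minimal sketch for listing the files that cross that limit is shown below. The helper name `files_over_limit` and the path `eos935d/model/checkpoints` are assumptions made for illustration.

```python
import os


def files_over_limit(root, limit_bytes=100 * 1024 * 1024):
    """Yield (path, size) for files under root that are larger than limit_bytes."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip Git internals
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):  # ignore broken symlinks and the like
                continue
            size = os.path.getsize(path)
            if size > limit_bytes:
                yield path, size


# Hypothetical location of the model parameters; prints nothing if it does not exist.
for path, size in files_over_limit("eos935d/model/checkpoints"):
    print(path, size)
```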

carcablop commented 1 year ago

Thanks @DhanshreeA Before the commit I had gotten a warning about git lfs, and I hadn't paid attention to it. I'll try again @GemmaTuron. Thanks a lot.

carcablop commented 1 year ago

Hello @GemmaTuron I already pushed the changes. Here are my changes so far: https://github.com/carcablop/eos935d/commit/e1801248eaa1fd66bdd04cd91c9c47a42f36c844.

carcablop commented 1 year ago

Hi @GemmaTuron and @miquelduranfrigola This is the output file of the predictions for the input file provided by Ersilia, "eml_canonical.csv". Getting the predictions takes an hour. eos935d_test.csv https://github.com/carcablop/eos935d/commit/cb87d2ca9d967faae7b93e6a03a76c223bf53f44

GemmaTuron commented 1 year ago

Hi @carcablop ! Great, so you think the model is ready for someone else to test it? Please, when you can, also update the README file with a bit of information about the model. Thanks!

carcablop commented 1 year ago

Hello @GemmaTuron . The model is not ready. Sorry, I didn't explain how I got the output in the previous step. To test the model, I ran the run_predict.sh script in the conda environment with all the packages installed, like this:

    cd ~/Desktop
    FRAMEWORK_PATH="eos935d/model/framework/"
    bash $FRAMEWORK_PATH/run_predict.sh $FRAMEWORK_PATH eml_canonical.csv output.csv

This way I got the output, which shows that my main.py is working.

Now I have modified the Dockerfile, taking into account what @miquelduranfrigola recommended (specify the version of each installed package, so that we do not run into version incompatibility problems in the future). I'm running the fetch with the command: ersilia -v fetch eos935d --repo_path= miruta. This is taking too long, and the fetch is still running on my computer. [image] I will wait for it to finish and keep you updated.

carcablop commented 1 year ago

This is generating an error, maybe because I haven't modified the service.py file. [image]

My output is a sequence of SMILES. Should I change the service.py file to this? [image]

carcablop commented 1 year ago

Hello @GemmaTuron @miquelduranfrigola @Amna-28. I am trying to fetch the model that I am going to incorporate into Ersilia, but it throws the following errors (the log is attached). The errors say that some modules, such as pandas and torch, cannot be found, even though I specify the packages to install in the Dockerfile. I would appreciate it if someone could take a look at my Dockerfile; maybe I have not specified this correctly. In another conda environment, outside of the Ersilia environment, it runs fine.

This is the log of the fetch: log_eos935d.txt

This is the Dockerfile: [image]

carcablop commented 1 year ago

Update: Within the ersilia environment, I installed the packages manually from the console, with the ersilia environment activated:

    conda install rdkit -c rdkit
    conda install future six tqdm pandas
    conda install pytorch=1.1.0 torchvision -c pytorch
    pip install torchtext==0.3.1
    pip install -e .

Although this runs fine outside of Ersilia, when running inside of Ersilia I get a python syntax error on one of the files where the input data is processed. It seems strange to me since this works fine outside of Ersilia. Should I modify something else in the service.py file? I share the fetch output log. eos935d_log_ersilia.txt

GemmaTuron commented 1 year ago

Hi @carcablop :

Are you installing the packages in the Ersilia environment or in the eos935d environment? They should be installed in the model environment, not in the base ersilia environment. If you installed them in the base environment, I'd suggest removing ersilia and installing it again so that you don't have packages that shouldn't be there. Can you activate the eos935d environment and check which versions of pandas and torch you have, if any? These are the packages that give errors.

Can you provide more information on the input you are passing? A single molecule, a list?

carcablop commented 1 year ago

Hello @GemmaTuron To fetch the model, I run the following commands:

  1. conda activate ersilia
  2. ersilia -v fetch eos935d --repo_path /home/carcablop/eos935d

I will reinstall Ersilia and run the fetch again.

GemmaTuron commented 1 year ago

Hi @carcablop

When you fetch a model, it automatically creates its own conda environment with the name of the model. In that environment is where you need to see if the dependencies are installed. Please check the logs you shared and you will see the steps indicating the conda environment is being created and which packages are being installed there.
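One way to check the versions in the model environment without activating it by hand is `conda run -n <env>`. The sketch below only builds the command; the helper name `version_check_command` is hypothetical, and it assumes a conda version that provides `conda run` and that an environment named eos935d exists. If a package is missing, the ImportError printed by the child process is itself the answer.

```python
import subprocess


def version_check_command(env_name, packages):
    """Build a `conda run` command that prints each package's version in env_name."""
    snippet = "; ".join(
        "import {0}; print('{0}', {0}.__version__)".format(pkg) for pkg in packages
    )
    return ["conda", "run", "-n", env_name, "python", "-c", snippet]


cmd = version_check_command("eos935d", ["pandas", "torch"])
print(cmd)
# Uncomment on a machine where conda and the eos935d environment exist:
# result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
#                         universal_newlines=True)
# print(result.stdout or result.stderr)
```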

carcablop commented 1 year ago

Hello Gemma. I activated the model environment eos935d and listed the installed dependencies: conda_packages_eos935d.txt

These are the conda environments: [image]

carcablop commented 1 year ago

Hello @GemmaTuron Update: I have finally fetched the model successfully, thanks to Miquel.

The first error was related to the subprocess call that I use in main.py. It was necessary to specify the Python path of the active conda environment, in this case the environment created for the model (eos935d). For this, I created a script that returns the Python path of that environment, and this value is passed to the subprocess call. (I am still improving this script. Currently the fetch works on any machine, but outside of Ersilia the script would need modifications.)

Then it was necessary to modify the service.py file so that it writes a file with three columns, in the same way as the script prepapre_input_file.py, which I use to read and process the input data.

I have updated all the code in my repository, and the model is ready to be tested by someone else. https://github.com/carcablop/eos935d
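The script mentioned above is not shown in this thread. As a hedged alternative, when the parent script already runs inside the model environment, `sys.executable` gives the absolute path of the running interpreter, so a child process can be pinned to the same environment instead of whatever `python` is first on PATH. The helper name `run_with_current_interpreter` is hypothetical.

```python
import subprocess
import sys


def run_with_current_interpreter(args):
    """Run a Python child process with the interpreter that runs this script."""
    result = subprocess.run(
        [sys.executable] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.7)
        check=True,  # raise CalledProcessError if the child fails
    )
    return result.stdout


# The child reports which interpreter it is running under.
print(run_with_current_interpreter(["-c", "import sys; print(sys.executable)"]))
```

This only helps when the parent process itself was started from the model environment. If it was not, the interpreter path still has to be resolved some other way.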

miquelduranfrigola commented 1 year ago

Hello @carcablop ! The model worked successfully in our GitHub Actions workflow. You will see a "failed build" but this is unrelated to your model (it is related to the AirTable; I will solve it asap).

So I think we can close the issue. Amazing work, @carcablop .

GemmaTuron commented 1 year ago

Before closing this issue, let's make sure the model is tested. I am assigning this to @Femme-js , @DhanshreeA and @pauline-banye ! Please comment once you can confirm it works on your system and in google colab (using the newest template we edited)

GemmaTuron commented 1 year ago

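Following our latest discussion with the team:

* Does the model work if we pass a .csv file with a single list of SMILES?

* If one of the SMILES is incorrect, does the model skip it or crash?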

paulinebanye commented 1 year ago

These are very valid observations @GemmaTuron. I hadn't actually considered what happens if an incorrect SMILES is passed. It is important to try this with other models as well.

carcablop commented 1 year ago

> Following our latest discussion with the team:
>
> * Does the model work if we pass a .csv file with a single list of SMILES?
> * If one of the SMILES is incorrect, does the model skip it or crash?

Hi @GemmaTuron When I pass an incorrect molecule as input, the model crashes: !ersilia api predict -i "XX1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O"

[image]

carcablop commented 1 year ago

Hello, everyone. Has anyone been able to fetch the eos935d model in the CLI? I have tried on my local system, and it fails to install future, rdkit, pandas, pytorch and torchvision in the environment. I think I'm having problems with my Windows Subsystem for Linux: I get errors similar to the ones I got when I tried to install the dependencies for the model Paulina is working on. When installing those libraries I get package conflict errors like these, if anyone wants to take a look: [image]

This is the log of the errors: log_eos935d_fails_dependencies_isntalled.txt

I would appreciate it if you could test the model. I have tested it in Google Colab and it does not fail, so I suspect that my Windows Subsystem for Linux is at fault, and I want to be sure that the model does not present problems when fetched on your local systems. Thank you.

carcablop commented 1 year ago

> Following our latest discussion with the team:
>
> * Does the model work if we pass a .csv file with a single list of SMILES?

Hi Gemma. I also tried passing it a .csv file with a list (a list of two molecules) and it executed correctly in Google Colab. This is the input file: lista_columna.csv

And this is the output. eos935d_output.csv
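For reference, reading a single-column input file of this kind can be sketched with the standard library alone. The helper name `read_smiles_column` and the `smiles` header in the example are assumptions; the actual layout of lista_columna.csv is not shown in this thread.

```python
import csv
import io


def read_smiles_column(handle, has_header=True):
    """Return the first column of a CSV file as a list of non-empty strings."""
    rows = csv.reader(handle)
    if has_header:
        next(rows, None)  # skip the header row if the file has one
    return [row[0].strip() for row in rows if row and row[0].strip()]


# In-memory stand-in for a two-molecule input file.
example = io.StringIO("smiles\nCCO\nc1ccccc1\n")
print(read_smiles_column(example))
```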

paulinebanye commented 1 year ago

Hi @carcablop great job! Yeah I can test it out on the CLI. I'll keep you updated

DhanshreeA commented 1 year ago

> Hi @GemmaTuron The result when I passed an incorrect molecule as input, the model crashes. !ersilia api predict -i "XX1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O"

I think this is to be expected: any time you pass a random string instead of an actual molecule as a SMILES to rdkit, its MolFromSmiles function fails to parse it (it returns None), and the code that uses the result crashes. This applies to every model that expects SMILES input. I guess this is something we can update in the model incorporation template repo so that garbage input is handled gracefully. However, this sort of error handling may not even be needed, because our users are unlikely to play around with these models with the intention of crashing them.

However, an input that is a valid SMILES but not a valid input for the model is a different matter (it takes some domain knowledge to realize where this might happen). That sort of corner case will be different for each model, and figuring it out, at least as an outsider to the field, might be an iterative process. For example, in the case of CReM, this happened only for a few inputs. When I ran the model with a random sample (10-15) of the 442 mols in eml_canonical, the model did not crash. However, when I tested it with the entire list of mols, it crashed for a few, which made me realize that the model can crash and that it crashes with a specific exception.

What do you think @GemmaTuron and @miquelduranfrigola?

DhanshreeA commented 1 year ago

Hey @carcablop The model works well on CLI as well as Colab. Here's the CLI output csv and colab link. eos935d_output.csv https://colab.research.google.com/drive/1se8VBT2X0yak1UNK2vcByNYwjoPQ3nFS?usp=sharing

carcablop commented 1 year ago

Hello @GemmaTuron and @miquelduranfrigola. I have solved the package conflict errors that I reported when fetching the model locally :). To solve them, I uninstalled miniconda and reinstalled it. The problem was specific to my conda installation, but I could not find exactly what was causing it, so I had to reinstall conda and create a clean ersilia environment.

  1. To uninstall miniconda: rm -rf ~/miniconda3

The model fetched successfully on my machine.

GemmaTuron commented 1 year ago

Hi @carcablop thanks for the update. This model is pending to be tested by @DhanshreeA @pauline-banye and @Femme-js ! Can you confirm it works?

DhanshreeA commented 1 year ago

> Hi @carcablop thanks for the update. This model is pending to be tested by @DhanshreeA @pauline-banye and @Femme-js ! Can you confirm it works?

  • [x] Carolina
  • [ ] Pauline
  • [x] Dhanshree
  • [ ] Jeevanshi

Hi @GemmaTuron I've tested it on my CLI and on Colab - the model works. I've posted results in the comments above. :)

DhanshreeA commented 1 year ago

@GemmaTuron there's also another comment I'd like your input on.

GemmaTuron commented 1 year ago

Hi @DhanshreeA , good points. The issue you found is related to SMILES standardisation, which might differ slightly per model. The model will treat a random string and an invalid SMILES equally, so a module that handles errors as exceptions instead of crashing should work for both. The issue is to identify exactly which standardisation system (if any) the model uses. As a rule of thumb, if rdkit can read the molecule, it will be OK to go.
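A minimal sketch of such a module, following the rule of thumb above, is shown below. It relies on the fact that RDKit's `Chem.MolFromSmiles` returns `None` for a string it cannot parse. The names `split_valid` and `_rdkit_parse` are hypothetical, and the parser is passed in as an argument so the filter can also be tried without RDKit installed. This is not the code of the Ersilia template.

```python
def _rdkit_parse(smiles):
    """Parse with RDKit; returns None when the string is not a readable SMILES."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    return Chem.MolFromSmiles(smiles)


def split_valid(smiles_list, parse=_rdkit_parse):
    """Split inputs into (valid, rejected) instead of crashing on bad strings."""
    valid, rejected = [], []
    for smi in smiles_list:
        try:
            mol = parse(smi)
        except Exception:  # treat any parser failure as an invalid input
            mol = None
        if mol is not None:
            valid.append(smi)
        else:
            rejected.append(smi)
    return valid, rejected


# With RDKit installed, the string tested earlier in this thread would be checked with:
# valid, rejected = split_valid(["CCO", "XX1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O"])
```

Only the valid list would then be passed to the model, while the rejected inputs can be reported back to the user or written as empty rows in the output.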

GemmaTuron commented 1 year ago

This model incorporation is complete. I will close this issue and we will do the model testing as detailed in https://github.com/ersilia-os/eos935d/issues/2