✍️ Contribution period: Masroor Hussain Shah

masroor07 commented 1 year ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[X] Install the Ersilia Model Hub and test the simplest model
[X] Write a motivation statement to work at Ersilia
[X] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[X] Select a model from the suggested list
[X] Install the model in your system
[X] Run predictions for the EML
[X] Compare results with the Ersilia Model Hub implementation!

Week 3 - Propose new models

[X] Suggest a new model and document it (1)
[X] Suggest a new model and document it (2)
[X] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[X] Submit the final application in the Outreachy website

masroor07 commented 1 year ago

Hi @pauline-banye, How long does it take to create the environment for the ADME-NCATS model? Seems like it takes a lot of time. Sorry for the inconvenience! Thank you

Hi @masroor07 Which is the command you are using that gets stuck? The models must be downloaded manually btw

It finally finished downloading all the dependencies, but app.py didn't work! I received the following error: ImportError: cannot import name 'escape' from 'jinja2'. I simply uninstalled and reinstalled Flask to solve the problem. I am now downloading the models for testing.

Thank you

I was able to run the PAMPA Permeability (pH 5.0) model. I followed the simple instructions to install the model on my subsystem (Ubuntu).

MODEL INTERPRETATION: The model provides predicted class (1 or 0) for a given compound. If the predicted class is '1', it means the compound is predicted to have 'low permeability' (i.e., log Peff < 1.0) and if the predicted class is '0', the compound is predicted to have 'moderate to high permeability' (i.e., log Peff > 1.0). The models also provide a probability score (between 0 and 1), shown in parentheses next to the predicted class.

Input to the model: I used 10 SMILES to test the model (10_SMILES.csv).
Output: ADME_Predictions_2023-03-15-070819.csv

I will now try to compare the results of the ADME-NCATS model with the model implemented in Ersilia (eos81ew).

GemmaTuron commented 1 year ago

Perfect, many thanks @masroor07 ! We need to check that we are getting the same values from the original model and the one we implemented at Ersilia. Please note that in the Ersilia Model Hub we transformed the results to always give the probability of class 1; let's see if everything is working correctly :)

masroor07 commented 1 year ago

Alright! thank you for the update. I will compare the results in a while.

paulinebanye commented 1 year ago

Great job on debugging @masroor07 😊. Glad you were able to get the NCATS ADME model working 👍

masroor07 commented 1 year ago

Thanks @pauline-banye !

I have been trying to fetch PAMPA5 using Ersilia, but I get the following error:

mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '../../checkpoints'

I tried giving elevated privileges to the user but that doesn't seem to solve the problem.

ZakiaYahya commented 1 year ago

@masroor07, I'm getting the same error while fetching the Ersilia NCATS solubility model. If you find out how to resolve it, let me know too. Thanks

paulinebanye commented 1 year ago

@ZakiaYahya @masroor07 Someone had this same issue and it was resolved by granting the user privileges.

The Permission error was due to the Model trying to create a directory, But the User had insufficient permission, So I had to grant the user Privileges.

https://github.com/ersilia-os/ersilia/issues/615#issuecomment-1470361591

masroor07 commented 1 year ago

Still facing the same issue. Tried giving the user privileges but that doesn't seem to solve my problem.

GemmaTuron commented 1 year ago

Hi @masroor07 and @ZakiaYahya

Are your users Admins of the computers you are using? You need to grant yourselves super-user privileges (this differs in windows, linux and macos). In Linux you can check this for example. As a workaround @masroor07 can you use the Colab implementation to run the model through Ersilia? @pauline-banye can you share more information on where the mkdir command is used? perhaps it wouldn't be necessary

paulinebanye commented 1 year ago

Hi @GemmaTuron, the creation of the directory is not necessary anymore.

It was necessary when the models were downloaded directly. It checks if the directory exists and creates it if it doesn't.

The current setup with the models already downloaded and within the repository makes it unnecessary.
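
A minimal sketch of the "check if the directory exists and create it if it doesn't" step, written so that a failure such as the `PermissionError` seen above falls back to a user-writable location instead of requiring elevated privileges. This is not the code from the model repository; `ensure_checkpoint_dir` and the fallback path are hypothetical.

```python
import os
import tempfile


def ensure_checkpoint_dir(preferred, fallback=None):
    """Return a writable checkpoint directory, creating it if needed.

    Tries `preferred` first; if it cannot be created or written to,
    tries `fallback` (by default a folder in the system temp dir).
    """
    if fallback is None:
        fallback = os.path.join(tempfile.gettempdir(), "checkpoints")
    for candidate in (preferred, fallback):
        try:
            # exist_ok=True makes this a no-op when the directory is already there
            os.makedirs(candidate, exist_ok=True)
        except OSError:
            # PermissionError is a subclass of OSError
            continue
        if os.access(candidate, os.W_OK):
            return os.path.abspath(candidate)
    raise PermissionError(
        f"no writable checkpoint directory among {preferred!r} and {fallback!r}"
    )
```

Since the checkpoints now ship inside the repository, dropping the directory creation altogether (as discussed here) is simpler than any fallback.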

GemmaTuron commented 1 year ago

@pauline-banye then this should be removed from the code. Legacy code, especially if it requires special user permissions, can cause problems later on, as we are seeing.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron and @pauline-banye, I have successfully fetched the model by granting user privileges. Thanks.

paulinebanye commented 1 year ago

That's fantastic @ZakiaYahya 👍

masroor07 commented 1 year ago

Alright, I will try running the model using Colab! I probably messed up granting the privileges; I will try to work my way around it.

Thank you

masroor07 commented 1 year ago

Was finally able to fetch the model.

Fetching eos81ew done in time: 0:03:38.524597s
18:56:29 | INFO     | Fetching eos81ew done successfully: 0:03:38.524597

Tried running it for a sample SMILES:

 ersilia api run -i "CCCC"
{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "outcome": [
            1.970733478628972e-07
        ]
    }
}
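
A small sketch for pulling the prediction out of the JSON shown above, assuming the CLI output has been captured as text. Only the structure visible in this comment is relied on; `extract_outcomes` is a hypothetical helper, not part of Ersilia.

```python
import json


def extract_outcomes(raw):
    """Map each input SMILES to its first outcome value.

    Accepts either a single record (as printed above) or a list of
    such records.
    """
    parsed = json.loads(raw)
    records = parsed if isinstance(parsed, list) else [parsed]
    return {rec["input"]["input"]: rec["output"]["outcome"][0] for rec in records}


raw = """
{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "outcome": [
            1.970733478628972e-07
        ]
    }
}
"""
print(extract_outcomes(raw))
```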

masroor07 commented 1 year ago

Hello @masroor07 ! Great explanations, thanks for your dedication. As an extra task for this week, might I ask you to try to run the ADME-NCATS models? We have had some issues with @pauline-banye with some of them and I'd like to know if these are persistent across users. Could you:

Follow the installation detailed in the adme-ncats repo (please, use the development branch, not main)

Download the pretrained model PAMPA5

Test 5 molecules on the ADME-NCATS model as well as on the model implemented in Ersilia (eos81ew) and compare the results

Comparison - ADME-NCATS (PAMPA5.0) and NCATS-PAMPA5 (eos81ew)

Summary of the overall process.

adme-ncats (PAMPA5.0): The model provides a predicted class (1 or 0) for a given compound. If the predicted class is '1', it means the compound is predicted to have 'low permeability' (i.e., log Peff < 1.0) and if the predicted class is '0', the compound is predicted to have 'moderate to high permeability' (i.e., log Peff > 1.0). The models also provide a probability score (between 0 and 1), shown in parentheses next to the predicted class.

Input to the model: 10_SMILES.csv
Output: ADME_Predictions_2023-03-15-070819.csv

ncats-pampa5 (eos81ew): In vitro surrogate to determine the permeability of drugs across cellular membranes. Peff was converted to a logarithmic scale; compounds with a log Peff value lower than 2.0 were considered to have low to moderate permeability, and those with a value higher than 2.5 were considered high-permeability compounds. Compounds with a value between 2.0 and 2.5 were omitted from the dataset.

Challenges during installation: The model requires the user to have elevated privileges, which should not be the case. I was able to solve the problem by elevating the user's privileges but, in the process, ended up corrupting a couple of my system files. I was able to resolve that by going through a discussion on Stack Overflow. The error that was thrown:

mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '../../checkpoints'

Input to the model: 10_SMILES.csv

Output: processed.csv

GemmaTuron commented 1 year ago

Hi @masroor07

Great that you were able to run both models! The ADME NCATS model and the implementation at Ersilia are the same model, so for eos81ew you are simply describing the dataset used to train the model, but not the actual model output? Can you now compare the predictions you got for the same molecules on both models and see if they make sense? Since they are the same model, they should coincide.

paulinebanye commented 1 year ago

Hi @GemmaTuron I can make the edits to the code right now. I'm creating a temporary fork and making a PR. I'm doing this for all the NCATS models.

masroor07 commented 1 year ago

Alright, will try to run predictions for the same molecules on both models.

masroor07 commented 1 year ago

Hi @GemmaTuron,

I tried running predictions for various molecules on both models. And yes, the outputs coincide. In eos81ew, a probability below 0.5 is considered highly permeable, whereas a probability of 0.5 or greater is considered low permeability. For example, if I run a prediction for the SMILES input Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1, the output I get is 0.903985857963562, which indicates that the molecule is not highly permeable.
But for the SMILES input CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1, the output is 0.0090377377346158, which indicates that the molecule is moderately to highly permeable.
When I run predictions for the same molecules on the ADME NCATS model, for Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1 the output I get is low permeability with Predicted Class (Probability) equal to 1, and for CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 the output is high permeability with Predicted Class (Probability) equal to 0.
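
The comparison described above can be sketched as follows, using the 0.5 cut-off and the two molecules from this comment. The function names are hypothetical, and the ADME-NCATS classes are entered by hand from the reported results rather than read from the output file.

```python
LOW_PERMEABILITY = 1   # class 1 in ADME-NCATS
HIGH_PERMEABILITY = 0  # class 0 (moderate to high permeability)


def class_from_probability(prob_low, threshold=0.5):
    """Turn the eos81ew probability of class 1 into a class label."""
    return LOW_PERMEABILITY if prob_low >= threshold else HIGH_PERMEABILITY


def disagreements(results):
    """Return the SMILES whose thresholded eos81ew output differs from
    the ADME-NCATS class. `results` maps SMILES -> (probability, class)."""
    return [
        smiles
        for smiles, (prob, ncats_class) in results.items()
        if class_from_probability(prob) != ncats_class
    ]


results = {
    "Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1": (0.903985857963562, 1),
    "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1": (0.0090377377346158, 0),
}
print(disagreements(results))  # prints []
```

An empty list means the two implementations agree on the class for every molecule checked.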

GemmaTuron commented 1 year ago

This is a great explanation thanks @masroor07 !

masroor07 commented 1 year ago

Thank you for the review!

masroor07 commented 1 year ago

Model:

OptiMol: Optimization of binding affinities in chemical space for drug discovery

Description:

An optimization pipeline that leverages complementary structure-based and ligand-based methods. The model introduces a new graph-to-SELFIES VAE. The model iteratively selects promising compounds in chemical space using a ligand-centered generative model and then performs molecular docking to guide compound optimization.

Slug:

optimol

Publication:

https://pubs.acs.org/doi/10.1021/acs.jcim.0c00833

Github Repository:

https://github.com/jacquesboitreaud/OptiMol

Summary:

The OptiMol model is implemented in PyTorch and DGL. The model was trained for 50 epochs using the Adam optimizer. It aims to improve on current state-of-the-art methods that do not leverage the structure of a target. A new VAE that is more computationally efficient while retaining state-of-the-art results is introduced. Instead of performing docking on a fixed drug bank, promising compounds are selected from the whole chemical space using a ligand-centered generative model, and molecular docking is then used as an oracle to guide compound optimization, allowing iterative generation of leads that better fit the target structure. This oracle is costly, and Bayesian optimization with a recently published method, Conditioning by Adaptive Sampling, was used to optimize the whole approach.

To generate compounds with high binding affinities, we could use one of the three binding affinity estimates:

Experimental bio-assays
Quantitative Structure Activity Relationship (QSAR)
Molecular docking software

These methods can give good results but can be inaccurate. OptiMol bypasses these limitations by resorting to a docking program that uses physics-based molecular mechanics force fields to compute binding affinities.

License:

None

GemmaTuron commented 1 year ago

Hi @masroor07 !

Thanks for the detailed explanation, this model looks very relevant to some projects we are working on at Ersilia! Let me give you a few extra pointers:

It is best to always cite the peer reviewed publication instead of the preprint, if available. In this case, it is published in Journal of Chemical Information and Modeling
The code link should bring us to the GitHub link, but it is the biorxiv still
Some extra information from the code, for example, which license it uses, would be good to have

masroor07 commented 1 year ago

Thank you for the positive review and the extra pointers.

Updated the link to the publication in the Journal of Chemical Information and Modeling
Seems like I had pasted the same link. Changed it to the GitHub repository.
Noted the third pointer as well!

GemmaTuron commented 1 year ago

Hi @masroor07 !

While you look for further models, can I ask you to include this model suggestion in our list?

Thanks!

masroor07 commented 1 year ago

Model 2:

MolGAN: An implicit generative model for small molecular graphs

Description:

A likelihood-free generative model that provides a way around expensive graph matching procedures. The model adapts generative adversarial networks, backed by RL, to generate chemical molecules with desired properties.

Slug:

molgan

Publication:

https://arxiv.org/abs/1805.11973 (PDF)

Github Repository:

https://github.com/nicola-decao/MolGAN

Summary:

MolGAN is an implicit generative model for small molecular graphs that can be jointly trained with a GAN and an RL objective to generate molecular graphs with higher validity and novelty. It can achieve better chemical property scores and avoids the additional overhead of sequential GAN models based on SMILES (which are themselves generated from a graph-based representation of molecules). The model consists of three main components: a generator, a discriminator and a reward network.

The generator takes a sample and generates an annotated graph representation of a molecule.
The discriminator compares the samples from the dataset and the generator and assigns scores to them.
The reward network is used to approximate the reward function. It also optimizes molecule generation.

Dataset: The model is trained using the QM9 dataset (a subset of the GDB-17 chemical database)

LICENSE:

MIT

GemmaTuron commented 1 year ago

Hi @masroor07 !

Thanks, good model and very detailed description, I appreciate it! My only concern is that it is a bit old (runs in Py3.6); let's hope it can be easily bumped to Py3.10! Can you add it to the model list while you look for a third model suggestion?

masroor07 commented 1 year ago

@GemmaTuron, Thank you for the positives! I did notice the python version but I guess there must be a way to make the required changes/updates to the code. Yes, sure! I will add it to the model list.

masroor07 commented 1 year ago

@GemmaTuron To add the model to the list, I have to fill in the form, right?

GemmaTuron commented 1 year ago

yes please: https://airtable.com/shroQLlkcmDcC0xzm

masroor07 commented 1 year ago

I filled the form! Thank you

masroor07 commented 1 year ago

Model 3:

EpitopeVec: Linear Epitope Prediction Using Deep Protein Sequence Embeddings

Description:

EpitopeVec is a model to predict linear B-cell epitopes (BCEs). It uses a combination of residue properties, modified antigenicity scales, and protein language model-based representations (protein vectors) as features of peptides for linear BCE predictions.

Slug:

epitopevec

Publication:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8652027/

Github Repository:

https://github.com/hzi-bifo/epitope-prediction

Summary:

The region of an antigen recognized by antibodies is known as an epitope, and if it is a continuous stretch of amino acids, it is a linear epitope. The identification of B-cell epitopes (BCEs) is important in many applications. EpitopeVec predicts linear B-cell epitopes. It is a tool that combines commonly used propensity scales, residue features and modified antigenicity scales for vector representation of the peptides. It is based on an SVM model trained on a large set of experimentally verified epitopes and makes use of different amino acid features.

Datasets: EpitopeVec is trained on various small and large datasets derived from Bcipep and the Immune Epitope Database (IEDB)

It takes protein sequences as input
Returns the list of peptides that can be epitopes.

License:

GPLv3

masroor07 commented 1 year ago

Hi @GemmaTuron, I was trying to reinstall NCATS-ADME on my system. I am facing a pickling issue, i.e. `_pickle.UnpicklingError: invalid load key, '<'.`

masroor07 commented 1 year ago

I was able to understand why a number of people face the pickling issue. The reason is simple: they had cloned the master branch of NCATS-ADME, which has outdated links and code that is not up to date. I cloned the "development" branch and that solved my issue. I just had to reinstall Flask to solve the jinja2 issue.
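
A quick check that fits the outdated-links explanation: an `invalid load key, '<'` error means the file being unpickled starts with `<`, which is typical of an HTML page saved in place of the binary checkpoint. The sketch below only looks at the first bytes of a file; `looks_like_html` is a hypothetical helper and the file path is whatever checkpoint was downloaded.

```python
def looks_like_html(path, probe=64):
    """Return True if the file starts with '<' (after leading whitespace),
    which suggests an HTML/XML page rather than a pickled model."""
    with open(path, "rb") as fh:
        head = fh.read(probe).lstrip()
    return head.startswith(b"<")
```

If this returns True for a checkpoint, re-downloading it from a working link (here, via the development branch) is the likely fix rather than changing the loading code.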

masroor07 commented 1 year ago

Update:

I tried running the NCATS-HLM model. The model doesn't accept the input. I tried passing a CSV file to it, and I also tried passing a text file, but it doesn't process either input format.

The error message: There was an error processing your file. Please make sure you have selected a file that contains SMILES, indicate if the file contains a header and the column number containing the SMILES.

Input CSV file: 10_SMILES.csv

I tried passing the same input file to PAMPA5 and PAMPA7.4 as well, and they were both able to process it.
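
To rule out the input file itself, the header and column settings named in the error message can be checked outside the app. This is a hedged sketch, not the app's own parser: `read_smiles` is hypothetical, and the 0-based column index is an assumption (the web form may count columns differently).

```python
import csv
import io


def read_smiles(text, column=0, has_header=True):
    """Return the SMILES strings found in `column` of CSV `text`.

    Raises ValueError on the first data row that has no value in that
    column, which is the kind of input that would fail to process.
    """
    rows = list(csv.reader(io.StringIO(text)))
    first_line = 2 if has_header else 1
    if has_header:
        rows = rows[1:]
    smiles = []
    for line_no, row in enumerate(rows, start=first_line):
        if not row:
            continue  # skip blank lines
        if column >= len(row) or not row[column].strip():
            raise ValueError(f"line {line_no}: no SMILES in column {column}")
        smiles.append(row[column].strip())
    return smiles


sample = "smiles\nCCCC\nCCO\n"
print(read_smiles(sample))
```

Since the same file worked with PAMPA5 and PAMPA7.4, a clean result here would point at the local HLM setup rather than the CSV.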

masroor07 commented 1 year ago

I tried running the deployed HLM model, which does process the input file. Here is the output file of the predictions run with it: ADME_Predictions_2023-03-26-183023.csv

masroor07 commented 1 year ago

UPDATE:

PAMPA 5 and PAMPA 7.4: I tried running predictions using both and, no, PAMPA 5 doesn't show PAMPA50 as the model; rather, it shows PAMPA as the model. We can also observe a more precise probability score than PAMPA5 in the later version. I tried running the model on my local system and using the deployed version as well.

Note: The authors should change the model output to PAMPA 7.4. That would make it clearer which model we are using. (Right?)

I ran tests for the first 10 SMILES from the EML.

INPUT: 10_SMILES.csv

OUTPUT:

Local test: ADME_Predictions_2023-03-26-212850.csv
Deployed model: ADME_Predictions_2023-03-26-211505 (WEB).csv

GemmaTuron commented 1 year ago

Hi @masroor07 !

That's great, so if I understand it correctly:

HLM model works
PAMPA5 works, with the results column stating PAMPA (only)
PAMPA7.4 works, with the results column stating? (sorry missed that one!)

If you can confirm that, I think we could reopen this issue and unarchive the model repo so that we can push the HLM model to the Hub as well. What do you think?

masroor07 commented 1 year ago

The HLM model that has been deployed to opendata works, yes.

No, PAMPA 7.4 works, shows PAMPA
PAMPA5 works, shows PAMPA50

GemmaTuron commented 1 year ago

I've reopened the issues to tackle them:

HLM incorporation: the issue is reopened - I suggest creating a new repo, and copying the code from another of the NCATS models but changing the model checkpoints for the appropriate ones?
PAMPA7.4: once HLM is working if there is time we can add the model as well

masroor07 commented 1 year ago

Submitted my final application.

Hello @GemmaTuron, thank you for all the help during the contribution period! I got to learn a ton of things. Thank you to the other contributors as well for all their help during the contribution phase. I look forward to continuing to contribute to Ersilia!
