ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Paulina Boadiwaa Mensah #839

Closed Boadiwaa closed 12 months ago

Boadiwaa commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

Boadiwaa commented 1 year ago

Bug report: Error running: ersilia -v run -i "CCCC" for model eos3b5e

The model is served successfully, but using it to calculate the molecular weight of the molecule "CCCC" ends in: TypeError: object of type 'NoneType' has no len()

logfile.txt

Kindly find attached the logfile.

For reference, I'm using WSL2, conda 23.7.4, Python 3.7.16
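For readers unfamiliar with this message: it is what Python raises when `len()` is called on `None`. The sketch below reproduces the message in isolation and shows one defensive pattern. The function names are hypothetical and are not taken from Ersilia's code; where the `None` actually originates in the Ersilia run is not established here.

```python
def count_results(results):
    """Return the number of results, treating a missing result set as empty."""
    if results is None:
        return 0
    return len(results)


def reproduce_error():
    """Call len() on None and return the text of the resulting TypeError."""
    try:
        len(None)
    except TypeError as exc:
        return str(exc)
    return ""


print(reproduce_error())
print(count_results(None))
```

A guard like `count_results` only hides the symptom; the logfile is still needed to find which step returned `None`.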

joiboi08 commented 1 year ago

Hi Paulina, please refer to the solutions provided in #821, where the same issue was encountered.

The recommended solution is to try reinstalling Ersilia, as mentioned here. There is also this route you can try to get it to work.

Boadiwaa commented 1 year ago

Thank you, @joiboi08! Interestingly, after reinstalling Ersilia several times I still run into the same error. I tried the approach described in https://github.com/ersilia-os/ersilia/issues/821#issuecomment-1744581901 and I got the output.

@DhanshreeA would it be possible to investigate this later? I know you prefer that the code is not modified but at least three people have complained about this issue... perhaps we are doing something wrong? I'm curious to find out!

carcablop commented 1 year ago

Hello @Boadiwaa, welcome to Ersilia! Please provide detailed information about your system and development environment, your Ersilia installation steps, and the steps you took to run the test model. Could you also provide your Python and conda versions? With Python 3.7 and conda 23.5.2, there are no errors in the installation and execution of Ersilia models. If your version of conda is the latest, could you try installing miniconda version 23.5.2? First of all, clean your base environment.

DhanshreeA commented 1 year ago

Hi @Boadiwaa, thanks for your efforts. We seem to be running into some issues with WSL. Ideally, Ersilia should work with a reasonably old version of conda and with Python versions above 3.7. However, as @carcablop mentioned, there are a couple of things you can do here: try a different conda version (or reinstall conda), or try a different Python version, and please report your findings here.
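When reporting findings across conda and Python versions, it can help to collect the environment details in one place. The following is a minimal sketch, not part of Ersilia: the helper names are hypothetical, and it only relies on the standard `conda --version` command and Python's standard library.

```python
import platform
import re
import subprocess
import sys


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def conda_version():
    """Return the output of `conda --version`, or None if conda cannot be run."""
    try:
        result = subprocess.run(
            ["conda", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or result.stderr.strip()


def environment_report():
    """Collect the details maintainers asked for into a dictionary."""
    return {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "conda": conda_version(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Comparing parsed tuples, for example `parse_version("conda 23.7.4") > parse_version("conda 23.5.2")`, makes it easy to state whether a tested version is newer than the one reported to work.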

Boadiwaa commented 1 year ago

@carcablop @DhanshreeA I cleaned my base environment multiple times and reinstalled conda version 23.5.2 and Python 3.7.16, but ran into the same error. All the preceding steps work fine, but the last step, calculating the molecular weight, throws the TypeError I documented.

I'm moving ahead to the other tasks, as I got the output after modifying the script (though now I know not to), but it would be an interesting challenge to figure out why the TypeError comes up for a few of the WSL users.

I am providing a step-by-step plan of my Ersilia Installation on here:

carcablop commented 1 year ago

Hi @Boadiwaa. Try uninstalling the isaura package and running the model again. It seems that Isaura is causing that error.

Boadiwaa commented 1 year ago

@carcablop you were right! After uninstalling Isaura I got the expected output with no errors! Thank you for sticking with this issue and figuring it out. Moving forward, how do you think we can factor in Isaura without facing issues?

Boadiwaa commented 1 year ago

My Week 1 Experience in the Contribution Phase with Ersilia

Task 1 - I joined the Slack Channels and the atmosphere in the community was warm. Everyone on there seems eager to learn and the energy is contagious. It's great figuring things out with helpful teammates. Fun stuff!

Task 2 - I have had some exposure to GitHub Issues, but this is the first time I am using it this extensively. It's been only a few days but I have learnt a lot. I'm hooked!

Task 3 - This was an interesting task for me:

  1. I got fired up seeing a "live demo" of a model from Ersilia's hub and seeing the usefulness of even the simplest model. This was a fun tutorial and well-written. Big thumbs up to the authors. :)
  2. I ran into an issue. It's cool when you get everything to work on the first try but a lot of learning happens from facing challenges and pushing through. I'm happy I could finally get to the finish line. Shout-outs to the team members who joined forces to try and sort this and the leaders @DhanshreeA and @carcablop for not ignoring our SoS messages! :)
  3. I guess my major takeaways were that A. (almost) everything is figure-out-able if you stick with the problem long enough, and don't give up too easily! B. Speed is important, but it shouldn't be at the expense of building a solid base.

I'm enjoying my time here! Thank you Ersilia, Outreachy and everyone involved for making this happen. I know I'll be piling up a bagful of both soft and hard skills.

Onwards!

Boadiwaa commented 1 year ago

My Motivation Statement for Joining Ersilia

I come from a medical background, and I pursued my interests in technology by studying programming and Machine Learning as an autodidact. I have been looking for opportunities that would challenge me to learn further, hone my skills and collaborate with like-minded people who want to solve important problems. After graduating from medical school, I approached the Global Health and Infectious Diseases Research Group at the Kumasi Centre for Collaborative Research into Tropical Medicine, Ghana, with a proposal: I believed Artificial intelligence could be applied to the work they were already doing to supercharge it and I was willing to put it to the test. The initial response was not enthusiastic as they had never had someone work in Artificial Intelligence in the Group, but I asked them to let me volunteer as an intern for six months. Within a year, the results were evident, and I was given more opportunity and responsibility: serving at various times as a data analyst, a research assistant, and most recently an Artificial Intelligence in Global Health Coordinator Support. I have taken advantage of the opportunity I worked for, to produce research papers, present at conferences, and run workshops all aimed at advancing awareness and knowledge on Data Science and Artificial Intelligence for improved Healthcare delivery in Africa.

When I found out about Outreachy my research informed me that it would be the ideal learning opportunity I had been looking for, so I jumped at it. Not only was I drawn to the challenging tasks, but also the opportunity for cross-border collaboration made me eager to be a part of it. With the few days I have been here, I have been thrilled to see that my expectations were not wrong. When I had to decide on a project to apply to, Ersilia was a no-brainer: it fit perfectly with my interests, past work and aspirations in applying Artificial Intelligence to improve health research and healthcare delivery in sub-Saharan Africa, and eventually, globally. I had confirmation that I had made the right choice when I experienced how streamlined the onboarding process was. The well-detailed guides for new interns made me confident of Ersilia's support system - this is a community that cares not only about the code, but the people behind it as well.

Improving Healthcare in Lower and Middle Income countries is a Herculean and ambitious task and it requires real dedication - some would even say, a calling. No one person can do it alone, and I am very excited to see that an organization like Ersilia exists for that purpose. My vision perfectly aligns with Ersilia's - it's like this opportunity was waiting ahead for me all this while! I am eager to learn, contribute, and grow with this community. I cannot wait to see the positive impact we will have when we work together with impassioned curiosity, resilience and an unquenchable thirst for learning and progress.

DhanshreeA commented 1 year ago

Hi @Boadiwaa thank you for the updates. You can get started with week 2 tasks.

Boadiwaa commented 1 year ago

Week 2 with Ersilia

TASK 1 I selected the "NCATS Rat Liver Microsomal Stability" model for 2 reasons:

  1. The usefulness of the model - The ability to predict the stability of compounds as they undergo hepatic metabolism is crucial in drug discovery and development. This model predicts the stability of compounds in rat microsomes, and its output is relevant for studies in humans further along in the drug development process. As a clinician, I have experienced firsthand how damaging it can be when drugs are unavailable to treat particular diseases, or when pathogens develop resistance to particular drugs and so new ones must be discovered or developed. I am especially excited that the model's performance is not restricted to a small subset of compounds. I am doubly excited and curious to explore this model!
  2. The type of model - I have been exploring Neural Networks recently, and I employed one in an attempt to predict gait speeds in older Ghanaian adults. (Findings of preliminary results are documented here: https://drive.google.com/drive/my-drive ). This would be an opportunity to learn more about Neural Networks, and this would be my introduction to Graph Convolutional Neural Networks in particular. I am eager to observe its performance.

TASK 2 I spent quite a lot of time here, and I realized that this is because strictly following the source code in the NCATS-ADME repository (https://github.com/ncats/ncats-adme) leads to all the models on their platform being downloaded, and not just the NCATS Rat Liver Microsomal Stability model I want to focus on this week. Here are the steps I took to install the model.

  1. I opened my terminal by pressing the Windows Key + R and then entering "cmd" in the pop-up window.
  2. Since I am using Windows Subsystem for Linux, I entered "wsl" in my terminal, and that started my pre-installed Ubuntu 20.04.6 LTS.
  3. Next, I cloned the model's GitHub repository using: `git clone --recursive https://github.com/ncats/ncats-adme.git` The output is recorded in this file: output1.txt
  4. I changed my directory to "ncats-adme", which is the directory the repository was cloned into. This was my code: `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli$ cd ncats-adme`
  5. I then entered the `ls` command to see the contents of the directory. Here is the code and output: `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli/ncats-adme$ ls` `20210917 Dockerfile-opendata Jenkinsfile-opendata client server Dockerfile-ncats Jenkinsfile-ncats README.md default.profraw`
  6. Next, I moved into the "server" sub-directory: `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli/ncats-adme$ cd server` `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli/ncats-adme/server$`
  7. I created the required environment using `conda env create --prefix ./env -f environment.yml` Output: output2.txt
  8. I activated the environment with: `conda activate ./env`
  9. I ran the model's app with `python app.py`

Here is where things got a little rough. After various connection breaks, the 6 models on the platform were loaded successfully. However, the script got stuck with the CYP450 models. The connection kept breaking when it got to model_21 of cyp3a4_subs (models for substrates of the CYP3A4 isozyme). Here is the logfile: output3.txt

Since these additional models weren't necessary for my task, I opened the app.py file in my current working directory and commented out this line of code: `from predictors.cyp450.cyp450_predictor import CYP450Predictor`

Finally, the app ran successfully: final_output.txt

Here is a screenshot of the homepage:

homepage

TASK 3 I downloaded the eml_canonical.csv file from Ersilia's GitHub repository (found under "notebooks"), and I uploaded it into the webapp. Here is a screenshot of the predictions that ran successfully:

predictions

And here is a logfile showing the output at the backend whilst the model ran the predictions: prediction_phase.txt

From the output at the backend, the model took approximately 3.7s to predict 442 molecules, which is pretty impressive! Here is the interpretation of the first 10 molecules: the first column is a representation of the structure of the molecule, and the second shows the model's prediction, where 0 maps to the "stable" class and 1 maps to the "unstable" class. The probability of each prediction is shown in brackets. For example, for row 2, the model's predicted probability that the molecule is unstable in rat liver microsomes is 0.71, which is not as strong a prediction as for molecule 8, which has a probability of 0.98 (98% likely to be unstable in rat liver microsomes).
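The reading of the prediction column described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the 0.5 cutoff between classes is an assumption, since the webapp's actual decision threshold is not stated here. The two probabilities are the ones quoted for rows 2 and 8.

```python
def interpret_prediction(prob_unstable, threshold=0.5):
    """Map a probability of the 'unstable' class to (class_id, label).

    The 0.5 threshold is an assumption made for illustration only.
    """
    if not 0.0 <= prob_unstable <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    class_id = 1 if prob_unstable >= threshold else 0
    label = "unstable" if class_id == 1 else "stable"
    return class_id, label


# Probabilities quoted above for rows 2 and 8 of the webapp output.
for row, prob in [(2, 0.71), (8, 0.98)]:
    class_id, label = interpret_prediction(prob)
    print(f"row {row}: class {class_id} ({label}), p(unstable) = {prob:.2f}")
```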

TASK 4 To run the model via Ersilia's Model Hub, I went to the hub's website: ersilia.io/model-hub and clicked on the "Microsomal Stability" tab. I then scrolled down to the "Rat liver microsomal stability" section and expanded it so I could read a summary of the model. Here I got further clarification on what "stable" and "unstable" meant which was really helpful: unstable meant the compound had an in-vitro half-life of ≤30 mins in a rat's liver microsomes, whereas stable meant >30 min. I then clicked on the "GitHub" tab which took me to the model's GitHub page as curated by Ersilia.

Here, in the README.md file, I got some more information on the model, including the interpretation of its output: Interpretation: Probability of a compound being unstable in RLM assay (half-life ≤ 30min)

ersilia_rlm
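The class definition from the model summary (unstable means an in-vitro half-life of 30 minutes or less, stable means more than 30 minutes) can be written as a small rule. This is a sketch with hypothetical names; the half-life values in the loop are made up purely to exercise the boundary.

```python
# Cutoff taken from the model summary: unstable means half-life <= 30 min.
RLM_HALF_LIFE_CUTOFF_MIN = 30.0


def stability_class(half_life_min):
    """Label a measured in-vitro half-life (in minutes) as stable or unstable."""
    if half_life_min < 0:
        raise ValueError("half-life cannot be negative")
    if half_life_min <= RLM_HALF_LIFE_CUTOFF_MIN:
        return "unstable"
    return "stable"


# Illustrative half-lives only; these are not measurements from the dataset.
for minutes in (12.0, 30.0, 45.0):
    print(f"{minutes} min: {stability_class(minutes)}")
```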

Then, I got to work in my terminal! I changed into my ersilia directory and activated the environment with these commands: `wsl`, `cd ersilia`, `conda activate ersilia`

Next, I opened Docker Desktop (which I had installed already) and ran `ersilia -v fetch eos5505` in my terminal. After a series of outputs I got this message: 👍 Model eos5505 fetched successfully! Then, I ran `ersilia serve eos5505`. My output was:

🚀 Serving model eos5505: ncats-rlm

URL: http://0.0.0.0:48535 PID: -1 SRV: pulled_docker

👉 To run model:

* run

These APIs are also valid:

* predict

💁 Information:

* info

Next, I ran `ersilia -v api --help` to gain further clarification on the parameters required to successfully run the model. This was the output:

Usage: ersilia api [OPTIONS] [API_NAME]

Run API on a served model

Options:

  -i, --input TEXT [required]
  -o, --output TEXT
  -b, --batch_size INTEGER
  -q, --quiet Hide all warnings and info logs
  --help Show this message and exit.

Finally, I ran `ersilia -v api predict -i /mnt/c/Users/pauli/Downloads/eml_canonical.csv -o output_2.csv`. The output of the command is in the file below: ersilia_compare.txt
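The fetch, serve and predict steps above can be collected into one small Python wrapper. This is a sketch, not an official Ersilia interface: the function names are hypothetical, and the only commands used are the ones already shown in this comment. By default it only prints the commands; pass `dry_run=False` to execute them in an environment where Ersilia is installed.

```python
import shlex
import subprocess


def ersilia_commands(model_id, input_csv, output_csv):
    """Build the three CLI calls used above, in the order they were run."""
    return [
        ["ersilia", "-v", "fetch", model_id],
        ["ersilia", "serve", model_id],
        ["ersilia", "-v", "api", "predict", "-i", input_csv, "-o", output_csv],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in commands:
        print("$", " ".join(shlex.quote(part) for part in command))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(command, check=True)


commands = ersilia_commands(
    "eos5505",
    "/mnt/c/Users/pauli/Downloads/eml_canonical.csv",
    "output_2.csv",
)
run_workflow(commands)
```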

Then, I opened my output csv file to compare the results with what was outputted on the ADME@NCATS webapp.

output_2_csv

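The "rlm_proba1" column in the CSV matches what is stated in the GitHub README file about the model's output. Comparing the first 10 predictions of the CSV file with the webapp's output, I noticed the following similarity and differences:

Similarity:

* The interpretation of the probabilities output by the model via Ersilia matched the prediction in the Prediction column on the ADME@NCATS webapp.

Differences:

* The probabilities from Ersilia were rounded to three decimal places, whereas those from the webapp were rounded to two decimal places.
* The probabilities from Ersilia are to be interpreted with respect to Class 1 ("unstable") only, whereas those from the webapp could be interpreted with respect to either of the two classes.
* There were some slight differences in the probabilities output via Ersilia and via the webapp. For example, the probability of being unstable for compound 2 was 0.595 via Ersilia, whereas from the webapp it was 0.71.

ersiliahub_vs_admencats_webapp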

TASK 5 As I had already installed both the Docker Daemon and Docker Desktop on my computer and they have been working perfectly, I did not re-install them.

That concludes my Week 2 tasks! I'm glad I persisted through the trying times when installing the models proved challenging with connectivity issues. Onwards!

Boadiwaa commented 1 year ago

Week 3 with Ersilia

TASK 1 After searching through various publications with PubMed and Google Scholar, I came across the Open Drug Discovery Toolkit (ODDT), and I think it would be of interest to Ersilia. It is a free and open-source tool, packaged as a Python library for drug discovery, that bundles various machine learning methods and some external software commonly used in drug discovery pipelines into one package. Examples of what it provides are algorithms for filtering molecules, docking ligands, and evaluating models using RF-scores for tasks such as predicting compound activities.

It would be relevant for the Ersilia Model hub as it provides a one-stop solution for many tasks needed in a typical drug discovery workflow.

Link to Journal Publication: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0078-2#Sec9

Link to GitHub repository: https://github.com/oddt/oddt

Link to Library's documentation: https://oddt.readthedocs.io/en/latest/index.html

License: BSD-3-Clause License

How I installed it

test1_oddt.txt

Output:

imatinib

TASK 2 My 2nd model to suggest is DeePred-BBB. It is a free, open-source model, based on a Deep Neural Network and a one-dimensional Convolutional Neural Network (CNN-1D), that takes in the SMILES notation of a compound and returns a prediction as to whether the compound is permeable through the Blood Brain Barrier or not. It outputs 0 for non-permeable and 1 for permeable. The features used in creating the model were engineered via PaDEL, a free software tool for calculating molecular descriptors and fingerprints.

I think this model would be of interest to Ersilia because the Blood Brain Barrier (BBB) is a significant structure in the human body, and the ability of compounds to cross this barrier influences what they are used for. Permeability of the BBB is a deciding factor in the manufacturing of anaesthetics, muscle relaxants and many other drugs. Its relevance cannot be overstated in drug development, as it strongly influences why one drug is chosen over another for specific purposes even if they share a similar drug class. Furthermore, the authors reported high prediction accuracies (>97%) during model development. Finally, the model is easy to implement, with clear steps for installation documented in its GitHub repo.

Link to Journal Publication: https://www.frontiersin.org/articles/10.3389/fnins.2022.858126/full

Link to GitHub repository: https://github.com/12rajnish/DeePred-BBB

Data model was trained on: https://www.frontiersin.org/articles/10.3389/fnins.2022.858126/full#supplementary-material

How I installed it

Next, I decided to test the model on the Essential Medicines list from Ersilia's repository. This file is in .csv format, and since the model requires a .smi file as input, I created a Python script to convert the Essential Medicines list into a .smi file that includes the names of the compounds and their SMILES notation, as required for the predictions to be generated. Here is the Python script I created: csv_to_smi.txt

I saved this as a Python file (with a .py extension) in my working directory and ran it with `python csv_to_smi.py`. My output was an eml.smi file in the working directory. I opened the file and made these modifications:

  1. I deleted the header row ("smiles", "name"), leaving only the actual SMILES notations and the names of the compounds.
  2. For compounds with multi-word names, e.g. "acetylsalicylic acid", I replaced the spaces with underscores, e.g. "acetylsalicylic_acid". Here is a subset of my eml.smi file (in text format): eml.txt
  3. I then re-ran the command `python DeePred-BBB_Script.py`. This logfile shows the output: eml_predict.txt Here is the CSV file generated, showing the model's predictions for the subset of the compounds I shared above: DeePred-BBB_predictions.csv
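The conversion and the two manual clean-up steps above could also be done in a single pass. The sketch below is not the attached csv_to_smi script; it is an illustration under a few assumptions: the CSV has columns named "smiles" and "name", and a single space separates the SMILES string from the compound name in the .smi file. The two sample rows are small stand-ins for the Essential Medicines list.

```python
import csv
import io


def csv_to_smi(csv_file, smi_file, smiles_col="smiles", name_col="name"):
    """Write one 'SMILES name' line per CSV row and return the row count.

    No header row is written, and whitespace inside names becomes
    underscores, mirroring the manual edits described above.
    """
    reader = csv.DictReader(csv_file)
    count = 0
    for row in reader:
        smiles = row[smiles_col].strip()
        if not smiles:
            continue  # skip rows without a structure
        name = "_".join(row[name_col].split())
        smi_file.write(f"{smiles} {name}\n")
        count += 1
    return count


# In-memory sample so the sketch runs without the real file.
sample = io.StringIO(
    "smiles,name\n"
    "CC(=O)Oc1ccccc1C(=O)O,acetylsalicylic acid\n"
    "CC(=O)O,acetic acid\n"
)
out = io.StringIO()
written = csv_to_smi(sample, out)
print(out.getvalue(), end="")
```

With real files, the same function can be called as `csv_to_smi(open("eml_canonical.csv", newline=""), open("eml.smi", "w"))`, ideally inside a `with` block so both files are closed.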

With the exception of acetic acid, all the other outputs were either 0 or 1. I will need to investigate the prediction for acetic acid further.
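A quick way to spot rows like this is to flag any prediction that is not one of the two expected labels. The sketch below uses hypothetical compound names and made-up values; the actual value returned for acetic acid is not reproduced here.

```python
def flag_unexpected(predictions, allowed=(0, 1)):
    """Return the entries whose prediction is not one of the allowed labels."""
    return {
        name: value
        for name, value in predictions.items()
        if value not in allowed
    }


# Hypothetical placeholder data, not the real DeePred-BBB output.
example = {"compound_a": 0, "compound_b": 1, "compound_c": None}
print(flag_unexpected(example))
```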

TASK 3 My 3rd model suggestion is HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction. It is an open-source deep learning architecture consisting of a 3-dimensional convolutional neural network utilizing channel-wise attention and two graph convolutional networks. It takes in a protein structure file and a ligand structure file and outputs the binding affinity between the two structures as pKd and/or shows a pictorial representation.
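For context on the output unit: pKd is the negative base-10 logarithm of the dissociation constant Kd expressed in molar, so a higher pKd means tighter binding. The helper names below are hypothetical and are not part of HAC-Net; this is only a sketch of the conversion.

```python
import math


def pkd_from_kd(kd_molar):
    """Convert a dissociation constant in molar to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)


def kd_from_pkd(pkd):
    """Convert pKd back to a dissociation constant in molar."""
    return 10.0 ** (-pkd)


# A Kd of 1 nM (1e-9 M) corresponds to a pKd of about 9.
print(round(pkd_from_kd(1e-9), 3))
```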

I think this model would be of interest to Ersilia because protein-ligand binding affinity measures how strongly a potential drug (ligand) can attach to its target protein in the body. Knowing binding affinity early in the drug discovery process can expedite the process by helping to identify the most promising drug candidates and discard those with poor binding affinity. The model can be accessed either as a Python package: https://pypi.org/project/HACNet/ or via its Google Colab notebook: https://colab.research.google.com/github/gregory-kyro/HAC-Net/blob/main/HACNet.ipynb#scrollTo=3EjmhS9bEE2I or via the GitHub repository linked below:

Link to Journal Publication: https://arxiv.org/abs/2212.12440

Link to GitHub repository: https://github.com/gregory-kyro/HAC-Net

Data used to train model: Available in the GitHub repo linked above as open source

License: MIT License

How I tested the model

I tested the model via the Google Colab notebook on some protein-ligand pairs I downloaded from the PDBbind-CN database. I uploaded the structure files into the Colab session and then copied and pasted their paths into the required cell in the notebook. Here are screenshots of my inputs and the output from the model:

Input:

prot-lig

Output:

prot_lig_output

AND THAT CONCLUDES MY WEEK THREE TASKS! THIS HAS BEEN AN ENGAGING AND INSIGHTFUL EXPERIENCE...AND I HOPE TO HAVE MORE OF IT!

DhanshreeA commented 1 year ago

> Comparing the first 10 predictions of the CSV file with the webapp's output, I noticed the following similarity and differences:
>
> Similarity:
>
> * The interpretation of the probabilities output by the model via Ersilia matched the prediction in the Prediction column on the ADME@NCATS webapp.
>
> Differences:
>
> * The probabilities from Ersilia were rounded to three decimal places, whereas those from the webapp were rounded to two decimal places.
> * The probabilities from Ersilia are to be interpreted with respect to Class 1 ("unstable") only, whereas those from the webapp could be interpreted with respect to either of the two classes.
> * There were some slight differences in the probabilities output via Ersilia and via the webapp. For example, the probability of being unstable for compound 2 was 0.595 via Ersilia, whereas from the webapp it was 0.71.

Hi @Boadiwaa, thank you for the very detailed updates and all your efforts. You mentioned that for some molecules there are differences between the outputs from the NCATS repo for the RLM model and the Ersilia implementation. If possible, and if you have the time, could you provide a CSV file here with the following columns: col 1: smile, col 2: original_output (unstable only), col 3: ersilia output? While decimal rounding is not a major issue, outright differences in the values are a cause for concern and we'd like to investigate further.

Thanks a lot!

Also you can proceed with the week 3 tasks.
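One possible way to assemble the requested three-column CSV is sketched below. The function names, the `differs` flag and the 0.006 tolerance (chosen to absorb rounding to two versus three decimal places) are assumptions for illustration. The SMILES keys are placeholders; 0.71 and 0.595 are the values reported in this thread for compound 2, and the second row uses made-up values to show a rounding-only difference.

```python
import csv
import io

FIELDS = ["smiles", "original_output", "ersilia_output", "abs_diff", "differs"]


def build_comparison(original, ersilia, tolerance=0.006):
    """Pair up probabilities by SMILES and flag gaps larger than rounding."""
    rows = []
    for smiles, orig_p in original.items():
        if smiles not in ersilia:
            continue  # only compare molecules present in both outputs
        ers_p = ersilia[smiles]
        diff = abs(orig_p - ers_p)
        rows.append({
            "smiles": smiles,
            "original_output": orig_p,
            "ersilia_output": ers_p,
            "abs_diff": round(diff, 3),
            "differs": diff > tolerance,
        })
    return rows


def write_comparison(rows, handle):
    """Write the comparison rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder keys stand in for real SMILES strings.
original = {"<compound 2 SMILES>": 0.71, "<placeholder SMILES>": 0.50}
ersilia = {"<compound 2 SMILES>": 0.595, "<placeholder SMILES>": 0.503}

rows = build_comparison(original, ersilia)
buffer = io.StringIO()
write_comparison(rows, buffer)
print(buffer.getvalue(), end="")
```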

Boadiwaa commented 1 year ago

Week 2 with Ersilia

TASK 1 I selected the "NCATS Rat Liver Microsomal Stability" model for 2 reasons:

1. **The usefulness of the model** - The ability to predict the stability of compounds as they undergo hepatic metabolism is highly crucial in drug discovery and development. This model predicts the stability of compounds in rat microsomes, and it's output is relevant for studies in humans down the line in the drug development process. As a clinician, I have experienced firsthand how damaging it can be when drugs are unavailable to treat particular diseases, or when pathogen develop resistance to particular drugs and so new ones must be discovered or developed. I am especially excited that the model's performance is not restricted to a small subset of compounds. I am doubly excited and curious to explore this model!

2. **The type of model** - I have been exploring Neural Networks recently, and I employed one in an attempt to predict gait speeds in older Ghanaian adults. (Findings of preliminary results are documented here: https://drive.google.com/drive/my-drive ). This would be an opportunity to learn more about Neural Networks, and this would be my introduction to Graph Convoluted Neural Networks in particular. I am eager to observe it's performance.

TASK 2 I spent quite a lot of time here and I realized that is because strictly following the source code in the NCATS-ADME repository () leads to all the models on their platform being downloaded, and not just the NCATS Rat Liver Microsomal Stability model I want to focus on this week. Here are the steps I took to install the model.

1. I opened my terminal with the Windows Key + R and then inputting "cmd" in the pop-up window.

2. Since I am using Windows Subsystem for Linux, I entered "wsl" in my terminal and that started my pre-installed Ubuntu 20.04.6 LTS

3. Next I cloned the model's GitHub repository using: "git clone --recursive https://github.com/ncats/ncats-adme.git" The output is recorded in this file:
   [output1.txt](https://github.com/ersilia-os/ersilia/files/12886744/output1.txt)

4. I changed my directory to "ncats-adme", which is the directory the repository was cloned into. This was my code:
   `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli$ cd ncats-adme`

5. I then entered the `ls` command to see the contents of the directory. Here is the command and its output:
   `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli/ncats-adme$ ls`
   `20210917          Dockerfile-opendata  Jenkinsfile-opendata  client           server`
   `Dockerfile-ncats  Jenkinsfile-ncats    README.md             default.profraw`

6. Next, I moved into the "server" sub-directory.
   `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli/ncats-adme$ cd server`
   `(base) boadiwaa@PaulinaMensah:/mnt/c/Users/pauli/ncats-adme/server$`

7. I created the required environment using `conda env create --prefix ./env -f environment.yml`
   Output:
   [output2.txt](https://github.com/ersilia-os/ersilia/files/12886812/output2.txt)

8. I activated the environment with: `conda activate ./env`

9. I ran the model's app with `python app.py`

Here is where things got a little rough. After various connection breaks, the 6 models on the platform were loaded successfully. However, the script got stuck on the CYP450 models: the connection kept breaking when it reached model_21 of cyp3a4_subs (models for substrates of the CYP3A4 isozyme). Here is the logfile: output3.txt

Since these additional models weren't necessary for my task, I opened the app.py file in my current working directory and commented out this line of code:

`from predictors.cyp450.cyp450_predictor import CYP450Predictor`

Finally, the app ran successfully: final_output.txt

Here is a screenshot of the homepage: homepage

TASK 3 I downloaded the eml_canonical.csv file from Ersilia's GitHub repository (found under "notebooks") and uploaded it into the webapp. Here is a screenshot of the predictions, which ran successfully: predictions

And here is a logfile showing the output at the backend while the model ran the predictions: prediction_phase.txt

From the output at the backend, the model took approximately 3.7s to predict 442 molecules, which is pretty impressive!

Here is the interpretation of the first 10 molecules: the first column is a representation of the physical structure of the molecule, and the second shows the model's prediction, where 0 maps to the "stable" class and 1 maps to the "unstable" class. The probability of each prediction is shown in brackets. For example, for row 2, the model predicts that the probability that the molecule is unstable in a rat's liver microsomes is 0.71, which is not as strong a prediction as for molecule 8, which has a probability of 0.98 (98% likely to be unstable in a rat's liver microsomes).

TASK 4 To run the model via Ersilia's Model Hub, I went to the hub's website, ersilia.io/model-hub, and clicked on the "Microsomal Stability" tab. I then scrolled down to the "Rat liver microsomal stability" section and expanded it so I could read a summary of the model.

Here I got further clarification on what "stable" and "unstable" meant, which was really helpful: unstable meant the compound had an in-vitro half-life of ≤30 min in a rat's liver microsomes, whereas stable meant >30 min. I then clicked on the "GitHub" tab, which took me to the model's GitHub page as curated by Ersilia. There, in the README.md file, I got some more information on the model, including the interpretation of its output:

Interpretation: Probability of a compound being unstable in RLM assay (half-life ≤ 30min)

ersilia_rlm
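The labelling convention described above (class 1 is "unstable", meaning a half-life of 30 minutes or less; class 0 is "stable") can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the "label (probability)" display string is an assumed approximation of how the webapp shows a prediction, not its actual code.

```python
# Threshold taken from the model description: half-life <= 30 min means "unstable".
HALF_LIFE_THRESHOLD_MIN = 30

# Class encoding described for the webapp output: 0 -> stable, 1 -> unstable.
CLASS_LABELS = {0: "stable", 1: "unstable"}


def class_from_half_life(half_life_min):
    """Return 1 (unstable) if the half-life is <= 30 min, else 0 (stable)."""
    return 1 if half_life_min <= HALF_LIFE_THRESHOLD_MIN else 0


def describe_prediction(predicted_class, probability):
    """Format a prediction as 'label (probability)'; the format is an assumption."""
    return f"{CLASS_LABELS[predicted_class]} ({probability:.2f})"


# Row 2 of the webapp output discussed above: class 1 with probability 0.71.
row_2 = describe_prediction(1, 0.71)
print(row_2)
```

Note that the boundary case (exactly 30 minutes) falls in the "unstable" class, following the "≤ 30min" wording in the README.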

Then, I got to work in my terminal! I changed into my ersilia directory and activated the environment with these commands:

`wsl`
`cd ersilia`
`conda activate ersilia`

Next, I opened Docker Desktop (which I had already installed) and ran `ersilia -v fetch eos5505` in my terminal. After a series of outputs I got this message: 👍 Model eos5505 fetched successfully!

Then, I ran `ersilia serve eos5505`. My output was:

🚀 Serving model eos5505: ncats-rlm
URL: http://0.0.0.0:48535
PID: -1
SRV: pulled_docker

👉 To run model:

* run

These APIs are also valid:

* predict

💁 Information:

* info

Next, I ran `ersilia -v api --help` to clarify the parameters required to run the model. This was the output:

`Usage: ersilia api [OPTIONS] [API_NAME]`
`Run API on a served model`
`Options:`
`-i, --input TEXT [required]`
`-o, --output TEXT`
`-b, --batch_size INTEGER`
`-q, --quiet Hide all warnings and info logs`
`--help Show this message and exit.`

Finally, I ran:

`ersilia -v api predict -i /mnt/c/Users/pauli/Downloads/eml_canonical.csv -o output_2.csv`

The output of the command is in the file below: ersilia_compare.txt

Then, I opened my output csv file to compare the results with the output of the ADME@NCATS webapp.

output_2_csv
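For anyone who wants to script this step, here is a minimal sketch that assembles the same `ersilia api` command from Python. It only uses the options listed in the help output above (`-i`, `-o`, `-b`); the helper names `build_api_command` and `run_api` are hypothetical, and `run_api` assumes the model is already being served in the active environment.

```python
import subprocess


def build_api_command(input_path, output_path, api_name="predict",
                      batch_size=None, verbose=True):
    """Assemble the argument list for `ersilia api` without executing it."""
    command = ["ersilia"]
    if verbose:
        command.append("-v")
    command += ["api", api_name, "-i", input_path, "-o", output_path]
    if batch_size is not None:
        # -b / --batch_size takes an integer according to the help text.
        command += ["-b", str(batch_size)]
    return command


def run_api(input_path, output_path, **kwargs):
    """Run the command; assumes `ersilia serve <model>` has already been run."""
    command = build_api_command(input_path, output_path, **kwargs)
    return subprocess.run(command, check=True)


# Build (but do not run) the command used in this task.
command = build_api_command("eml_canonical.csv", "output_2.csv")
print(" ".join(command))
```

Keeping the command construction separate from the execution makes it easy to check the arguments before anything is sent to the served model.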

The "rlm_proba1" column in the CSV matches what is stated in the GitHub README file about the model's output. Comparing the first 10 predictions in the CSV file with the webapp's output, I noticed the following similarity and differences:

Similarity:

* The interpretation of the probabilities output by the model via Ersilia matched the prediction in the Prediction column on the ADME@NCATS webapp.

Differences:

* The probabilities from Ersilia were rounded to three decimal places, whereas those from the webapp were rounded to two decimal places.

* The probabilities from Ersilia are to be interpreted with respect to Class 1 ("unstable") only, whereas those from the webapp could be interpreted with respect to either of the two classes.

* There were some differences between the probabilities output via Ersilia and via the webapp. For example, the probability of being unstable for compound 2 was 0.595 via Ersilia, whereas on the webapp it was 0.71.

ersiliahub_vs_admencats_webapp
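To separate differences that are explained by rounding alone from genuine discrepancies, a small check like the following could be used. It is a sketch under stated assumptions: the tolerance is half a unit in the last decimal place of each source (three decimals for Ersilia, two for the webapp), the function names are hypothetical, and only the compound 2 values come from the comparison above; the second row is made up for illustration.

```python
def rounding_tolerance(decimals_a=3, decimals_b=2):
    """Largest gap that rounding to the given decimal places could explain."""
    return 0.5 * 10 ** -decimals_a + 0.5 * 10 ** -decimals_b


def flag_discrepancies(rows, tolerance=None):
    """Return the rows whose two probabilities differ by more than the tolerance."""
    if tolerance is None:
        tolerance = rounding_tolerance()
    return [
        (name, ersilia_p, webapp_p)
        for name, ersilia_p, webapp_p in rows
        if abs(ersilia_p - webapp_p) > tolerance
    ]


rows = [
    ("compound 2", 0.595, 0.71),      # values reported in the comparison above
    ("hypothetical A", 0.984, 0.98),  # made-up row: differs only by rounding
]

flagged = flag_discrepancies(rows)
print(flagged)
```

With this tolerance, the 0.595 versus 0.71 pair is flagged, while a pair such as 0.984 versus 0.98 is treated as a rounding difference.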

TASK 5 As I had already installed both the Docker Daemon and Docker Desktop on my computer and they have been working perfectly, I did not re-install them.

That concludes my Week 2 tasks! I'm glad I persisted through the trying times when connectivity issues made installing the models challenging. Onwards!

Hi @Boadiwaa, thank you for the very detailed updates and all your efforts. You mentioned that for some molecules there are differences between the outputs from the NCATS repo for the RLM model and the Ersilia implementation. If possible, and if you have the time, could you provide a csv file here with the following columns (col1: smile, col 2: original_output (unstable only), col 3: ersilia output)? While decimal rounding is not a major issue, outright differences in the values are a cause for concern and we'd like to investigate further.

Thanks a lot!

Also you can proceed with the week 3 tasks.

@DhanshreeA please find the csv file comparing the original model's output with the output of Ersilia's implementation for the compounds classified as "unstable". I have highlighted in red the entries with significant discrepancies between the two models' outputs. I am also quite curious to know why this might be the case!

adme_ncat_vs_ersilia.csv

Would you like me to do something similar for the "stable" predictions as well? I could subtract the Ersilia version's output from 1 to find the corresponding figure for the stable classes and then compare to ADME-NCAT's original figure.
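The conversion proposed here could look like the sketch below: the Ersilia output (probability of "unstable") is subtracted from 1 to get the corresponding "stable" figure, and both values are written side by side. The column names, the helper names, and the example row (the SMILES "CCCC" with made-up probabilities) are all hypothetical placeholders, not real model outputs.

```python
import csv
import io


def stable_probability(unstable_probability):
    """Complement of the 'unstable' probability, rounded to three decimals."""
    return round(1.0 - unstable_probability, 3)


def write_stable_comparison(rows, handle):
    """Write smiles, original stable probability, and converted Ersilia value."""
    writer = csv.writer(handle)
    writer.writerow(["smiles", "original_stable", "ersilia_stable"])
    for smiles, original_stable, ersilia_unstable in rows:
        writer.writerow([smiles, original_stable, stable_probability(ersilia_unstable)])


# Placeholder row: (SMILES, webapp "stable" probability, Ersilia "unstable" probability).
example_rows = [("CCCC", 0.6, 0.395)]

buffer = io.StringIO()
write_stable_comparison(example_rows, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file handle instead would produce a csv file on disk.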

GemmaTuron commented 12 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!