ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.

https://ersilia.io

GNU General Public License v3.0


✍️ Contribution period: Tsion Zeleke #880

Closed TsionZerihun closed 7 months ago

TsionZerihun commented 8 months ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!
[x] Install and run Docker!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

TsionZerihun commented 8 months ago

Task 1-3

I joined the slack channel and introduced myself :heavy_check_mark:

I opened this issue on GitHub :heavy_check_mark:

Install pre-requisite & ersilia :heavy_check_mark:

pre-requisite


@DESKTOP-92JJ0KD:~$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

@DESKTOP-92JJ0KD:~$ git version
git version 2.17.1

@DESKTOP-92JJ0KD:~$ git-lfs install
Git LFS initialized.
@DESKTOP-92JJ0KD:~$ git lfs --version
git-lfs/3.4.0 (GitHub; linux amd64; go 1.20.6; git d06d6e9e)

(ersilia) @DESKTOP-92JJ0KD:~$ conda list isaura
Name Version Build Channel
isaura 0.1 pypi_0 pypi

@DESKTOP-92JJ0KD:~$ docker --version
Docker version 20.10.21, build 20.10.21-0ubuntu1~18.04.3

- ### *Installed ersilia successfully without error*
```console
@DESKTOP-92JJ0KD:~/ersilia$ ersilia --help
Usage: ersilia [OPTIONS] COMMAND [ARGS]...
---
  ...🦠 Welcome to Ersilia! 💊

@DESKTOP-92JJ0KD:~/ersilia$ ersilia catalog...
[
    {
        "Identifier": "eos1086"
    } ...
```

TsionZerihun commented 8 months ago

Test the simplest model eos3b5e :heavy_check_mark:

I successfully fetched and ran the model


(ersilia) @DESKTOP-92JJ0KD:~$ ersilia -v fetch eos3b5e
⬇️  Fetching model eos3b5e: molecular-weight...
---
...👍 Model eos3b5e fetched successfully!

(ersilia) @DESKTOP-92JJ0KD:~$ ersilia serve eos3b5e
🚀 Serving model eos3b5e: molecular-weight
URL: http://127.0.0.1:42699...

...💁 Information:

info

- ### *However, I faced an error when running prediction*

@DESKTOP-92JJ0KD:~$ ersilia -v run -i "CCC" > my.log 2>&1

Check attached error log file for more detail.
[model_eos3b5e_error.log](https://github.com/ersilia-os/ersilia/files/13115534/model_eos3b5e_error.log)

- ### *I tried removing isaura based on this [discussion](https://github.com/ersilia-os/ersilia/issues/839)*

@DESKTOP-92JJ0KD:~$ python -m pip uninstall isaura
Found existing installation: isaura 0.1
Uninstalling isaura-0.1:
...
Successfully uninstalled isaura-0.1


- ### *I was able to successfully run a prediction after uninstalling isaura and rerunning*

@DESKTOP-92JJ0KD:~$ ersilia -v api run -i "CCCC"
{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "mw": 58.123999999999995
    }
}


Check attached prediction success log file for more detail.
[model_eos3b5e_sucess.log](https://github.com/ersilia-os/ersilia/files/13118376/model_eos3b5e_sucess.log)

TsionZerihun commented 8 months ago

Key takeaways from completing above tasks

- Lesson Learned

Isaura: after facing issues running the model while isaura was installed, I looked into what this particular Python package does. Its purpose is to cache previously calculated properties (it is hosted under the Ersilia repositories).

I will make sure to reinstall isaura if caching is necessary when running future models.

"CCCC": the SMILES string for butane (C4H10), a straight chain of four carbon atoms joined by single bonds

- Summary

The above task runs the model eos3b5e with the molecule we provide, which is "CCCC", and determines its molecular weight in g/mol.
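As a quick sanity check on the value the model returned, the molecular weight of butane can be reproduced by hand from its formula. The sketch below is only illustrative: it assumes standard atomic weights of 12.011 for carbon and 1.008 for hydrogen, and the helper name `alkane_molecular_weight` is hypothetical (it is not part of Ersilia or of the eos3b5e model).

```python
# Standard atomic weights in g/mol (assumed values, rounded to three decimals).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}


def alkane_molecular_weight(n_carbons: int) -> float:
    """Return the molecular weight of a linear alkane CnH(2n+2) in g/mol."""
    if n_carbons < 1:
        raise ValueError("an alkane needs at least one carbon atom")
    n_hydrogens = 2 * n_carbons + 2
    return n_carbons * ATOMIC_WEIGHT["C"] + n_hydrogens * ATOMIC_WEIGHT["H"]


# "CCCC" is butane: four carbons and ten hydrogens.
butane_mw = alkane_molecular_weight(4)
print(round(butane_mw, 3))
```

The result, 4 × 12.011 + 10 × 1.008, lands on the same value as the `mw` field in the model output once rounded to three decimals.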

TsionZerihun commented 8 months ago

Motivation Letter

Self Introduction

I'd like to introduce myself as a person who is enthusiastic about automation and its impact. In addition to studying computer science, I completed a one-year intensive software engineering program with a focus on backend development at ALX. Over the past few years, I have worked in various sectors, including full-stack development and data analysis, using my research and analytical skills to improve smallholder farmers' lives.

How I was Introduced to Machine Learning

I was first introduced to ML when working on a project on developing a REST API for a Y-Maze test. It was based on a pretrained model aiming to automate laboratory researchers' tasks. The project initially seemed intimidating, but with intensive research, I learned a lot and am grateful for the exposure.

Ersilia

It was fascinating to see that Ersilia has already begun using currently available technologies to help the world. I think that technology has the potential to change people's lives in different sectors, but that potential is not fully utilized, and there is much yet to be discovered. Research centers and labs are not as fully automated as they could be. It was quite exciting to run the previous (Ersilia week 1) task with minimal commands and setup: I was able to run a test that might otherwise have taken a significant amount of time, energy, and skill. Ersilia, in my opinion, is achieving what most software engineers and I hope to achieve, which is using various technologies to improve the world and serve as an inspiration for future innovations. I believe we are at the perfect time, when initiatives and ideas can easily be brought to light, but many people still lack knowledge and tools. Growing open-source projects like Ersilia can help individuals learn and contribute to the community.

Conclusion

It would be a great pleasure to join a nonprofit organization that aims to assist experts in discovering new drugs for treating infectious and neglected diseases using the latest technologies, making it convenient and time-saving. My long-term goal is to work on a project that improves people's lives. I have the vision to enhance my country and the world by coming up with new solutions to various problems. I would love to be a part of the team so that I can give back to the community. I'm eager to contribute to and learn from Ersilia!

Best, Tsion Zeleke

TsionZerihun commented 8 months ago

Week 2

Select a model from the suggested list(Task 1) :heavy_check_mark:

I selected Rat Liver Microsomal Stability (RLM) from NCATS
- Microsomal stability: the rate at which a new chemical entity (a drug candidate) is metabolized, measured as its time-dependent decrease in an incubation mixture containing liver microsomes.
The model uses rat liver microsomes to test this stability, which helps in screening drug candidates at an early stage of drug development.

TsionZerihun commented 8 months ago

Install the model in your system (Task 2):heavy_check_mark:

I cloned this Repo

(base) robel@DESKTOP-92JJ0KD:~$ git clone --recursive https://github.com/ncats/ncats-adme.git
Cloning into 'ncats-adme'...
remote: Enumerating objects: 3640, done....
...
(base) robel@DESKTOP-92JJ0KD:~$ ls
bentoml  eos  ersilia  miniconda3  model_eos3b5e_error.log  model_eos3b5e_sucess.log  ncats-adme

I commented out the other models, since this Slack instruction mentioned that the server would download every model at once, which would take a lot of time


---
#from predictors.hlm.hlm_predictor import HLMPredictior
#from predictors.pampa.pampa_predictor import PAMPAPredictior
#from predictors.pampa50.pampa_predictor import PAMPA50Predictior
#from predictors.pampabbb.pampa_predictor import PAMPABBBPredictior...
---
...def predict(): ...
---
for model in models:
    response[model] = {}
    error_messages = []

    # if model.lower() == 'hlm':
    #     predictor = HLMPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
    if model.lower() == 'rlm':
        predictor = RLMPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
    # elif model.lower() == 'pampa':
    #    predictor = PAMPAPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
    #elif model.lower() == 'pampa50':
    #    predictor = PAMPA50Predictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
    # elif model.lower() == 'pampabbb':
    #    predictor = PAMPABBBPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
    elif model.lower() == 'solubility':
        predictor = SolubilityPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
    # elif model.lower() == 'hlc':
    #    predictor = LCPredictor(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
    # elif model.lower() == 'cyp450':
    #    predictor = CYP450Predictor(kekule_mols = working_df['mols'].values, smiles=working_df[smi_column_name].values)
    else:
        break...

- ### *I faced an error when trying to create the environment based on the installation guide in the Repo*
    * I modified the command from the official Repo from `conda env create --prefix ./env -f environment.yml` to `conda env create --prefix ./env -f server/environment.yml`, since the environment.yml file has been moved to the server directory
```console
@DESKTOP-92JJ0KD:~/ncats-adme$ cd server
@DESKTOP-92JJ0KD:~/ncats-adme/server$ conda env create --prefix ./env -f environment.yml
Collecting package metadata (repodata.json): - Killed
```

I was getting a `Collecting package metadata (repodata.json): \ Killed` error
- I found out from this Slack post that it was a RAM problem
- Because I was using WSL, the Ubuntu terminal was only allotted half of my computer's RAM
- I followed the instructions in this video to increase the memory allocated to my WSL Ubuntu
- Finally, I was able to set up the environment
```
(base) @DESKTOP-92JJ0KD:~/ncats-adme$ conda activate "./env"
(/home/robel/ncats-adme/env) robel@DESKTOP-92JJ0KD:~/ncats-adme$
#Environment successfully changed to env
```

The next step was running app.py

I faced `ModuleNotFoundError` errors for typed-argument-parser, request, HealthCheck, and flask-swagger-ui
I manually installed them using `pip` and ran app.py again
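Instead of discovering missing packages one crash at a time, the imports can be checked up front. The following is a minimal sketch, not part of the NCATS-ADME code: the import names in `REQUIRED` are assumptions about how the four packages above are imported (for example, typed-argument-parser is imported as `tap`), and `find_missing_modules` is a hypothetical helper.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be imported here."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Assumed import names; adjust them to whatever app.py actually imports.
REQUIRED = ["tap", "requests", "healthcheck", "flask_swagger_ui"]

for name in find_missing_modules(REQUIRED):
    print(f"missing module: {name}")
```

Each name printed would then be installed with `pip` inside the activated environment before starting the server.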

The server ran successfully on localhost:5000


(/home/robel/ncats-adme/env) robel@DESKTOP-92JJ0KD:~/ncats-adme/server$ python app.py
Loading RLM graph convolutional neural network model...
---
...WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.

Running on all addresses (0.0.0.0)
Running on http://127.0.0.1:5000

Running on http://172.25.10.172:5000
Press CTRL+C to quit...


![server_url](https://github.com/ersilia-os/ersilia/assets/101357449/7d87d516-7a72-43f2-9609-a3cd0d0be079)

TsionZerihun commented 8 months ago

Run predictions for the EML (Task 3) :heavy_check_mark:

I opened localhost:5000 → Predict → unchecked all models except RLM → Text File → Browse file → Process File
I downloaded the csv file with the list of drugs to be tested from here
- Set "has header" to "Yes" and the SMILES column number to "1"
It was mind blowing to see 442 drugs being tested for stability in rat liver microsomes within minutes 💥

What this outcome indicates

| Smiles | Prediction | Probability |
|---|---|---|
| Chemical notation of the drug in a form that can be used by the computer | Unstable: the drug undergoes biotransformation in less than 30 min. Stable: more than 30 min | The probability of the predicted class (stability or instability of the drug) |

Here is the csv file of the prediction. ADME_Predictions_2023-10-27-164557.csv

I was curious to see what percentage of the drugs were stable and saw that 256 of the 442 were stable (around 57.9%). I filtered the stable and unstable drugs into two tabs and attached the file here.
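The same tally can be done in a few lines of code rather than with spreadsheet filters. This is a small sketch under stated assumptions: the label strings `"stable"` and `"unstable"` are stand-ins for whatever the Prediction column actually contains, and `label_percentages` is a hypothetical helper.

```python
from collections import Counter


def label_percentages(labels):
    """Return {label: percentage of rows carrying that label}."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical stand-in for the Prediction column: 256 stable rows out of 442.
labels = ["stable"] * 256 + ["unstable"] * (442 - 256)
shares = label_percentages(labels)
print(f"stable: {shares['stable']:.1f}%")
```

In practice the list of labels would be read from the Prediction column of the downloaded CSV instead of being built by hand.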

ADME_Predictions_Stable_Unstable.csv.xlsx

TsionZerihun commented 8 months ago

Compare results with the Ersilia Model Hub implementation! (Task 4) :heavy_check_mark:

I searched for the RLM model in the Ersilia repo (eos5505)
- In my terminal I ran `cd ersilia`, activated the Ersilia environment with `conda activate ersilia`, and fetched the RLM model from Ersilia's repo using `ersilia -v fetch eos5505`

Understanding Ersilia's backend.

I am currently using Git LFS to fetch the model, as shown below in the terminal

(ersilia) robel@DESKTOP-92JJ0KD:~/ersilia$ ersilia -v fetch eos5505
⬇️  Fetching model eos5505: ncats-rlm
17:37:59 | DEBUG    | Initialized with URL: None
17:37:59 | DEBUG    | Trying to find an available URL where the model is hosted
17:38:07 | DEBUG    | Git LFS is installed
Updated Git hooks.
Git LFS initialized.
17:38:07 | DEBUG    | Git LFS has been activated
17:38:08 | DEBUG    | Conda is installed...

To test the model I selected the SMILES for `abacavir` and ran a prediction

The model ran successfully, returning a probability of 0.049 for abacavir

(ersilia) robel@DESKTOP-92JJ0KD:~/ersilia$ ersilia -v api predict -i "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
11:26:10 | DEBUG    | Getting session from /home/robel/eos/session.json...
-----------------------------------
{
    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
        "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
    },
    "output": {
        "outcome": [
            0.049
        ]
    }
}
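For later comparison it helps to pull the number out of that JSON programmatically. The sketch below parses the result object printed above with Python's standard `json` module; `extract_outcome` is a hypothetical helper, and it assumes every result has the same `input`/`output` layout as this one.

```python
import json

# Output copied from the run above.
raw_output = """
{
    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
        "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
    },
    "output": {
        "outcome": [0.049]
    }
}
"""


def extract_outcome(result_text):
    """Parse one result object and return (input SMILES, first outcome value)."""
    record = json.loads(result_text)
    return record["input"]["input"], record["output"]["outcome"][0]


smiles, probability = extract_outcome(raw_output)
print(smiles, probability)
```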

I placed the Essential Medicines List inside the assets folder and the prediction ran successfully

(ersilia) robel@DESKTOP-92JJ0KD:~/ersilia$ ersilia -v api predict -i ./assets/eml_canonical.csv -o ersilia_prediction.csv
11:38:22 | DEBUG    | Getting session from /home/robel/eos/session.json...
-----------------
12:11:05 | DEBUG    | Status code: 200
12:11:05 | DEBUG    | Done with unique posting
12:11:17 | DEBUG    | Data: outcome
12:11:17 | DEBUG    | Values: [0.049]
12:11:17 | DEBUG    | Datatype: numeric_array
ersilia_prediction.csv
(ersilia) robel@DESKTOP-92JJ0KD:~/ersilia$

I have attached ersilia's model prediction ersilia_prediction.csv

TsionZerihun commented 8 months ago

Compare results with the Ersilia Model Hub implementation! (Cont'd) :heavy_check_mark:

I opened the two prediction files in Excel and filtered the ADME model's unstable predictions, which are 189 in total
Next I followed the steps below
- Brought only the prediction (probability) column from the two files
- From the ADME prediction cell, removed the parentheses and the "1" (the stability value ADME predicted) using the Excel =Remove() formula
- Checked the diff between the models by subtracting the Ersilia value from the ADME value
- Checked using =COUNTIF() how many of the diffs are > 0.1, which turned out to be 8 out of 189

Summary of comparison
- The above comparison shows that the two predictions agree to within 0.1 for about 96% of the drugs (with about 4% of the diffs being > 0.1), so they are similar in terms of predicting the drugs' instability probability.

I have attached the comparison file below: ADME_Ersilia_Comparsion.xlsx
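The spreadsheet steps above can also be expressed as a short script. This is a minimal sketch, not the procedure actually used: the probabilities are made-up illustration values, `count_large_differences` is a hypothetical helper, and it compares absolute differences, which is an assumption (the spreadsheet subtracted one column from the other).

```python
def count_large_differences(values_a, values_b, threshold=0.1):
    """Count paired values whose absolute difference exceeds threshold."""
    if len(values_a) != len(values_b):
        raise ValueError("both prediction lists must have the same length")
    return sum(1 for a, b in zip(values_a, values_b) if abs(a - b) > threshold)


# Made-up probabilities for illustration only; they are not the real predictions.
adme = [0.90, 0.75, 0.62, 0.55]
ersilia = [0.88, 0.50, 0.60, 0.56]

n_large = count_large_differences(adme, ersilia)
share = 100.0 * n_large / len(adme)
print(f"{n_large} of {len(adme)} differ by more than 0.1 ({share:.1f}%)")
```

Applied to the real columns, the same ratio is what turns "8 out of 189" into a percentage.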

TsionZerihun commented 8 months ago

Key takeaways from completing above tasks

- Lesson Learned

in vitro: Latin for "in glass"; medical procedures or tests performed outside of a living organism, such as in a test tube or petri dish

in vivo: research done on a living organism

- Summary

The above tasks run the NCATS Rat Liver Microsomal Stability (RLM) model and its Ersilia Model Hub implementation (eos5505) and compare their outputs.

TsionZerihun commented 8 months ago

Install and run Docker! :heavy_check_mark:

I was able to install Docker and run the hello-world image


@DESKTOP-92JJ0KD:/home/robel# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
719385e32844: Pull complete
Digest: sha256:88ec0acaa3ec199d3b7eaf73588f4518c25f9d34f58ce9a0df68429c5af48e8d
Status: Downloaded newer image for hello-world:latest

Hello from Docker! This message shows that your installation appears to be working correctly.

For more examples and ideas, visit: https://docs.docker.com/get-started/

@DESKTOP-92JJ0KD:/home/robel#

TsionZerihun commented 8 months ago

Optional Task: (WIP)

I noticed that there was an optional task to replace the entire NCATS-ADME server with a Python script, and I tried to follow these steps to achieve it
Clean code
- I cleaned the app.py file by removing the code used for other models
- I replaced app.py with the new (cleaned) file and ran python app.py to check that it still works, which it does. Here is a preview of the file
```diff
- from predictors.hlm.hlm_predictor import HLMPredictior
- from predictors.pampa.pampa_predictor import PAMPAPredictior
- from predictors.pampa50.pampa_predictor import PAMPA50Predictior
- from predictors.pampabbb.pampa_predictor import PAMPABBBPredictior
- from predictors.solubility.solubility_predictor import SolubilityPredictior
- from predictors.liver_cytosol.lc_predictor import LCPredictor
- from predictors.cyp450.cyp450_predictor import CYP450Predictor
```
```diff
- if model.lower() == 'hlm':
-     predictor = HLMPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
  if model.lower() == 'rlm':
      predictor = RLMPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
- elif model.lower() == 'pampa':
-     predictor = PAMPAPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
- elif model.lower() == 'pampa50':
-     predictor = PAMPA50Predictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
- elif model.lower() == 'pampabbb':
-     predictor = PAMPABBBPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
- elif model.lower() == 'solubility':
-     predictor = SolubilityPredictior(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
- elif model.lower() == 'hlc':
-     predictor = LCPredictor(kekule_smiles = working_df['kekule_smiles'].values, smiles=working_df[smi_column_name].values)
- elif model.lower() == 'cyp450':
-     predictor = CYP450Predictor(kekule_mols = working_df['mols'].values, smiles=working_df[smi_column_name].values)
```
```diff
  if model.lower() != 'cyp450':
      # for all models except cyp450, calculate the nearest neigbors and add additional column to response_df
      try:
          sim_vals = get_similar_mols(response_df[smi_column_name].values, model.lower())
          sim_series = pd.Series(sim_vals).round(2).astype(str)
          response_df['Tanimoto Similarity'] = sim_series.values
          columns_dict['Tanimoto Similarity'] = { 'order': 3, 'description': 'similarity towards nearest neighbor in training data', 'isSmilesColumn': False }
      except Exception as e:
          app.logger.error('Error calculating similarity')
          app.logger.error(f'error type: {type(e)}')
          app.logger.error(e)
- else:
-     # for cyp450 models, a similarity value is calculated using a global dataset that is representative of all six cyp450 endpoints
-     try:
-         sim_vals = get_similar_mols(response_df[smi_column_name].values, model.lower())
-         sim_series = pd.Series(sim_vals).round(2).astype(str)
-         response_df['Tanimoto Similarity'] = sim_series.values
-         columns_dict['Tanimoto Similarity'] = { 'order': 7, 'description': 'similarity towards nearest neighbor in training data that was obtained by combining the compounds from all six individual datasets', 'isSmilesColumn': False }
-     except Exception as e:
-         app.logger.error('Error calculating similarity')
-         app.logger.error(f'error type: {type(e)}')
-         app.logger.error(e)
```
I went to the model folder and removed all other models except for RLM
```
@DESKTOP-92JJ0KD:~/ncats-adme/server/models$ ls
rlm
(base) robel@DESKTOP-92JJ0KD:~/ncats-adme/server/models$
```
I decided to put the task on hold because
1. I figured I would prioritize the things that are required of me now that I have limited time and come back to this later.
2. I wanted to check with my mentors regarding my approach and understanding of the question before moving forward

Planned to do

- Update the code to take arguments and run directly, instead of serving with Flask

- Update the code to take arguments for the output's location.
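One possible shape for that planned script is sketched below with `argparse` from the standard library. Everything here is an assumption for illustration: the flag names, the CSV layout (a header row plus a SMILES column), and especially `predict_rlm`, which is a hypothetical placeholder where the real RLM predictor from the NCATS-ADME code would be called.

```python
import argparse
import csv


def predict_rlm(smiles_list):
    """Hypothetical placeholder for the RLM predictor; returns None per molecule."""
    return [None for _ in smiles_list]


def build_parser():
    parser = argparse.ArgumentParser(description="Run RLM predictions without the Flask server")
    parser.add_argument("-i", "--input", required=True, help="CSV file with a SMILES column")
    parser.add_argument("-o", "--output", required=True, help="where to write the predictions")
    parser.add_argument("--smiles-column", type=int, default=0, help="zero-based SMILES column index")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    with open(args.input, newline="") as handle:
        rows = list(csv.reader(handle))
    header, body = rows[0], rows[1:]  # assumes the file has a header row
    smiles = [row[args.smiles_column] for row in body]
    predictions = predict_rlm(smiles)
    with open(args.output, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([header[args.smiles_column], "prediction"])
        writer.writerows(zip(smiles, predictions))
```

Calling `main(["-i", "eml.csv", "-o", "predictions.csv"])`, or `main()` from a script entry point so that the command-line arguments are used, would read the input file and write one row per molecule to the chosen output location. The file names in that call are examples only.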

TsionZerihun commented 8 months ago

Steps I followed to select model suggestions

APPROACH-1: Find models related to a particular pathogen.
1. I looked for highly infectious diseases that are neglected, or that are emerging but not given much attention
2. I selected these 5 diseases: dengue, trachoma, tuberculosis, mycetoma, and schistosomiasis.
3. I started searching for ML models associated with them, but after obtaining some models for mycetoma and schistosomiasis, I noticed that they had already been added to Ersilia's model hub.
  - Mycetoma
  - Schistosomiasis
  - Finally, I narrowed down my search to dengue and trachoma
APPROACH-2: Look for drug-related models
- It was interesting to see the different ways a model can assist or speed up drug discovery, which include predicting:
  - The drug's activity against the organism that causes the disease (pathogen)
  - The drug's toxicity to cells (cytotoxicity)
  - The pharmacological properties of the lead molecules (drugs)
  - The binding affinity of potential molecules to the target protein
  - Genetic changes in pathogenic microorganisms, and many more
Filtering searches: I selected models using the filtering strategy listed below.
- Has free-to-use, open-source code
- Works with SMILES notation when accepting molecules or drugs (inputs are small-molecule compounds)
- Has clear documentation, preferably dated 2020 or later if it was cited
- Is not already included in the Ersilia Model Hub or its backlog
- Can be installed and used smoothly
Finally, I started looking for models using the recommended resources (i.e. Papers with Code and others) here

TsionZerihun commented 8 months ago

Suggest a new model and document it: First Model :heavy_check_mark:

About the model
Name: Drug Combination (Graph Set) Generation - Deep Generative Models for given Hierarchical Disease Network Embedding

Source code: https://github.com/Shen-Lab/Drug-Combo-Generator/tree/master

Publication: https://pubmed.ncbi.nlm.nih.gov/32657357/

Description: Drug-combo-generator is a deep generative model for drug combination design, by jointly embedding graph-structured domain knowledge and iteratively training a reinforcement learning-based chemical graph-set designer.

Why it would be relevant to Ersilia:
- Combination drug therapy combines drugs to enhance efficacy and is used to treat diseases such as tuberculosis, HIV, and various cancers. Since infections are heterogeneous and the bacteria can be in different stages of development, more than one drug may be required.
- This could be important for Ersilia, as it focuses on infectious and neglected diseases. This model makes it easier to target infectious diseases at multiple stages. Using a drug combination generator has the following advantages:
  - One drug may target one pathway, another drug another pathway.
  - Sometimes drugs enhance the efficacy of each other; this is called synergy. Synergistic combinations of drugs can help by:
    - Allowing lower drug doses
    - Minimizing side effects
    - Clearing the infection more efficiently
  - Overcoming resistance to antibiotics, antimicrobials, and anti-cancer drugs.

Model implementation :

Installation guide here

Install rdkit, mpi4py, networkx, OpenAI baseline dependencies, customized molecule gym environment

  ~$ conda create -c rdkit -n my-rdkit-env rdkit
  ~$ conda install mpi4py
  ~$ pip install networkx==1.11
  # OpenAI baseline dependencies
  ~$ cd rl-baselines
  ~$ pip install -e .
  # customized molecule gym environment
  ~$ cd gym-molecule
  ~$ pip install -e .

Running model

  ~$ python run_drug_comb_generator.py --disease_id=42

The model meets every item on the checklist below, so I would suggest it.
- Has free-to-use, open-source code :heavy_check_mark:
- Works with SMILES notation when accepting molecules or drugs (inputs are small-molecule compounds) :heavy_check_mark:
- Has clear documentation, preferably dated 2020 or later if it was cited :heavy_check_mark:
- Is not already included in the Ersilia Model Hub or its backlog :heavy_check_mark:

TsionZerihun commented 8 months ago

Suggest a new model and document it: Second Model :heavy_check_mark:

About the model

Name: DeepDTA: deep drug-target binding affinity

Source code: https://github.com/XuanLin1991/DeepGS

Publication: https://arxiv.org/abs/2003.13902

Description: Modeling of protein sequences and compound 1D representations (SMILES) with convolutional neural networks (CNNs) to predict the binding affinity value of drug-target pairs.

Why it would be relevant to Ersilia:
- The strength of the interaction between a drug and its target is known as binding affinity. Drug-target affinity prediction, a crucial step in virtual screening, directly impacts drug development progress.
- This could be important for Ersilia, to identify or filter out suitable candidate drugs by providing information on the strength of the interaction between a drug-target (DT) pair

Model implementation :

Create a new environment.

  ~$ conda create -n deepgs python=3.7.6
  ~$ source activate deepgs
- Clone repository and install requirements.
 ```console
  ~$ git clone https://github.com/jacklin18/DeepGS.git
  ~$ cd DeepGS
  ~$ pip install -r requirements.txt
 ```

The model meets every item on the checklist below, so I would suggest it.
- Has free-to-use, open-source code :heavy_check_mark:
- Works with SMILES notation when accepting molecules or drugs (inputs are small-molecule compounds) :heavy_check_mark:
- Has clear documentation, preferably dated 2020 or later if it was cited :heavy_check_mark:
- Is not already included in the Ersilia Model Hub or its backlog :heavy_check_mark:

TsionZerihun commented 8 months ago

Suggest a new model and document it: Third Model :heavy_check_mark:

About the model

Name: DrugChat

Source code: https://github.com/ucsd-ai4h/drugchat

Publication: https://www.techrxiv.org/articles/preprint/DrugChat_Towards_Enabling_ChatGPT-Like_Capabilities_on_Drug_Molecule_Graphs/22945922

Description: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs

Why it would be relevant to Ersilia:
- Since the release of ChatGPT, access to many resources has improved tremendously
- This model is an attempt at enabling ChatGPT-like capabilities on drug molecule graphs, by developing a prototype system, DrugChat. This could be important for Ersilia in pharmaceutical research by:
  - Accelerating drug discovery
  - Enhancing our understanding of structure-activity relationships
  - Guiding lead optimization
  - Aiding drug repurposing
  - Reducing the failure rate and streamlining clinical trials

Model implementation :

Clone repository and install requirements.

  ~$ git clone https://github.com/UCSD-AI4H/drugchat
  ~$ cd drugchat
  ~$ conda env create -f environment.yml
  ~$ conda activate drugchat

The model meets every item on the checklist below, so I would suggest it.
- Has free-to-use, open-source code :heavy_check_mark:
- Works with SMILES notation when accepting molecules or drugs (inputs are small-molecule compounds) :heavy_check_mark:
- Has clear documentation, preferably dated 2020 or later if it was cited :heavy_check_mark:
- Is not already included in the Ersilia Model Hub or its backlog :heavy_check_mark:

TsionZerihun commented 8 months ago

Suggest a new model and document it: Additional Models :heavy_check_mark:

Additional Models
- I also found some interesting additional models.

  Name: toxAIcity

  Source code: https://github.com/subhasishgoswami/toxAIcity

  Description: A long short-term memory (LSTM) based classifier that classifies whether new drug candidates are toxic, using Simplified Molecular-Input Line-Entry System (SMILES) notation.

  Name: ChemCPA

  Source code: https://github.com/theislab/chemCPA

  Description: Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution

Out-of-scope models related to neglected and infectious diseases.
- While I filtered neglected diseases and looked for models related to them, I found these interesting models but decided not to include them in my suggestions because they do not accept SMILES as input
  - Chagas

    Name: Chagas detection

    Source code: https://github.com/csbl-br/chagas_detection

    Description: Chagas detection is a tool for the detection of T. cruzi trypomastigote forms in blood smear images. In particular, it provides a machine learning based method for the detection of parasites in images acquired using a mobile phone camera

  - Leptospirosis

    Name: lepto-classifier

    Source code: https://github.com/sf-deng/lepto-classifier

    Description: An SVM-based binary classifier to detect leptospirosis in dogs

GemmaTuron commented 7 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
