ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Leila Yesufu #820

Closed leilayesufu closed 11 months ago

leilayesufu commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

leilayesufu commented 1 year ago

Today, 2nd of October 2023, I was accepted into the Outreachy contribution stage and chose to contribute to the Ersilia Open Source Initiative project. I'm very excited to learn and grow during this contribution stage.

Task 1: Join the communication channels

I joined the Slack workspace and introduced myself in the general channel.

Task 2: Open a GitHub issue (this one!)

Opened the issue. Done :)

leilayesufu commented 1 year ago

Task 3: Install the Ersilia Model Hub and test the simplest model (October 2nd, 2023)

Following the instructions here.

I installed the Ersilia Model Hub from the command-line interface. I'm using a Windows device, so I had to use the Windows Subsystem for Linux (WSL) with an Ubuntu terminal.

There are prerequisites to be installed before downloading the Model Hub, as follows.

  1. Install the gcc compiler on the Ubuntu terminal. This can be done by running:

     ```
     sudo apt-get update
     sudo apt install build-essential
     ```

     The gcc compiler has been successfully installed.

  2. Install Python and Conda. I installed Miniconda using the following commands:

     ```
     mkdir -p ~/miniconda3
     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
     bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
     rm -rf ~/miniconda3/miniconda.sh
     ~/miniconda3/bin/conda init bash
     ~/miniconda3/bin/conda init zsh
     ```

     Conda has been successfully installed. This can be verified with `conda --version`.

  3. Install Git and GitHub CLI

     The next step is to configure Git and the GitHub CLI in Conda. This can be done with the following commands:

     ```
     conda install gh -c conda-forge   # install the GitHub CLI
     gh auth login                     # authenticate with your credentials
     ```

     Follow the prompts and you should be logged in to GitHub.
  4. Install and activate Git LFS

     This can be done with the following commands:

     ```
     conda install git-lfs -c conda-forge
     git-lfs install
     ```
  5. Install the Isaura data lake

     ```
     # activate ersilia's conda environment (see instructions below to create it)
     conda activate ersilia
     python -m pip install isaura==0.1
     ```

  6. Docker: I already had Docker installed and running on my local machine.

Main Installation.

  1. Install Ersilia

     After all the prerequisites are met, we are ready to install the Ersilia tool. In your terminal, run the following commands to create and activate the environment:

     ```
     conda create -n ersilia python=3.7
     conda activate ersilia
     ```

     Then, simply install the Ersilia Python package:

     ```
     git clone https://github.com/ersilia-os/ersilia.git
     cd ersilia
     pip install -e .
     ```

     The Ersilia environment has been successfully installed.

Check that the CLI works on your terminal and explore the available commands:

```
ersilia --help
ersilia catalog
```

We have successfully installed the Ersilia Model Hub.

leilayesufu commented 1 year ago

Task 3: Install the Ersilia Model Hub and test the simplest model (continued)

In this part we're going to test a very simple model with Ersilia. Run the following commands:

```
ersilia -v fetch eos3b5e
ersilia serve eos3b5e
```

The first two commands ran without errors.

The calculate call, however, gave me some errors:

```
ersilia -v api calculate -i "CCCC"
```

I tried it at first and got an error. Upon further inspection of the logs, I noticed that the KeyError at the end of the traceback occurred because the code was trying to access the dictionary key 'calculate' in the schema but could not find it.

Upon double-checking the schema file at /home/leila/eos/dest/eos3b5e/api_schema.json, I confirmed that there was no calculate key, only a run key. I then updated the schema using values from the logs.
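A quick way to see which APIs a model's schema actually defines is to list its top-level keys. A minimal sketch using the standard json module — the in-memory schema below is only a stand-in that I'm assuming mirrors the shape of the real file:

```python
import json

def available_apis(schema_path):
    """Return the top-level API names (e.g. 'run') defined in an api_schema.json file."""
    with open(schema_path) as f:
        return sorted(json.load(f).keys())

# Stand-in for the real schema, which only defined a "run" API:
example_schema = {"run": {"input": {"key": "smiles"}, "output": {"mw": {}}}}
print(sorted(example_schema))          # → ['run']
print("calculate" in example_schema)   # → False, hence the KeyError
```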


I then ran it again and got another error. This one wasn't a KeyError; it was caused by the code trying to read input columns from an object that had no len().


Opening the code at /home/leila/ersilia/ersilia/io/readers/file.py, in the read_input_columns function at line 321, I changed the code from if len(h) == 1: to if h is not None and len(h) == 1:
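The effect of that guard can be checked in isolation. A minimal sketch of the pattern (not Ersilia's actual reader code):

```python
def is_single_column(h):
    # len(None) would raise TypeError, so check for None first;
    # the "and" short-circuits and returns False when h is None.
    return h is not None and len(h) == 1

print(is_single_column(["smiles"]))   # → True
print(is_single_column(None))         # → False instead of crashing
print(is_single_column(["a", "b"]))   # → False
```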

This change ensures that the length check is only performed when h is not None; if h is None, the code proceeds accordingly. After making these changes, I ran the command again:

```
ersilia -v api calculate -i "CCCC"
```

and got an output whose value was null.

I had gotten the input section but not the output. Upon further troubleshooting, I found my mistake in the schema file at /home/leila/eos/dest/eos3b5e/api_schema.json: when creating the calculate entry, I had used "result" instead of "mw". Even after changing that, I'm still finding it difficult to get the output value.

HellenNamulinda commented 1 year ago

Hello @leilayesufu, you don't have to change the schema, as it is generated automatically when you fetch and serve the model. Instead of using calculate as the API, use run. For example, instead of `ersilia -v api calculate -i "CCCC"`, use `ersilia -v run -i "CCCC"`

leilayesufu commented 1 year ago

Seen, Thank you very much

leilayesufu commented 1 year ago

After a correction by @HellenNamulinda: she informed me that I could have used `ersilia -v api run -i "CCCC"` instead of adding my own key to the schema and trying `ersilia -v api calculate -i "CCCC"`.

Following this, I removed the environment and created it again. After running `ersilia -v run -i "CCCC"`, I got the output.

NOTE: I didn't alter the codebase after installing it back.


The log file can be seen here: run.log

Task 3 has been completed successfully. Thank you @HellenNamulinda

leilayesufu commented 1 year ago

**Task 4: Write a motivation statement to work at Ersilia October 3rd 2023**

Motivation Statement to work at Ersilia

Hi, my name is Leila Yesufu, and I completed an Electrical Engineering degree in 2022.
I am a DevOps engineer with experience in data science and ML/AI. I applied for Outreachy to learn more about open source. Upon receiving the initial congratulatory email from Outreachy for the contribution stage, I was excited to begin contributing and learning. I browsed through the available projects, and the Ersilia project caught my attention; I decided there and then that it was the only project I was going to contribute to. I read through the website and was further intrigued by the Ersilia Open Source Initiative and its commitment to building positive change in healthcare accessibility using data and AI/ML.

The Ersilia Open Source Initiative also resonates with me on a personal level, as it reflects my passion for both healthcare and technology. I am from Nigeria, a middle-income country, and I've experienced firsthand the challenges that communities face in accessing tools for disease research, such as during the Ebola crisis of 2014.

I really appreciate the value that this project can bring to the world, and I would love to work towards bringing that value.

The project's roadmap focuses on access to models and building capacity in data science. I also read that Ersilia is currently setting up a sustainable cloud infrastructure (AWS) to enable online ML model inference, which I am experienced in. This presents a chance for me to showcase and enhance my skills in Python, JavaScript, Git, machine learning, deep learning, Conda, Google Colab, Django, Docker, Kubernetes, and AWS ML services such as SageMaker. I am also open and eager to learn more technologies during the internship, expanding my knowledge beyond what I currently know.

If granted the internship, I will be committed to fully immersing myself in projects, collaborating with the Ersilia team, and contributing meaningfully. I hope to engage actively to maximize my learning and impact, to learn more about the artificial intelligence and machine learning field, and hopefully to build a career in it.

Post-internship, I hope not only to have made valuable contributions to Ersilia but also to continue being part of Ersilia's team and of other projects and communities that use technology for social good. I see this internship as a stepping stone towards a life where I can actively contribute to projects that address real-world challenges in both healthcare and technology.

leilayesufu commented 1 year ago

Task 5: Submit your first contribution to the Outreachy site

I submitted an application to Ersilia through the Outreachy website and linked this issue as part of my contribution.

DhanshreeA commented 1 year ago

Hi @leilayesufu, great work! Thanks for the updates. :)

leilayesufu commented 1 year ago

Thank youuu! @DhanshreeA

leilayesufu commented 1 year ago

@DhanshreeA I've successfully completed Week 1 - Get to know the community. Do I go ahead to Week 2 - Install and run an ML model?

DhanshreeA commented 1 year ago

Hi @leilayesufu yes absolutely. Thanks for the updates. You can go ahead and get started with week 2 tasks.

leilayesufu commented 1 year ago

Week 2 - Install and run an ML model

Task 6: Select a model from the suggested list

After going through the proposed models, I selected Plasma Protein Binding (IDL-PPBopt): https://github.com/Louchaofeng/IDL-PPBopt

After reading the publication for this model (https://pubs.acs.org/doi/10.1021/acs.jcim.2c00297), I learnt that the objective is to predict the human plasma protein binding (PPB) property of compounds using an interpretable deep learning method.

This sparked my interest because, with an interpretable deep learning method, the model not only provides predictions but also insight into the factors that influence PPB. I am fascinated by the potential impact on drug development and the scientific insights that can be gained from these predictions, and I am also curious to know whether these drugs, once inside our bodies, will stick to plasma proteins or move freely.

leilayesufu commented 1 year ago

TASK 7: Install the model in your system

To install this model

I created a conda environment named 'IDL-PPBopt' and activated it using the following commands:

```
conda create -n IDL-PPBopt python=3.7
conda activate IDL-PPBopt
```

Then I installed the packages the model requires:

```
conda install pytorch==1.5.0 torchvision cpuonly -c pytorch
```

Although the repository author didn't list pandas and matplotlib, I also installed them.

I viewed all my installed packages with `conda list`; the output is here: conda_list.txt

DhanshreeA commented 1 year ago

> Week 2 - Install and run an ML model
>
> Task 6: Select a model from the suggested list
>
> After going through the proposed models, i selected NCATS Rat Liver Microsomal Stability: https://github.com/ncats/ncats-adme
>
> After reading this publication for the model https://pubs.acs.org/doi/10.1021/acs.jcim.2c00297 I learnt that the objective is to predict the human plasma protein binding (PPB) property of compounds with the use of an interpretable deep learning method.
>
> This sparked my interest because the use of interpretable deep learning method the model will not only provide predictions but it will also provide understanding into factors that influence PPB.

Hi @leilayesufu thanks for the updates. However there's a slight confusion in your comment. You mention selecting "NCATS Rat Liver Microsomal Stability", however it appears you have worked with "Plasma Protein Binding (IDL-PPBopt)" instead. These are two different models.

A side note, pandas, and matplotlib were not mentioned as explicit dependencies within the original repo because they are requirements for scikit-learn, so anytime you install scikit-learn, those two dependencies get installed as well.

leilayesufu commented 1 year ago

Ah, Thank you so much for the correction!!! I'm working on Plasma Protein Binding (IDL-PPBopt): https://github.com/Louchaofeng/IDL-PPBopt

I'll update it now. Thank you for the information on scikit-learn as well

leilayesufu commented 1 year ago

To test the IDL-PPBopt model, I located the Jupyter notebook (.ipynb) used to run the code, seen here. I then extracted all the steps needed to predict values and saved them in a Python file I created and named model.py; I did this because I wanted to use plain Python for the predictions.

The code was still dependent on CUDA and a GPU, so I had to edit the model.py file I created.

I first set the device with a CPU fallback:

```
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```

I then edited all the code dependent on CUDA: to load the model, I changed model.cuda() to model.to(device).

Then I tried running the file and got the error shown here. This was because the files used imports that depend on IPython from Jupyter, so I had to comment out all the lines dependent on IPython.

After this was done, I tried running it again and got the error shown in the log here.

To fix this, I changed

```
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt')
```

to

```
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location=torch.device('cpu'))
```

Then I ran it again and it finally worked.

view log file here.

View the CSV output of the default input file in the Plasma Protein Binding (IDL-PPBopt) model here LeilaYesufuPredictions.csv

The IDL-PPBopt model has been installed and run on my system. Task 7: Install the model in your system has been completed successfully

leilayesufu commented 1 year ago

To run predictions for the Essential Medicines List with the IDL-PPBopt model:

I inputted the given EML list into my model.py, edited the header from smiles to cano_smiles, then ran `python model.py`.

This is the successful log, and this is the predictions output log for IDL-PPBopt.txt

IDL-PPBopt model_eml_canonical_output.csv

Explanation of the result

The model uses an interpretable deep learning method to help us understand how likely a compound is to stick to plasma proteins. It takes CANO_SMILES as input and returns PPB fractions as output. A PPB fraction above 80% represents good affinity for plasma proteins, while a fraction below 40% represents low affinity.
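Those cutoffs can be expressed as a small helper. A sketch assuming the thresholds stated above (fraction ≥ 0.8 high affinity, < 0.4 low); the "moderate" label for the in-between band is my own, not from the publication:

```python
def ppb_affinity(fraction):
    """Classify a PPB fraction (0.0-1.0) using the thresholds described above."""
    if fraction >= 0.8:
        return "high affinity"
    if fraction < 0.4:
        return "low affinity"
    return "moderate affinity"   # in-between band, label assumed for illustration

for f in (0.94, 0.55, 0.2):
    print(f, ppb_affinity(f))
```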

leilayesufu commented 1 year ago

The model should be a regression model because it outputs continuous values.

leilayesufu commented 12 months ago

Task 4: Understand Ersilia's backend and run the EML list with Ersilia

Ersilia can run models by downloading them from GitHub (using Git LFS), from S3 buckets (the AWS backend), or as Docker containers.

I decided to challenge myself and run the model with Docker containers.

I located the model in the Ersilia Model Hub, then went to Docker Hub and got the command for pulling the image:

```
docker pull ersiliaos/eos22io
```

View the log here.

I ran this in detached mode, as seen.

After running the container, I inspected it here and found that it has a custom entrypoint specified and uses the shell sh.

So I accessed the shell with `docker exec -it <container-id> sh` and ran the predictions on the given EML file in it.

here are the logs

Explanation of the logs

I ran these two commands, as seen in the Dockerfile:

```
cd /root/
sh docker-entrypoint.sh
```

The model had already been served, so I used this command to download the EML list into the Docker shell:

```
wget https://raw.githubusercontent.com/ersilia-os/ersilia/master/notebooks/eml_canonical.csv
```

Then I ran the predictions on the downloaded file and saved the result in a new file called ersilia_output.csv:

```
ersilia -v api run -i eml_canonical.csv -o ersilia_output.csv
```

The file can be seen here.

ersilia_output.csv

I exited the Docker interactive shell and used the command below to copy the output generated inside the container to my home directory:

```
docker cp dbff41b896d1:/root/ersilia_output.csv /home/leila/ersilia/log_files
```

Comparison of the output from IDL-PPBopt and from running it with Ersilia

The IDL-PPBopt model with the EML: IDL-PPBopt.model_eml_canonical_output.csv
Running the EML with Ersilia (ersiliaos/eos22io): ersilia_output.csv

I checked every single result and they were identical, apart from rows 97 and 98: the IDL-PPBopt model output 0.93916506 and 0.9521944, while the Ersilia run output 0.9391651 and 0.95219433, a slight rounding difference.

It should also be noted that the Ersilia run produced 9 NaN results, for 442 rows in total, while the IDL-PPBopt result left the NaN rows out, leaving 433.

This is because, as seen in the IDL-PPBopt log here, 9 compounds could not be featurized.

From my understanding, ersiliaos/eos22io kept them with a NaN result while the IDL-PPBopt model left them out completely.
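The row 97-98 discrepancies sit at the 7th-8th decimal place, so a tolerance-based comparison treats the two runs as equal. A sketch using the values quoted above:

```python
import math

idl_ppbopt = [0.93916506, 0.9521944]    # rows 97, 98 from the local IDL-PPBopt run
ersilia_out = [0.9391651, 0.95219433]   # same rows from the ersiliaos/eos22io run

# agree to within 1e-6, so the difference is just floating-point rounding
matches = [math.isclose(a, b, abs_tol=1e-6) for a, b in zip(idl_ppbopt, ersilia_out)]
print(matches)  # → [True, True]
```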

I have successfully completed week 2 tasks.

DhanshreeA commented 12 months ago

Hi @leilayesufu, thank you for the detailed updates. Proteins are organic compounds (containing C-H or C-C bonds), whereas the nine compounds that are not featurized within the original code are inorganic compounds or salts. Since the input and output files for a model within Ersilia need to be of the same length, Ersilia simply outputs null for the molecules that don't produce an output (because they don't get featurized).
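That alignment behavior can be sketched in a few lines. This is illustrative only, not Ersilia's actual implementation, and the SMILES and prediction values are made up:

```python
def align_outputs(inputs, results):
    """Map each input molecule to its result, filling None for molecules
    that could not be featurized (mirroring Ersilia's null outputs)."""
    return [results.get(smiles) for smiles in inputs]

inputs = ["CCO", "[Na+].[Cl-]", "CCCC"]   # the salt fails featurization
results = {"CCO": 0.41, "CCCC": 0.12}     # hypothetical predictions
print(align_outputs(inputs, results))      # → [0.41, None, 0.12]
```

The output list always has the same length as the input list, which is exactly why the Ersilia CSV had 442 rows while the raw IDL-PPBopt output had 433.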

leilayesufu commented 12 months ago

Thank you very much!

leilayesufu commented 11 months ago

WEEK 3

Suggest a new model and document it (1)

Slug: CLAMP

Model title: Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language

Publication: https://arxiv.org/abs/2303.03363

Github repo: https://github.com/ml-jku/clamp#clamp-clamp

license: GPLv3

Code: Python

checkpoints provided: yes

Year: 2023

Description:

CLAMP (Contrastive Language-Assay Molecule Pre-training) is an ML model trained on molecule-bioassay pairs, and it can be queried in natural language: given a textual description of a bioassay, CLAMP predicts which molecules are most relevant to it.
From the publication, my understanding is that CLAMP can predict how well a particular molecule might work for a given task, like being a drug to treat a disease or having specific properties for a material, by understanding textual descriptions of tasks. You can give CLAMP a list of molecules (SMILES) and a text description, and it predicts which molecule best matches the description; a higher probability means the model believes the molecule is more suitable for the described bioassay.

Datasets used:

Relevance to Ersilia: I believe CLAMP is relevant to Ersilia because it is an AI/ML tool for drug discovery. It is also designed for zero-shot transfer learning, meaning the model can make predictions for molecules on bioassays it wasn't pretrained on, which can be used to explore new molecules for drugs.
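For intuition, CLAMP-style retrieval can be caricatured as ranking molecule embeddings by similarity to an assay-text embedding. This toy sketch uses made-up 3-d vectors and cosine similarity, not the real CLAMP encoders or checkpoints:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: one assay description vs. three candidate molecules.
assay = [0.9, 0.1, 0.3]
molecules = {"mol_a": [0.8, 0.2, 0.4], "mol_b": [0.1, 0.9, 0.0], "mol_c": [0.5, 0.5, 0.5]}

# Rank molecules by similarity to the assay description, most relevant first.
ranked = sorted(molecules, key=lambda m: cosine(assay, molecules[m]), reverse=True)
print(ranked)  # → ['mol_a', 'mol_c', 'mol_b']
```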

leilayesufu commented 11 months ago

Suggest a new model and document it (2)

Slug: DrugApp

Title: Machine learning-based prediction of drug approvals using molecular, physicochemical, clinical trial, and patent-related features

Publication: https://www.tandfonline.com/doi/abs/10.1080/17460441.2023.2153830

Source Code: https://github.com/fulyaciray/DrugApp

Description: DrugApp is a tool that uses machine learning and different data sources to predict the likelihood of regulatory approval for drugs. The tool uses data from clinical trials and patent-related information, and also considers the physical and chemical properties of the drug candidate molecules. A random forest classifier is used to build disease-group-specific approval prediction models. The disease groups the model predicts regulatory approval for include "Alimentary", "Infective", "Blood", "Dermatological", "Heart", "Hormonal", "Immunological", "Musculoskeletal", "Neoplasms", "Nervous", "Rare", "Respiratory", "Sensory", and "Urinary". The project contributes to drug discovery by helping scientists and drug makers predict whether their drug will be approved by regulatory bodies, saving time and resources in the drug discovery process.

License: GPL-3.0 license

Code: Python

Year: 2022

leilayesufu commented 11 months ago

Suggest a new model and document it (3)

Slug: RLBind

Title: RLBind: a deep learning method to predict RNA–ligand binding sites

Publication: https://www.researchgate.net/publication/365587894_RLBind_a_deep_learning_method_to_predict_RNA-ligand_binding_sites

Source Code: https://github.com/KailiWang1/RLBind/tree/main

Description: This model uses a deep convolutional neural network (CNN) to predict locations on RNA molecules where ligands are most likely to bind. This can be useful in drug design and drug discovery, and in using RNAs as targets for treating diseases. RNA (ribonucleic acid) is a nucleic acid present in all living cells.

License: Apache-2.0 license

Code: Python

leilayesufu commented 11 months ago

Running the models.

Model 1: CLAMP

I set up the environment using the commands in the GitHub repository; a YAML file was provided with all the dependencies, so running `conda env create -f env.yml` created the environment and installed all the Python and pip dependencies. I activated the environment, then put the pretrained-model example in a main.py file and ran `python main.py`.

I encountered the following error: cliperror.txt. This was due to clip being a dependency not mentioned in the environment file; it was fixed by installing the CLIP (Contrastive Language-Image Pre-training) package via `pip install git+https://github.com/openai/CLIP.git`

Then I ran the file again and got the required output here: output.txt

I then changed the contents of the pretrained-model example to use the first 4 SMILES of the EML file with the same bioassay and got this result: smiles.txt

leilayesufu commented 11 months ago

Model 2: DrugApp

To test this model, i followed the instructions to set up in the github repository and downloaded the required dependencies.

Then I navigated to the scripts directory and ran `python rf_model_for_predicting_drug_approval.py Rare` to test drug-approval prediction for the Rare disease group. I encountered the following error: error.txt. This was due to the file paths being wrong: the code used the Windows-style path separator (`\`) while I was running it on Linux (Ubuntu), so I changed it to the appropriate separator `/` in all the necessary file paths.
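Hard-coded `\` separators break on Linux; building paths with os.path.join or pathlib avoids the per-OS edit entirely. A sketch (the directory and file names are illustrative, not from the DrugApp repo):

```python
import os
from pathlib import Path

# Instead of a Windows-only literal like "datasets\\Rare\\features.csv",
# build the path portably with the current OS's separator:
portable = os.path.join("datasets", "Rare", "features.csv")
print(portable)

# pathlib equivalent; "/" here is an overloaded operator, not a literal separator
p = Path("datasets") / "Rare" / "features.csv"
print(p)
```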

I ran `python rf_model_for_predicting_drug_approval.py Rare` again and got the following log: rare_output.txt

That just shows the date and time, but it saved a results CSV file, as seen here:

results_prospective_analysis_Rare.csv

I also ran `python rf_model_for_predicting_drug_approval.py Blood`; here are the output and the log. Log: blood.txt Output: results_prospective_analysis_Blood.csv

The model also has a script for evaluation metrics and feature importances.

This can be run with `python evaluation_metrics_and_feature_importance.py Rare` (note: I changed the file-path separators in this file too).

The output is the performance scores and a plot of the permutation feature importance analysis, printed to the screen. Additionally, a CSV file is created in the results_feature_importances folder containing the MDI-based feature importance analysis results; the file name starts with results_feature_importances_MDI (e.g. results_feature_importances_MDI_Rare).


CSV FILE CREATED:
results_feature_importances_MDI_Rare.csv

leilayesufu commented 11 months ago

For this model (RLBind), I cloned the git repository and installed the listed requirements one by one. I didn't use the environment YAML file provided because it is for a GPU-enabled setup.

```
conda create -n RLBind python==3.7                             # create a new environment
conda install pytorch==1.5.0 torchvision cpuonly -c pytorch    # PyTorch, CPU only
pip install --upgrade torch
pip install numpy==1.16.4
pip install scikit-learn
pip install pandas
```

To test the model I ran:

```
cd ./src/
python predict.py
```

I got the following error, cuda.txt, due to the use of CUDA in the file, so I edited predict.py so it wouldn't depend on CUDA: I removed the line model = model.cuda() and changed model.load_state_dict(torch.load(model_file)) to model.load_state_dict(torch.load(model_file, map_location=torch.device('cpu')))

Then I ran `python predict.py` again and got the predictions, as seen below.

prediction.txt

leilayesufu commented 11 months ago

@DhanshreeA Hi, I'd love some feedback on the models I provided.

GemmaTuron commented 11 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!