Closed leilayesufu closed 11 months ago
Today, 2nd of October 2023, I got accepted into the Outreachy contribution stage and chose to contribute to the Ersilia Open Source Initiative project. I'm very excited to learn and grow during this contribution stage.
Task 1: Join the communication channels. I joined the Slack workspace and introduced myself in the general channel.
Task 2: Open a GitHub issue (this one!) Opened the issue. Done :)
Task 3: Install the Ersilia Model Hub and test the simplest model October 2nd 2023
Following the instructions here.
I installed the Ersilia Model Hub from the command-line interface. I'm using a Windows device, so I had to use the Windows Subsystem for Linux (WSL) with an Ubuntu terminal.
There are prerequisites to install before downloading the Model Hub, as follows.
Install the gcc compiler on Ubuntu terminal
This can be done by running the following:
sudo apt-get update
sudo apt install build-essential
The gcc compiler has been successfully installed.
Install Python and Conda
I installed Miniconda using the following commands.
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Conda has been successfully installed. This can be tested using conda --version
Install Git and GitHub CLI
The next step is to configure Git and the GitHub CLI via Conda. This can be done with the following commands.
conda install gh -c conda-forge
This installs the GitHub CLI.
gh auth login
This configures the CLI with your credentials.
Follow the prompts and you should be logged in to GitHub.
Install and Activate Git LFS
This can be done with the following command.
conda install git-lfs -c conda-forge
git-lfs install
Install the Isaura data lake.
# activate ersilia's conda environment (see instructions below to create it)
conda activate ersilia
python -m pip install isaura==0.1
Docker: I already had Docker installed and running on my local machine.
Main Installation.
Install Ersilia
After all the prerequisites are met, we are ready to install the Ersilia tool.
In your terminal, run the following commands to create and activate the environment
conda create -n ersilia python=3.7
conda activate ersilia
Then, simply install the Ersilia Python package.
git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
pip install -e .
The Ersilia environment has been successfully installed.
Check that the CLI works on your terminal, and explore the available commands
ersilia --help
ersilia catalog
We have successfully installed the Ersilia model hub.
Task 3: Install the Ersilia Model Hub and test the simplest model (continued). In this part we're going to test a very simple model with Ersilia. Run the following commands.
ersilia -v fetch eos3b5e
ersilia serve eos3b5e
The first two commands ran without errors.
The calculate API, however, gave me some errors.
ersilia -v api calculate -i "CCCC"
I tried it at first and got an error such as this. Upon further inspection of the logs, I noticed that the KeyError at the end of the traceback was because the code was trying to access a dictionary key 'calculate' in the schema but could not find it.
Upon double-checking the schema file at /home/leila/eos/dest/eos3b5e/api_schema.json, I confirmed that there was no calculate key, only a run key. I then updated the schema using values from the logs.
I then ran it again and got another error. This one wasn't a KeyError; it was a failure to read the input columns because the header object had no len().
Opening the code at /home/leila/ersilia/ersilia/io/readers/file.py, in the read_input_columns function at line 321, I changed the code from
I changed the code from
if len(h) == 1:
to
if h is not None and len(h) == 1:
This change ensures that the length check is only performed when h is not None; if h is None, the code proceeds accordingly. After making these changes, I ran the command
ersilia -v api calculate -i "CCCC"
and i got this
I have gotten the input section but not the output. Upon further troubleshooting, I found my mistake in the schema file at /home/leila/eos/dest/eos3b5e/api_schema.json: when creating the calculate key I had used "result" instead of "mw". After changing that, I get
I'm still finding it difficult to get the output value.
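As an aside, the guard I added follows a common defensive pattern; here is a minimal, self-contained sketch (a hypothetical helper, not the actual Ersilia reader code):

```python
# Hypothetical helper illustrating the None-guard fix: only call len()
# on the parsed header when it is not None.
def read_input_columns(header):
    """Return the single column name, or None when no header was parsed."""
    if header is not None and len(header) == 1:
        return header[0]
    return None

print(read_input_columns(["smiles"]))  # smiles
print(read_input_columns(None))        # None
```

With the guard in place, a missing header no longer raises `TypeError: object of type 'NoneType' has no len()`.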
Hello @leilayesufu,
You don't have to change the schema as it is generated automatically when you fetch and serve the model.
Instead of using calculate as the api, use run.
For example, instead of ersilia -v api calculate -i "CCCC"
, use ersilia -v run -i "CCCC"
Seen, thank you very much!
After a correction by @HellenNamulinda
She informed me that I could have used
ersilia -v api run -i "CCCC"
instead of adding my own key to the schema and trying ersilia -v api calculate -i "CCCC".
Following this, I removed the environment and created it again.
After running ersilia -v run -i "CCCC"
I got the output.
NOTE: I didn't alter the codebase after installing it back.
Task 3 has been completed successfully. Thank you @HellenNamulinda
**Task 4: Write a motivation statement to work at Ersilia (October 3rd 2023)**
Hi, my name is Leila Yesufu, and I completed an Electrical Engineering degree in 2022.
I am a DevOps engineer with experience in data science and ML/AI.
I applied for Outreachy to learn more about open source.
Upon receiving the initial congratulatory email from Outreachy for the contribution stage, I was excited to begin contributing and learning. I browsed through the available projects, and the Ersilia project caught my attention; I decided there and then that it was the only project I was going to contribute to. I read through the website and was further intrigued by the Ersilia Open Source Initiative and its commitment to building positive change in healthcare accessibility using data and AI/ML.
The Ersilia Open Source Initiative also resonates with me on a personal level, as it reflects my passion for both healthcare and technology. I am from Nigeria, a middle-income country, and I've experienced firsthand the challenges that communities face in accessing tools for disease research, such as during the Ebola crisis of 2014.
I really appreciate the value this project can bring to the world, and I would love to work towards delivering that value.
The project's roadmap focuses on access to models and building capacity in data science. I also read that Ersilia is currently setting up a sustainable cloud infrastructure (AWS) to enable online ML model inference, which I am experienced in. This presents a chance for me to showcase and enhance my skills in Python, JavaScript, Git, machine learning, deep learning, Conda, Google Colab, Django, Docker, Kubernetes, and AWS ML services such as AWS SageMaker. I am also open and eager to learn more technologies during the internship, expanding my knowledge beyond what I currently know.
If granted the internship, I will be committed to fully immersing myself in projects, collaborating with the Ersilia team, and contributing meaningfully. I hope to engage actively to maximize my learning and impact, to learn more about the artificial intelligence and machine learning field, and hopefully to build a career in it.
Post-internship, I hope not only to have made valuable contributions to Ersilia but also to remain part of Ersilia's team and of other projects and communities that use technology for social good. I see this internship as a stepping stone towards a life where I can actively contribute to projects that address real-world challenges in both healthcare and technology.
Task 5: Submit your first contribution to the Outreachy site
I submitted an application to Ersilia through the Outreachy website and linked this issue as part of my contribution.
Hi @leilayesufu, great work! Thanks for the updates. :)
Thank youuu! @DhanshreeA
@DhanshreeA I've successfully completed Week 1 - Get to know the community. Do I go ahead to Week 2 - Install and run an ML model?
Hi @leilayesufu yes absolutely. Thanks for the updates. You can go ahead and get started with week 2 tasks.
Task 6: Select a model from the suggested list
After going through the proposed models, I selected Plasma Protein Binding (IDL-PPBopt): https://github.com/Louchaofeng/IDL-PPBopt
After reading the publication for the model, https://pubs.acs.org/doi/10.1021/acs.jcim.2c00297, I learnt that the objective is to predict the human plasma protein binding (PPB) property of compounds using an interpretable deep learning method.
This sparked my interest because, with an interpretable deep learning method, the model will not only provide predictions but also insight into the factors that influence PPB. I am fascinated by the potential impact on drug development and the scientific insights that can be gained from these predictions, and I am also curious to know whether these drugs, once inside our bodies, will stick to plasma proteins or move freely.
To install this model
I created a conda environment named 'IDL-PPBopt' and activated it using the following commands:
conda create -n IDL-PPBopt python=3.7
conda activate IDL-PPBopt
Then I installed the following packages which the model requires:
conda install pytorch==1.5.0 torchvision cpuonly -c pytorch
Other required packages
conda install -c conda-forge openbabel==2.4.1
pip install rdkit==2022.9.4
pip install scikit-learn==1.0.2
pip install scipy==1.7.3
conda install -c conda-forge cairosvg==2.3.0
Although the repository author didn't list pandas and matplotlib, I also installed them using the following commands:
pip install pandas==1.3.5
pip install matplotlib==3.5.3
I viewed all my installed packages with the command
conda list
The output is here: conda_list.txt
Week 2 - Install and run an ML model
Task 6: Select a model from the suggested list
After going through the proposed models, I selected NCATS Rat Liver Microsomal Stability: https://github.com/ncats/ncats-adme
After reading the publication for the model, https://pubs.acs.org/doi/10.1021/acs.jcim.2c00297, I learnt that the objective is to predict the human plasma protein binding (PPB) property of compounds using an interpretable deep learning method.
This sparked my interest because, with an interpretable deep learning method, the model will not only provide predictions but also insight into the factors that influence PPB. I am fascinated by the potential impact on drug development and the scientific insights that can be gained from these predictions, and I am also curious to know whether these drugs, once inside our bodies, will stick to plasma proteins or move freely.
Hi @leilayesufu thanks for the updates. However there's a slight confusion in your comment. You mention selecting "NCATS Rat Liver Microsomal Stability", however it appears you have worked with "Plasma Protein Binding (IDL-PPBopt)" instead. These are two different models.
A side note: pandas and matplotlib were not mentioned as explicit dependencies within the original repo because they are requirements of scikit-learn, so anytime you install scikit-learn, those two dependencies get installed as well.
Ah, thank you so much for the correction! I'm working on Plasma Protein Binding (IDL-PPBopt): https://github.com/Louchaofeng/IDL-PPBopt
I'll update it now. Thank you for the information on scikit-learn as well.
To test the IDL-PPBopt model, I located the Jupyter notebook (.ipynb) used to run the code, seen here. I then extracted all the steps needed to predict values and saved them in a Python file I created and named model.py, because I wanted to use plain Python for the predictions.
The code was still dependent on CUDA and a GPU, so I had to edit the model.py file I created.
I first defined the device, falling back to CPU, using the following line:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
I then edited all the files dependent on CUDA.
To load the model, I changed the loading call from model.cuda()
to model.to(device)
Then I tried running the file and got the error shown here. This was because the files used imports dependent on IPython (from Jupyter), so I had to comment out all the lines dependent on IPython.
After this was done, I tried running it again and got the error shown in the log here.
To fix this, I changed the line of code
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt')
to
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt',map_location=torch.device('cpu'))
Then I ran it again and it finally worked.
View the log file here.
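The CUDA-to-CPU edits above follow a standard PyTorch pattern; here is a self-contained sketch (the model and checkpoint path are stand-ins for illustration, not IDL-PPBopt's actual code):

```python
import torch

# Pick the device once, move the model with .to(device) instead of
# .cuda(), and pass map_location when loading a checkpoint that was
# saved on a GPU machine.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = torch.nn.Linear(4, 1)   # stand-in for the real network
model.to(device)                # replaces model.cuda()

# Save and reload a checkpoint; map_location remaps GPU tensors to CPU
# when no GPU is available (illustrative path).
torch.save(model.state_dict(), '/tmp/demo_ckpt.pt')
state = torch.load('/tmp/demo_ckpt.pt', map_location=device)
model.load_state_dict(state)
```

With this pattern the same script runs unchanged on both GPU and CPU-only machines.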
View the CSV output of the default input file in the Plasma Protein Binding (IDL-PPBopt) model here LeilaYesufuPredictions.csv
The IDL-PPBopt model has been installed and run on my system. Task 7: Install the model in your system has been completed successfully
To run predictions for the Essential Medicines List with the IDL-PPBopt model:
I inputted the given EML list into my model.py, edited the heading from smiles to cano_smiles, then ran the command python model.py
This is the successful log, and this is the predictions output log for IDL-PPBopt.txt
IDL-PPBopt model_eml_canonical_output.csv
Explanation of the result
The model uses an interpretable deep learning method to help us understand how likely a compound is to stick to plasma proteins. The model takes CANO_SMILES as input and returns PPB fractions as output. A PPB fraction above 80% represents good affinity with plasma proteins, while a PPB fraction below 40% represents low affinity.
The model should be a regression model because it outputs continuous values.
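As a small illustration, the interpretation rule described above could be written as follows (the thresholds come from the write-up; the helper itself is hypothetical):

```python
# Toy interpretation helper for a predicted plasma-protein-binding
# fraction in [0, 1], using the 80% / 40% cut-offs described above.
def ppb_affinity(fraction):
    """Classify a predicted PPB fraction."""
    if fraction > 0.80:
        return "high affinity"
    if fraction < 0.40:
        return "low affinity"
    return "intermediate"

print(ppb_affinity(0.94))  # high affinity
print(ppb_affinity(0.25))  # low affinity
```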
Ersilia can run by downloading models from GitHub (using Git-LFS), from S3 buckets (our AWS backend) and by downloading models as Docker containers.
I decided to challenge myself and run the model from a Docker container.
I located the model in the Ersilia Model Hub. Then I went to Docker Hub and got the command for pulling the image:
docker pull ersiliaos/eos22io
View the log here.
I ran this in detached mode, as seen.
After running the container, I inspected it here and found that the container has a custom entrypoint specified and that it uses the shell sh.
So I accessed the shell with the following command: docker exec -it containerid sh
and ran the predictions on the given EML file inside it.
Here are the logs.
Explanation of the logs:
I ran these two commands, as seen in the Dockerfile:
cd /root/
sh docker-entrypoint.sh
The model had already been served. I then used this command to download the EML list into the Docker shell:
wget https://raw.githubusercontent.com/ersilia-os/ersilia/master/notebooks/eml_canonical.csv
Then I ran this command to run the predictions on the downloaded file and save the result in a new file called ersilia_output.csv:
ersilia -v api run -i eml_canonical.csv -o ersilia_output.csv
The file can be seen here.
I exited the Docker interactive shell and used the command below to copy the output generated inside the container to my home directory:
docker cp dbff41b896d1:/root/ersilia_output.csv /home/leila/ersilia/log_files
The IDL-PPBopt model with EML: IDL-PPBopt.model_eml_canonical_output.csv
Running EML with Ersilia (ersiliaos/eos22io): ersilia_output.csv
I checked every single result and they were the same, apart from numbers 97 and 98: the IDL-PPBopt model output 0.93916506 and 0.9521944, while the Ersilia one output 0.9391651 and 0.95219433, so only a slight rounding difference.
It should also be noted that running with the Ersilia model produced 9 NaN results, making 442 rows in total, while the IDL-PPBopt model left the NaN results out, giving 433.
This is because, as seen in the IDL-PPBopt log here, 9 compounds could not be featurized.
From my understanding, ersiliaos/eos22io kept them and gave a NaN result, while the IDL-PPBopt model left them out completely.
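A comparison like this can be automated with pandas; here is a toy sketch using miniature, made-up versions of the two result sets (the real files have 442 rows):

```python
import pandas as pd

# Miniature, made-up versions of the two result sets: the local
# IDL-PPBopt run drops molecules it cannot featurize, while the Ersilia
# container emits NaN for them so input and output stay the same length.
local = pd.DataFrame({"smiles": ["CCO", "CCC"],
                      "ppb": [0.93916506, 0.9521944]})
ersilia = pd.DataFrame({"smiles": ["CCO", "CCC", "[Na+].[Cl-]"],
                        "ppb": [0.9391651, 0.95219433, float("nan")]})

merged = ersilia.merge(local, on="smiles", how="left",
                       suffixes=("_ersilia", "_local"))
# Rows the local model skipped show up as NaN after the merge.
missing = merged["ppb_local"].isna().sum()
# The shared predictions agree to within rounding (~1e-6).
close = (merged["ppb_ersilia"] - merged["ppb_local"]).abs().dropna().lt(1e-6).all()
```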
I have successfully completed week 2 tasks.
Hi @leilayesufu thank you for the detailed updates. Proteins are organic compounds (containing C-H or C-C bonds), whereas the nine compounds that are not featurized within the original code are inorganic compounds or salts. Since the input and output file for a model within ersilia needs to be of the same length, ersilia simply outputs null corresponding to the molecules that don't produce an output (because they don't get featurized)
Thank you very much!
Slug: CLAMP
Model title: Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language
Publication: https://arxiv.org/abs/2303.03363
Github repo: https://github.com/ml-jku/clamp#clamp-clamp
license: GPLv3
Code: Python
checkpoints provided: yes
Year: 2023
Description:
CLAMP (Contrastive Language-Assay Molecule Pre-training) is an ML model trained on molecule-bioassay pairs, and it can be queried in natural language. Given a textual description of a bioassay, CLAMP can predict which molecules are most relevant to that bioassay.
From the publication, my understanding is that CLAMP can be used to predict how well a particular molecule might work for a given task, like being a drug to treat a disease or having specific properties for a material. It does this by understanding textual descriptions of tasks: you can give CLAMP a list of molecules (SMILES) and a description, and it predicts which molecule best matches the description. A higher probability means the model believes the molecule is more suitable for the written bioassay.
Datasets used:
Relevance to Ersilia: I believe it'll be relevant to Ersilia because it is an AI/ML tool for drug discovery. CLAMP is also designed for zero-shot transfer learning, meaning the model can make predictions on bioassays it wasn't pretrained on. This can be used in drug discovery to explore new molecules for drugs.
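To illustrate the contrastive-scoring idea (with toy encoders invented for this sketch, not the real CLAMP model): molecules and assay descriptions are embedded into a shared space, and a dot product ranks molecules against a description.

```python
import math

# Toy, deterministic "embedding" from character codes - purely
# illustrative, standing in for CLAMP's molecule and text encoders.
def toy_encode(text, dim=8):
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Score = dot product of the two unit vectors in the shared space.
def score(smiles, assay_description):
    m, a = toy_encode(smiles), toy_encode(assay_description)
    return sum(x * y for x, y in zip(m, a))

assay = "inhibition of bacterial growth"
ranked = sorted(["CCO", "c1ccccc1", "CC(=O)O"],
                key=lambda s: score(s, assay), reverse=True)
```

The real model learns the two encoders contrastively so that matching molecule-assay pairs score higher than mismatched ones; this sketch only shows the scoring geometry.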
Slug: DrugApp
Title: Machine learning-based prediction of drug approvals using molecular, physicochemical, clinical trial, and patent-related features
Publication: https://www.tandfonline.com/doi/abs/10.1080/17460441.2023.2153830
Source Code: https://github.com/fulyaciray/DrugApp
Description: DrugApp is a tool that uses machine learning and diverse data sources to predict the likelihood of regulatory approval for drugs. It uses data from clinical trials and patent-related information, and it also considers the physical and chemical properties of the drug candidate molecules. The prediction method is a random forest classifier used to build disease-group-specific approval prediction models. The disease groups the model predicts regulatory approval for include "Alimentary", "Infective", "Blood", "Dermatological", "Heart", "Hormonal", "Immunological", "Musculoskeletal", "Neoplasms", "Nervous", "Rare", "Respiratory", "Sensory", and "Urinary". The project contributes to drug discovery by helping scientists and drug makers predict whether their drug will be approved by regulatory bodies, saving time and resources in the drug discovery process.
License: GPL-3.0 license
Code: Python
Year: 2022
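The core modelling idea can be sketched with scikit-learn on synthetic data (features and labels here are invented; the real tool uses clinical-trial, patent, and physicochemical features per disease group):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for DrugApp-style features (e.g. trial counts,
# molecular weight, logP, ...) and a toy "approved" label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One random forest per disease group in the real tool; one here.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-candidate approval likelihood, as a probability.
approval_prob = clf.predict_proba(X[:3])[:, 1]
```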
Slug: RLBind
Title: RLBind: a deep learning method to predict RNA–ligand binding sites
Publication: https://www.researchgate.net/publication/365587894_RLBind_a_deep_learning_method_to_predict_RNA-ligand_binding_sites
Source Code: https://github.com/KailiWang1/RLBind/tree/main
Description: This model uses a deep convolutional neural network (CNN) to predict locations on RNA molecules where ligands are most likely to bind. This can be useful when designing drugs, in drug discovery generally, and in using RNAs as targets for treating diseases. (RNA is ribonucleic acid, a nucleic acid present in all living cells.)
License: Apache-2.0 license
Code: Python
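The idea can be illustrated with a toy 1D CNN (architecture invented for this sketch, not RLBind's actual network): a one-hot-encoded RNA sequence goes in, and a per-nucleotide binding probability comes out.

```python
import torch
import torch.nn as nn

# Toy per-nucleotide binding-site predictor: a 1D convolution over a
# one-hot RNA sequence (4 channels = A, C, G, U), then a 1x1 conv head
# producing one probability per base.
class TinyRLBind(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 8, kernel_size=5, padding=2)
        self.head = nn.Conv1d(8, 1, kernel_size=1)

    def forward(self, x):              # x: (batch, 4, seq_len)
        return torch.sigmoid(self.head(torch.relu(self.conv(x))))

seq = torch.zeros(1, 4, 12)
seq[0, 0, :] = 1.0                     # a poly-A toy sequence
probs = TinyRLBind()(seq)              # (1, 1, 12) per-base probabilities
```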
Model 1: CLAMP
I set up the environment using the commands in the GitHub repository. A YAML file was provided with all the dependencies, so running conda env create -f env.yml
created the environment and installed all the Python and pip dependencies.
I activated the environment.
I then added the pretrained-model example to a main.py file and ran python main.py
I encountered the following error: cliperror.txt. This was due to clip being a dependency not mentioned in the environment file; it was fixed by installing the CLIP (Contrastive Language-Image Pre-training) model via
pip install git+https://github.com/openai/CLIP.git
Then I ran the file again and got the required output here: output.txt.
I then changed the content of the pretrained-model example to use the first 4 SMILES of the EML file with the same bioassay and got this result here: smiles.txt
To test this model, I followed the setup instructions in the GitHub repository and downloaded the required dependencies.
Then I navigated to the scripts directory and ran the following command: python rf_model_for_predicting_drug_approval.py Rare
to test drug approval for the Rare disease group. I encountered the following error: error.txt. This was due to a wrong file path: the code used a Windows-style path separator (\) while I was running the code on Linux (Ubuntu), so I changed it to the appropriate separator (/) in all the necessary file paths.
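A portable alternative to hand-editing separators is to build paths with pathlib (the directory and file names below are illustrative, not DrugApp's actual layout):

```python
from pathlib import Path

# Build the path from its components instead of hard-coding "\" or "/";
# pathlib uses the right separator for the current operating system.
data_file = Path("Datasets") / "Rare" / "features.csv"
print(data_file.as_posix())  # Datasets/Rare/features.csv
```

`os.path.join` achieves the same thing in older code bases.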
I ran the command python rf_model_for_predicting_drug_approval.py Rare
again and got the following log. rare_output.txt
That output just shows the date and time, but it saved a results CSV file, as seen here.
results_prospective_analysis_Rare.csv
I also ran the command python rf_model_for_predicting_drug_approval.py Blood
here is the output and the log.
log: blood.txt
output: results_prospective_analysis_Blood.csv
This can be done with python evaluation_metrics_and_feature_importance.py Rare
Note: I changed the file path separator in this file too.
The output is the performance scores and a plot of the permutation feature importance analysis, which are printed on screen. Additionally, a CSV file is created (in the "results_feature_importances" folder) containing the MDI-based feature importance analysis results, where the file name starts with "results_feature_importances_MDI" (e.g. results_feature_importances_MDI_Rare).
CSV FILE CREATED:
results_feature_importances_MDI_Rare.csv
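The two importance views the script produces can be sketched with scikit-learn on synthetic data (invented here; only one feature is made informative, so both methods should flag it):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data where only feature 2 determines the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = (X[:, 2] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# MDI importances come from the fitted forest itself; permutation
# importances come from shuffling each feature and measuring the drop
# in score.
mdi = clf.feature_importances_
perm = permutation_importance(clf, X, y, n_repeats=5, random_state=1)
```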
For this model, I cloned the Git repository and installed the listed requirements one after the other. I didn't use the environment YAML file given because it is for a GPU-enabled version.
conda create -n RLBind python==3.7
to create a new environment.
I then installed the required packages:
conda install pytorch==1.5.0 torchvision cpuonly -c pytorch
(PyTorch, CPU-only)
pip install --upgrade torch
pip install numpy==1.16.4
pip install scikit-learn
pip install pandas
To test the model, I ran
cd ./src/
python predict.py
I got the following error (cuda.txt) due to the use of CUDA in the file, so I edited the predict.py file so it wouldn't depend on CUDA.
I removed the line model = model.cuda()
and changed the line from
model.load_state_dict(torch.load(model_file))
to model.load_state_dict(torch.load(model_file,map_location=torch.device('cpu')))
Then I ran python predict.py
again and got the predictions, as seen below.
@DhanshreeA Hi, I'd love some feedback on the models I provided.
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application