Closed Kadeniyi23 closed 10 months ago
Hello @Kadeniyi23 Welcome to Ersilia. Be sure to complete the installation steps and run a model. The complete guide can be found here: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/installation and report here your progress. Thanks.
Whew! Just figured out we're meant to comment on each milestone.
On the 3rd of October I introduced myself in the #general channel on Slack.
I successfully created an issue on the 3rd of October 💪
The instructions for the third task are listed here
For the first stage of installation I installed WSL, as I am using a Windows operating system and not Linux. This ran smoothly.
Next I installed the GCC compiler using the code below:
sudo apt install build-essential
I used the Windows Subsystem for Linux terminal instead of the Ubuntu terminal, as my Ubuntu terminal was not working.
After multiple tries I used the WSL terminal instead.
I installed Miniconda on the WSL terminal and it was successful.
I successfully installed the GitHub CLI.
I successfully installed Git LFS from Conda.
Git LFS was installed and initialized.
To activate the environment:
conda activate ersilia
To install the Ersilia package from GitHub:
git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
pip install -e .
Next I installed the Isaura data lake (version 0.1) using the following code
python -m pip install isaura==0.1
conda create -n ersilia
ersilia --help
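A minimal sketch for checking that the command-line tools installed in the steps above are visible on PATH before moving on. The list of executable names is an assumption based on the tools mentioned in this comment (for example, `git-lfs` is assumed to be the name under which Git LFS is exposed), and `check_tools` is a hypothetical helper, not part of Ersilia.

```python
import shutil

# Executable names are assumptions derived from the tools mentioned above.
REQUIRED_TOOLS = ["gcc", "conda", "git", "gh", "git-lfs"]


def check_tools(tools):
    """Return a mapping of tool name -> True if an executable is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in check_tools(REQUIRED_TOOLS).items():
        print(f"{tool}: {'found' if found else 'missing'}")
```

Running it inside the WSL terminal used for the installation should show which tools, if any, still need to be installed or added to PATH.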
Fetching the model: eos3b5e (molecular weight)
I fetched the model with the following code:
ersilia -v fetch eos3b5e > princed.log 2>&1
The output is logged in
princed.log
Serving the model
To serve the model eos3b5e, I used the following code:
ersilia serve eos3b5e > serving_molecular_weight.log 2>&1
The output is logged in serving_molecular_weight.log
Running the model
To run the model, I used the code below:
ersilia -v run -i "CCCC" > running_molecular_weight.log 2>&1
The output: running_molecular_weight.log
It yielded a type error: TypeError: object of type 'NoneType' has no len()
This also aligns with similar problems being faced here
Also, using the code ersilia -v api calculate -i "CCCC"
yielded a KeyError as shown below
Following the suggestion here, I changed the code in the file file.py from
if len(h) == 1:
to
if h is not None and len(h) == 1:
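To illustrate why the guard changes the behaviour, here is a small standalone sketch. It does not reproduce the Ersilia code around this line: `has_single_item` is a hypothetical helper, and what `h` holds in the real file is not shown in this thread.

```python
def has_single_item(h):
    """Mirror the guarded condition: True only when h is not None and has length 1."""
    return h is not None and len(h) == 1


# Without the guard, len(None) raises:
# TypeError: object of type 'NoneType' has no len()
print(has_single_item(None))          # False
print(has_single_item(["smiles"]))    # True
print(has_single_item(["a", "b"]))    # False
```

Because `and` short-circuits, `len(h)` is never evaluated when `h` is `None`, which is what avoids the TypeError.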
After following the suggestion and running the code
ersilia -v run -i "CCCC" > running_molecular_weight_2.log 2>&1
I got the expected output, logged in
running_molecular_weight_2 (1).log
Following the suggestion of @DhanshreeA here, I set up the Conda environment again using Python 3.7 and reinstalled Ersilia.
Fetching the model eos3b5e with the following code ersilia -v fetch eos3b5e > fetching_molecular_weight.log 2>&1
yielded the following output
fetching_molecular_weight.log
Serving the model with the following code ersilia serve eos3b5e > serving_molecular_weight2.log 2>&1
yielded the following output serving_molecular_weight2.log
Running the model with the following code ersilia -v run -i "CCCC" > running_molecular_weight3.log 2>&1
yielded the following output running_molecular_weight3.log
I was able to successfully get the expected output after reinstalling Ersilia and running the model eos3b5e
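The `> file.log 2>&1` redirect used in the commands above sends both standard output and standard error to one log file. A rough Python equivalent, in case the steps need to be scripted, could look like the sketch below. `run_logged` is a hypothetical helper, and the demo uses the Python interpreter as a stand-in command so it runs without Ersilia installed.

```python
import subprocess
import sys


def run_logged(command, log_path):
    """Run a command, writing stdout and stderr to the same log file.

    Similar in spirit to `command > log_path 2>&1` in a shell.
    Returns the exit code of the process.
    """
    with open(log_path, "w") as log:
        completed = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command; with Ersilia installed the list could instead be
    # ["ersilia", "-v", "fetch", "eos3b5e"].
    code = run_logged([sys.executable, "-c", "print('hello')"], "example.log")
    print("exit code:", code)
```

A non-zero exit code is a quick signal that the log is worth opening before moving on to the next step.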
Thank you for the updates @Kadeniyi23. A quick piece of feedback: you do not need to modify the code within the ersilia repository if you run into this error. The correct command to run an API is, as you have mentioned above, ersilia -v run -i <input>
It is because of this command that you got the correct output (and not because of updating the ersilia code).
Yes I believe so too. I tried it after modifying the code but it came out with an error, but after I reinstalled Ersilia with python 3.7 I was able to get the expected output. Thanks for your feedback
Fourth task:
My name is Adeniyi Kabirat. As a data scientist and an aspiring AI/ML engineer, I worked with a few data models in the past, ranging from building a highly intricate recommendation system to building machine learning models in hackathons, but I joined Outreachy to be able to contribute to open source. Working with open source has not particularly been a dream of mine since I started my journey in data science in 2020, but along the way, I came to learn that a lot of open source programs and companies truly help change the world. And I really wanted to be a part of that. Hence, I applied for the Outreachy program. When picking the programs to contribute to after being picked as an applicant, Ersilia was the one program that stood out to me. A company that creates AI/ML models for biomedical research. Sign me up! Given my background—a bachelor's degree in microbiology—and my history of data science, I believe I would be a true asset to the Ersilia team. My current skills include proficiency in Python, R, and Conda. While I haven't had much experience with Docker, I have been involved with a few side projects that have utilized the platform.
Joining the Ersilia community provides me with an avenue to join a meaningful program that aims to bridge gaps that should not exist. A particular goal that aligns with mine is Ersilia's support for research on infectious and neglected diseases in low-income countries. Being from a low-income country myself, I have seen the effects of infectious disease in a community, and a company that makes that a goal is one I will be delighted to work with.
Being picked as an applicant and eventually as an intern provides me with an opportunity to contribute to a community and workspace that prioritizes growth and provides easy and open access to AI/ML models and research. My time spent as an intern would be one spent growing and learning, building and budding an experience with Python, Docker, and Conda, and contributing to a team that seeks to provide medical solutions worldwide. I would be fully immersed in an AI/ML project while collaborating with minds worldwide to seek a solution to a problem. Post-internship, I hope to be able to come out the other side with more well-rounded knowledge in AI and ML, adding to the Ersilia team as a whole and contributing more to open-source programs. Furthermore, I am eager to gain hands-on experience in implementing AI and ML algorithms and techniques and understand how they can be applied in the healthcare industry. This internship would also provide me with the opportunity to enhance my problem-solving skills and learn from experienced professionals in the field, ultimately preparing me for a successful career in AI/ML research and development.
I have submitted my initial contribution to the Outreachy website
After going through the suggested models, I selected the STOUT (SMILES to IUPAC) model. I selected the model after reading the publication here
I selected the model for two major reasons:
Step 1: Following the instructions on the GitHub page, I downloaded Miniconda3 on my Linux system with the code
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Step 2: To install Miniconda, I ran the following code:
bash Miniconda3-latest-Linux-x86_64.sh
Step 3: To activate Miniconda and check its version:
source ~/.bashrc
conda --version
Step 4: Installing STOUT
conda create --name STOUT python=3.8
conda activate STOUT
conda install -c decimer stout-pypi
When I ran conda install -c decimer stout-pypi, it produced an error. The error log is installation_error.log
Output in format: Requested package -> Available versions The following specifications were found to be incompatible with your system:
- feature:/linux-64::__glibc==2.36=0
- feature:|@/linux-64::__glibc==2.36=0
Your installed version is: 2.3'
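The message above is about the glibc version conda sees on the system. As a rough way to inspect this from Python, the sketch below reads the C library version reported for the running interpreter and compares it with a required version. It does not reproduce conda's solver; `parse_version` and `glibc_at_least` are hypothetical helpers, and the "2.36" figure is only taken from the error text.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.36' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(required, reported=None):
    """Return True if the reported libc version is >= the required one.

    `reported` defaults to the version from platform.libc_ver(), which may be
    an empty string on systems that do not use glibc.
    """
    if reported is None:
        reported = platform.libc_ver()[1]
    if not reported:
        return False
    return parse_version(reported) >= parse_version(required)


print(platform.libc_ver())
print(glibc_at_least("2.36", reported="2.35"))  # False
print(glibc_at_least("2.36", reported="2.36"))  # True
```

Comparing tuples of integers rather than strings matters here, since "2.9" would otherwise sort after "2.36".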
Using the GitHub repository directly, I attempted to install the package with the following code
pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git
STOUT-pip was successfully installed
Step 5: Simple usage.
Saving the example to a Python file and running it on the WSL command line, I encountered an error:
OSError: [Errno 0] JVM DLL not found: Define/path/or/set/JAVA_HOME/variable/properly
Following this error, I installed a default version of Java using the code
sudo apt update
sudo apt install openjdk-11-jre
After that, I set the JAVA_HOME variable using the code below
export JAVA_HOME=/usr/bin
Then I ran the python file to get the desired output.
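Since the error message points at JAVA_HOME, one way to sanity-check the setting is to look for a JVM shared library underneath that directory. The sketch below does this with the standard library only. The library file names are assumptions covering common platforms, and `find_jvm_library` is a hypothetical helper rather than anything provided by STOUT or its Java bridge.

```python
import os
from pathlib import Path

# Library file names are assumptions covering Linux, macOS and Windows.
JVM_LIBRARY_NAMES = ("libjvm.so", "libjvm.dylib", "jvm.dll")


def find_jvm_library(java_home):
    """Return the first JVM shared library found under java_home, or None."""
    root = Path(java_home)
    if not root.is_dir():
        return None
    for name in JVM_LIBRARY_NAMES:
        for candidate in root.rglob(name):
            return candidate
    return None


if __name__ == "__main__":
    java_home = os.environ.get("JAVA_HOME")
    if not java_home:
        print("JAVA_HOME is not set")
    else:
        print(find_jvm_library(java_home) or f"No JVM library found under {java_home}")
```

If nothing is found, JAVA_HOME may be pointing at a directory that holds the `java` launcher but not the JDK or JRE installation itself.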
Good job! @Kadeniyi23 I have a small question, why did you need to install conda again? Did you not have conda on your system from having installed ersilia before?
Thank you for the feedback, @DhanshreeA. I changed my approach, shifting from the Windows Subsystem for Linux (WSL) command-line interface (CLI) to Visual Studio Code. After integrating Visual Studio Code with WSL, I reinstalled Miniconda to ensure it worked properly.
To run predictions for the EML, I first attempted to run the following code through the STOUT model; it yielded a Java implementation error.
import csv
from STOUT import translate_forward

# Define a function to translate SMILES to IUPAC name
def smiles_to_iupac(smiles):
    try:
        iupac_name = translate_forward(smiles)
        return iupac_name
    except Exception as e:
        return str(e)  # Return an error message if translation fails

# Path to the input CSV file
input_csv_path = '/root/miniconda3/envs/eos4f95/bin/eml_canonical.csv'
# Path to the output CSV file
output_csv_path = '/root/miniconda3/envs/eos4f95/bin/translated_results.csv'

# Open the input CSV file for reading and the output CSV file for writing
with open(input_csv_path, 'r') as input_csvfile, open(output_csv_path, 'w', newline='') as output_csvfile:
    csvreader = csv.reader(input_csvfile)
    # Skip the header row if it exists
    header = next(csvreader, None)
    # Create a CSV writer for the output file
    csvwriter = csv.writer(output_csvfile)
    # Write the header to the output CSV file
    if header:
        csvwriter.writerow(header + ["IUPAC Name"])  # Add a new column header
    # Iterate through each row of the input CSV file
    for row in csvreader:
        # Assuming the SMILES strings are in the second column (index 1)
        smiles = row[1]
        # Translate the SMILES to IUPAC name
        iupac_name = smiles_to_iupac(smiles)
        # Write the row to the output CSV file, including the new IUPAC name
        csvwriter.writerow(row + [iupac_name])

print(f"Results have been written to {output_csv_path}.")
The error is detailed here hs_err_pid16667.log
I made various attempts to debug it, including searching for the error on Stack Overflow and asking for help in the Slack group
https://github.com/ersilia-os/ersilia/issues/823#issuecomment-1751671814
Try importing translate_reverse too
Thank you. The python file you shared also gave the same error. I did it in a couple of ways,
All shared the same error 😞
https://github.com/ersilia-os/ersilia/issues/823#issuecomment-1751694319
But you could run predictions earlier with it, when testing? I think it might have been an issue with the jdk you installed
When I looked into it, I saw that the fix involved downloading an earlier version of Java (13.0), as JPype was only tested with versions 1 through 13.0. The version I used was 17.1, and installing an earlier version of Java was not recommended for production on the Java website. I figured this was because every update comes with a lot of bug fixes.
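A small sketch of how the installed Java major version could be checked against a tested range. The parser works on the text that `java -version` typically prints; the sample strings are illustrative, `java_major_version` and `within_tested_range` are hypothetical helpers, and the upper bound of 13 is only the figure quoted in this comment, not something verified against the JPype documentation.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` style output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_381" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def within_tested_range(major, max_tested=13):
    """True if a major version was found and does not exceed max_tested."""
    return major is not None and major <= max_tested


# Illustrative version strings, not taken from the logs in this thread.
for sample in ('openjdk version "17.0.1" 2021-10-19',
               'openjdk version "11.0.20" 2023-07-18',
               'java version "1.8.0_381"'):
    major = java_major_version(sample)
    print(major, within_tested_range(major))
```

On a real system the input string could come from running `java -version`, which writes its output to standard error rather than standard output.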
After many attempts to debug the STOUT (SMILES to IUPAC) model I picked, I have decided to switch to the NCATS Rat Liver Microsomal Stability model. According to the documentation, NCATS-ADME contains several models that would be useful to pharmacy and pharmacology as a whole. The different models have different capabilities. One example is the RLM Stability model, which helps predict the stability of a compound. This would enable researchers to estimate the potential stability and lifespan of a compound in the body. Another example is the PAMPA pH 7.4 model, which gauges the permeability of drugs across cellular membranes. With this, researchers are able to predict the likelihood of a drug being easily absorbed in the body.
But the main reason I chose this model is that it comprises more than one AI/ML model, which gives a front-seat look at the different machine learning models implemented. In the PAMPA pH 7.4 model, Chemprop, a model built by MIT, is used.
I followed these steps to install the NCATS Rat Liver Microsomal Stability model on my system.
git clone --recursive https://github.com/ncats/ncats-adme.git
cd /home/kabirat/ncats-adme
conda env create --prefix ./env -f environment.yml
python app.py
I downloaded the Essential Medicines List file from here and extracted its SMILES column with the following script.
import csv

input_file = 'eml_canonical.csv'
output_file = 'SMILES.csv'

def extract_second_column(input_file, output_file):
    try:
        with open(input_file, 'r', newline='') as infile, open(output_file, 'w', newline='') as outfile:
            reader = csv.reader(infile)
            writer = csv.writer(outfile)
            for row in reader:
                if len(row) >= 2:  # Check if the row has at least two columns
                    second_column = row[1]  # Index 1 is the second column (0-based index)
                    writer.writerow([second_column])
        print(f"Second column extracted from '{input_file}' and saved to '{output_file}'.")
    except FileNotFoundError:
        print(f"File '{input_file}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

extract_second_column(input_file, output_file)
With the app running on my system, I opened it in Chrome here and ran the CSV with the SMILES notation through it. I got the following results:
RLM (Rat Liver Microsomal Stability) - ADME_Predictions_2023-10-11-132525.csv
Pion’s patented µSOL assay (Solubility) - ADME_Predictions_2023-10-11-132606.csv
Parallel artificial membrane permeability assay (PAMPA) (Assay pH=7.4) - ADME_Predictions_2023-10-11-132710.csv
Parallel artificial membrane permeability assay (PAMPA) (Assay pH=5.0) - ADME_Predictions_2023-10-11-132657.csv
Human Liver Cytosolic Stability - ADME_Predictions_2023-10-11-132911.csv
Hi @Kadeniyi23 It is unfortunate that JRE kept giving you issues while trying to run STOUT, and it is good that you could get NCATS to run on your system. As a bonus task, could you try and get the NCATS model to run not as a server but as a simple python script? Let me know if you need any clarifications.
Hi @DhanshreeA. Thank you for your feedback. I need further clarification. Do you mean running the app.py Python script independently, in another environment created in order to be able to run the model?
To compare the results obtained from the Parallel artificial membrane permeability assay (PAMPA) (Assay pH=7.4) in the CSV file here with the model implemented in the Ersilia Model Hub, I searched the hub for
Parallel artificial membrane permeability assay
and picked the model
Parallel Artificial Membrane Permeability Assay (PAMPA) 7
I then fetched it with
ersilia -v fetch eos9tyg
successfully 😄
Parallel Artificial Membrane Permeability is an in vitro surrogate to determine the permeability of drugs across cellular membranes. To understand the model used: the parallel artificial membrane permeability assay measures how easily substances pass through synthetic substances that mimic the lining of the human gastrointestinal tract. The original model provided by NCATS-ADME seeks to predict whether a compound has low or high permeability. If the predicted class is '1', the compound is predicted to have 'low or moderate permeability' (i.e., log Peff < 2.0), and if the predicted class is '0', the compound is predicted to have 'high permeability' (i.e., log Peff > 2.5). In the interpretation of the eos9tyg model given here, the output is a float denoting the probability of the compound being poorly permeable. The higher the number, the more likely the compound is poorly permeable.
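As a worked instance of the class definitions quoted above, the sketch below maps a measured log Peff value to the NCATS class label. The two cutoffs (2.0 and 2.5) are the ones stated in this comment; how values between them are treated is not stated here, so returning `None` for that range is an assumption, and `permeability_class` is a hypothetical helper.

```python
def permeability_class(log_peff):
    """Map a log Peff value to the class labels described above.

    Returns 1 for low or moderate permeability (log Peff < 2.0),
    0 for high permeability (log Peff > 2.5), and None for values in
    between, which the description above does not assign to a class.
    """
    if log_peff < 2.0:
        return 1
    if log_peff > 2.5:
        return 0
    return None


for value in (1.2, 2.2, 3.1):
    print(value, permeability_class(value))
```

Note that the class labels are inverted relative to what one might expect: class 1 is the poorly permeable class, which matches the eos9tyg output being the probability of poor permeability.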
Taking the first ten values and comparing the two predictions:

| Compound | Ersilia model eos9tyg prediction | Permeability | NCATS model | Permeability | | |
|---|---|---|---|---|---|---|
| Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 | 1 | poor permeability | 1 (0.9) | low permeability | | |
| C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | 0.156 | medium to high permeability | 0 (1.0) | moderate or high permeability | | |
| CC(=O)Nc1sc(nn1)S(=O) | 1 | poor permeability | 1 (0.97) | low permeability | 1 (1.0) | low permeability |
| CC(O)=O | 1 | poor permeability | 1 (0.99) | low permeability | | |
| CC(=O)NC@@HC(O)=O | 1 | poor permeability | 0 (0.96) | moderate or high permeability | | |
| CC(=O)Oc1ccccc1C(O)=O | 1 | poor permeability | 1 (0.99) | low permeability | | |
| NC1=NC(=O)c2ncn(COCCO)c2N1 | 1 | poor permeability | 1 (0.99) | low permeability | | |
| OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | 0.034 | medium to high permeability | 0 (0.99) | moderate or high permeability | | |
| CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 | 0.248 | medium to high permeability | 0 (0.99) | moderate or high permeability | | |
| CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 | 0.268 | medium to high permeability | 0 (0.99) | moderate or high permeability | | |
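The two models agreed on 9 of the 10 compounds compared. The exception is the fifth compound (CC(=O)NC@@HC(O)=O), which eos9tyg predicts as poorly permeable while the NCATS model predicts moderate or high permeability.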
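The agreement between the two sets of predictions in the table can be scored with a few lines of Python. The values below are copied from the table, in row order. Converting the eos9tyg probability into a class with a 0.5 threshold is an assumption made for this sketch, and `to_class` and `agreement` are hypothetical helpers.

```python
# Values copied from the table above, in row order.
EOS9TYG_SCORES = [1, 0.156, 1, 1, 1, 1, 1, 0.034, 0.248, 0.268]
NCATS_CLASSES = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0]


def to_class(score, threshold=0.5):
    """Convert a probability of poor permeability into class 1 or 0.

    The 0.5 threshold is an assumption, not something stated by either model.
    """
    return 1 if score >= threshold else 0


def agreement(scores, classes, threshold=0.5):
    """Return (matches, total, indices of rows where the models disagree)."""
    matches = [to_class(s, threshold) == c for s, c in zip(scores, classes)]
    mismatched = [i for i, ok in enumerate(matches) if not ok]
    return sum(matches), len(matches), mismatched


matched, total, mismatched = agreement(EOS9TYG_SCORES, NCATS_CLASSES)
print(f"{matched}/{total} agree; mismatched row indices: {mismatched}")
# prints "9/10 agree; mismatched row indices: [4]"
```

Row index 4 (zero-based) is the fifth compound in the table, where eos9tyg outputs 1 and the NCATS model outputs class 0.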
I was successfully able to install and run Docker Hub. I was also able to successfully run the model eos3b5e from Docker desktop 👍
CalcAMP
In this model, the authors seek to predict the activity of antimicrobial peptides. Antimicrobial peptides (AMPs) can be quite effective in fighting the worldwide multi-drug resistance pandemic. Finding effective and potent AMPs is an arduous process, so a machine learning model that can accurately predict whether a peptide possesses these antimicrobial properties would be useful and time-saving. The machine learning model predicts the antimicrobial activity of peptides by analyzing various features, including general physicochemical properties and sequence composition.
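To make "sequence composition" concrete, here is a minimal sketch of one common composition feature: the fraction of each of the 20 standard amino acids in a peptide. This is an illustration only and is not taken from the CalcAMP code; `aa_composition` is a hypothetical helper and the example peptide is made up.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a peptide sequence.

    Characters outside the 20 standard one-letter codes are ignored.
    """
    counts = Counter(residue for residue in sequence.upper() if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Made-up example peptide, used only to show the output shape.
composition = aa_composition("KKLLKK")
print(composition["K"], composition["L"])
```

A fixed-length vector like this is the kind of input that tree-based classifiers can consume directly, regardless of peptide length.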
The dataset of peptides was collated from publicly available data in five different databases. Different ML algorithms for classifying AMP versus non-AMP were compared using the package PyCaret 2.3.6. Additionally, a Multi-layer Perceptron model created with Scikit-Learn 0.23.2 was included in the comparison. The final models created for the retained algorithms are LightGBM, XGBoost, CatBoost, Random Forest (RF), and Extra Trees (ET) classifiers.
With Ersilia's goal of democratising access to AI/ML models relating to biomedical research, the CalcAMP model, which predicts the antimicrobial activity of peptides, is an added boon when it comes to research on multi-drug resistance. It enables us to assess the qualities of different AMPs, as well as detect which ones would be active against a plethora of Gram-positive and Gram-negative bacteria.
The link to the model: CalcAMP. Although the model has not been published and released yet, a sample prediction is shown in Simple prediction.ipynb. The different models are also saved in the models folder. The dataset used is linked here.
AquaPred
This model seeks to accurately predict the molecular solubility of compounds using an attention-based graph neural network. It plays a significant role in predicting the aqueous solubility of compounds in drug discovery. During drug discovery, active pharmaceutical ingredients (APIs) are key to high drug efficacy. With this model, the authors aim to predict the aqueous solubility of compounds, which is a key physicochemical attribute required for API characterization.
The model uses the dataset contained here, as compiled by other research referenced here. The data was fitted to four different graph neural networks, namely SGConv, GIN, GAT, and AttentiveFP, to identify the most effective model for predicting solubility. The study shows that AttentiveFP was the best model; it uses SMILES as the input for molecular representation and captures both intermolecular and intramolecular properties through information propagation and gated recurrent units (GRU).
In-silico prediction of water solubility could lead to higher efficacy for drugs while shortening the drug development timeline. One of Ersilia's goals is to support research in many low- and middle-income countries. With machine learning models like this, we get to bypass weeks or maybe months of research, tapping into the power of artificial intelligence to accelerate the drug discovery process.
The code for the model can be found here. No recent releases have been published, but the code looks ready to go. The AttentiveFP model used here, in the models folder, could be tested further to scan for bugs.
P2Rank
This model seeks to predict the ligand binding sites (LBS) of proteins. Identifying these sites and the interactions that ensue is needed for the elucidation of the molecular mechanisms of enzymes, the regulation of protein oligomerization, or the design of new drugs in cases where drug resistance has occurred, which can be a time-consuming process when performed experimentally. With this model, a protein's ligand binding sites are predicted from the protein's 3-dimensional structure. The model comprises not only the CLI app (P2Rank) but also a web app, PrankWeb3. PrankWeb accepts a protein structure as its input, computes evolutionary conservation, and predicts binding sites, which are then mapped onto the structure and can be viewed.
The model has two implementations: the CLI app, P2Rank, and the web app, PrankWeb3. P2Rank uses not only machine-learning-based knowledge but also a combination of geometric, energetic and evolution-based knowledge, a combination also seen in the experimental methods used for ligand-binding site prediction of proteins. It applies different characteristics (the protein's structure, physico-chemical properties, and evolutionary information) to a mesh and then constructs a machine learning model using this representation. The ML model is then used to identify points on the protein's surface that can potentially bind to ligands, and the identified points are grouped into a list of surface patches that correspond to the predicted ligand binding sites (LBSs).
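The last step described above, grouping high-scoring surface points into patches, can be illustrated with a simple distance-based clustering. This is only a sketch of the general idea and not P2Rank's actual algorithm: the score threshold, the distance cutoff, and the function name `cluster_ligandable_points` are all assumptions chosen for the example.

```python
import math


def cluster_ligandable_points(points, score_threshold=0.5, distance_cutoff=3.0):
    """Group high-scoring surface points into patches.

    `points` is a list of ((x, y, z), score) pairs. Points scoring below
    score_threshold are dropped. The rest are grouped so that any two points
    closer than distance_cutoff end up in the same patch (single linkage).
    Both parameter values are illustrative assumptions.
    """
    kept = [xyz for xyz, score in points if score >= score_threshold]
    unvisited = set(range(len(kept)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster = [seed]
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            neighbours = [i for i in unvisited
                          if math.dist(kept[current], kept[i]) <= distance_cutoff]
            for i in neighbours:
                unvisited.remove(i)
                cluster.append(i)
                frontier.append(i)
        clusters.append(sorted(kept[i] for i in cluster))
    # Largest patches first, as a stand-in for ranking predicted sites.
    clusters.sort(key=len, reverse=True)
    return clusters


example = [((0, 0, 0), 0.9), ((1, 0, 0), 0.8), ((2, 0, 0), 0.7),
           ((10, 10, 10), 0.9), ((0, 1, 0), 0.1)]
for patch in cluster_ligandable_points(example):
    print(patch)
```

In this toy input, the three points along the x axis form one patch, the distant point forms its own patch, and the low-scoring point is discarded.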
One of the core reasons for implementing this model is the design of new drugs when there is a sudden case of drug resistance. In low- and middle-income countries, where drug-resistant strains may arise, the rapid implementation of drug design and production may save millions of lives.
Following the installation steps, the requirements for P2Rank are Java and PyMOL, which is used to view visualizations. It is recommended to run it from bash, as the model is a command-line program. The model looks implementable, with the link to the code found here. No installation is required, as the package is downloaded from GitHub releases. The latest version (version 2.4.1) can be downloaded as a compressed file. With various commands, the input is entered as a PDB file and predicted values will be generated as follows:
The web app is available here. This system can be implemented in three modes
Hi @Kadeniyi23 many thanks for the updates and sincerest apologies for responding late. Please look at this comment for further clarification. https://github.com/ersilia-os/ersilia/issues/849#issuecomment-1768229150 Also it is a bonus task, please don't feel pressured.
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application