ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

Contribution Period: Kabirat Adeniyi #823

Closed Kadeniyi23 closed 10 months ago

Kadeniyi23 commented 11 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

carcablop commented 11 months ago

Hello @Kadeniyi23, welcome to Ersilia. Be sure to complete the installation steps and run a model. Follow the complete guide at https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/installation and report your progress here. Thanks.

Kadeniyi23 commented 11 months ago

Week 1

Whew! Just figured out we're meant to comment the milestone.

1st Task: Joining the communication channels

On the 3rd of October, I introduced myself in the #general channel on Slack. image

Kadeniyi23 commented 11 months ago

2nd Task: Creating an issue

I successfully created an issue on the 3rd of October 💪

Kadeniyi23 commented 11 months ago

3rd Task: Install the Ersilia Model Hub and test the simplest model

The instructions for the third task are listed here

  1. For the first stage of installation, I installed WSL, as I am using a Windows operating system and not Linux. This ran smoothly.

  2. Next, I installed the GCC compiler using the command below:

     sudo apt install build-essential

     I used the Windows Subsystem for Linux terminal instead of the Ubuntu terminal, as my Ubuntu terminal was not working. image

     After multiple tries, I used the WSL terminal instead.

  3. I installed Miniconda in the WSL terminal, and it was successful.

  4. I successfully installed the GitHub CLI. image

  5. I successfully installed Git LFS from Conda. image

  6. Git LFS was installed and initialized. image

Installing Ersilia

To test a model with Ersilia

The output: running_molecular_weight.log

It yielded a TypeError: object of type 'NoneType' has no len()

This also aligns with similar problems being faced here

Also, using the command ersilia -v api calculate -i "CCCC" yielded a KeyError, as shown below

image

Following the suggestion here, I changed the check in the file file.py from if len(h) == 1: to if h is not None and len(h) == 1:.
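As a minimal illustration of why the extra check avoids the TypeError: the function names and inputs below are hypothetical and are not Ersilia's actual code; they only show the guard pattern in isolation.

```python
def has_single_element(h):
    """Return True only when h is a sequence with exactly one element.

    `h` stands in for a value that may be missing (None). The name is
    hypothetical and only illustrates the guard.
    """
    # Short-circuit evaluation: len(h) is never called when h is None.
    return h is not None and len(h) == 1


def unguarded(h):
    """The original form of the check: raises TypeError when h is None."""
    return len(h) == 1


print(has_single_element(None))        # guarded check handles None
print(has_single_element(["smiles"]))  # one-element list
```

The guard only hides the symptom; it does not explain why the value was None in the first place.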

After following the suggestion and running the command

ersilia -v run -i "CCCC" > running_molecular_weight_2.log 2>&1

I got the expected output logged

running_molecular_weight_2 (1).log

image

Following the suggestion of @DhanshreeA here, I set up the Conda environment again using Python 3.7 and reinstalled Ersilia.

I was able to get the expected output after reinstalling Ersilia and running the model eos3b5e. image

DhanshreeA commented 11 months ago

Thank you for the updates @Kadeniyi23. A quick feedback: You do not need to modify the code within ersilia repository if you run into this error. The correct command to run an api is as you have mentioned above ersilia -v run -i <input>. It is because of this command that you got the correct output (and not because of updating the ersilia code).

Kadeniyi23 commented 11 months ago

Yes, I believe so too. I tried it after modifying the code, but it came out with an error; after I reinstalled Ersilia with Python 3.7, I was able to get the expected output. Thanks for your feedback.

Kadeniyi23 commented 11 months ago

Fourth task:

Motivation Statement: Why I Joined Outreachy and wish to work at Ersilia 🚀

My name is Adeniyi Kabirat. As a data scientist and an aspiring AI/ML engineer, I worked with a few data models in the past, ranging from building a highly intricate recommendation system to building machine learning models in hackathons, but I joined Outreachy to be able to contribute to open source. Working with open source has not particularly been a dream of mine since I started my journey in data science in 2020, but along the way, I came to learn that a lot of open source programs and companies truly help change the world. And I really wanted to be a part of that. Hence, I applied for the Outreachy program. When picking the programs to contribute to after being picked as an applicant, Ersilia was the one program that stood out to me. A company that creates AI/ML models for biomedical research. Sign me up! Given my background—a bachelor's degree in microbiology—and my history of data science, I believe I would be a true asset to the Ersilia team. My current skills include proficiency in Python, R, and Conda. While I haven't had much experience with Docker, I have been involved with a few side projects that have utilized the platform.

Joining the Ersilia community gives me an avenue to join a meaningful program that aims to bridge gaps that should not exist. A particular goal that aligns with mine is Ersilia's support of research on infectious and neglected diseases in low-income countries. Being from a low-income country myself, I have seen the effects of infectious disease in a community, and a company that makes that its goal is one I will be delighted to work with.

Being picked as an applicant and eventually as an intern provides me with an opportunity to contribute to a community and workspace that prioritizes growth and provides easy and open access to AI/ML models and research. My time as an intern would be spent growing and learning, building experience with Python, Docker, and Conda, and contributing to a team that seeks to provide medical solutions worldwide. I would be fully immersed in an AI/ML project while collaborating with minds worldwide to seek a solution to a problem. Post-internship, I hope to come out the other side with more well-rounded knowledge in AI and ML, adding to the Ersilia team as a whole and contributing more to open-source programs. Furthermore, I am eager to gain hands-on experience in implementing AI and ML algorithms and techniques and to understand how they can be applied in the healthcare industry. This internship would also provide me with the opportunity to enhance my problem-solving skills and learn from experienced professionals in the field, ultimately preparing me for a successful career in AI/ML research and development.

Kadeniyi23 commented 11 months ago

Fifth Task: Submit your first contribution to the Outreachy site

I have submitted my initial contribution to the Outreachy website

Kadeniyi23 commented 10 months ago

Week 2

First Task: Selecting a model from the suggested list.

After going through the suggested models, I selected STOUT (SMILES to IUPAC). I chose it after reading the publication here.

I selected the model for two major reasons:

Kadeniyi23 commented 10 months ago

Second Task: Installing the model in system.

Step 1: Following the instructions on the GitHub page, I downloaded Miniconda3 on my Linux system with the command wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Step 2: To install Miniconda, I ran the following command: bash Miniconda3-latest-Linux-x86_64.sh

Step 3: To activate Miniconda and check its version, I ran source ~/.bashrc followed by conda --version

Step 4: Installing STOUT

Installation Error ❌

When I ran the command conda install -c decimer stout-pypi, it produced an error. The error log is installation_error.log


Output in format: Requested package -> Available versions

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.36=0
  - feature:|@/linux-64::__glibc==2.36=0

Your installed version is: 2.3'
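The log refers to glibc, the system C library. A small sketch for checking which version Python itself reports, so it can be compared with the value shown in the log. The helper names are hypothetical, and the versions in the comparison are only illustrative.

```python
import platform


def version_tuple(text):
    """Turn a dotted version string such as '2.36' into (2, 36)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def is_at_least(installed, minimum):
    """Compare two dotted version strings numerically, not as text."""
    return version_tuple(installed) >= version_tuple(minimum)


# platform.libc_ver() returns a (library, version) pair such as
# ('glibc', '2.35'); both fields can be empty on non-glibc systems.
lib, version = platform.libc_ver()
print(f"C library: {lib or 'unknown'} {version or 'unknown'}")
if version:
    print("at least 2.17:", is_at_least(version, "2.17"))
```

Comparing as integer tuples matters because a plain string comparison would rank "2.9" above "2.36".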

Trying Another Method

Using the GitHub repository directly, I attempted to install the package with the following command:

pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git

STOUT-pip was successfully installed

Step 5: Simple usage. After saving the example to a Python file and running it on the WSL command line, I encountered an error:

OSError: [Errno 0] JVM DLL not found: Define/path/or/set/JAVA_HOME/variable/properly

image

Following this error, I installed a default version of Java using the code

sudo apt update
sudo apt install openjdk-11-jre

image

After that, I set the JAVA_HOME variable using the command below:

export JAVA_HOME=/usr/bin

Then I ran the python file to get the desired output. image
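JAVA_HOME is conventionally expected to point at the Java installation directory (the one that contains bin/java) rather than at /usr/bin. A hedged sketch for deriving a candidate path from the java executable found on PATH; the helper name is hypothetical, and the assumed layout (a path ending in bin/java) may not hold on every system.

```python
import os
import shutil


def java_home_from_executable(java_path):
    """Derive an installation directory from a path ending in bin/java.

    Symlinks are resolved first, since /usr/bin/java is often a chain of
    links into the real installation directory.
    """
    real_path = os.path.realpath(java_path)
    bin_dir = os.path.dirname(real_path)
    return os.path.dirname(bin_dir)


java_path = shutil.which("java")
if java_path is None:
    print("java was not found on PATH")
else:
    print("Candidate JAVA_HOME:", java_home_from_executable(java_path))
```

The printed path can then be used in the export command in place of /usr/bin.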

DhanshreeA commented 10 months ago

Good job! @Kadeniyi23 I have a small question, why did you need to install conda again? Did you not have conda on your system from having installed ersilia before?

Kadeniyi23 commented 10 months ago

Thank you for the feedback, @DhanshreeA. I changed my approach, shifting from the Windows Subsystem for Linux (WSL) command-line interface (CLI) to Visual Studio Code. After integrating Visual Studio Code with WSL, I reinstalled Miniconda to ensure it worked properly.

Kadeniyi23 commented 10 months ago

Third Task: Run predictions for the EML

To run predictions for the EML, I first attempted to run the following code using the STOUT model. It yielded a Java implementation error.

import csv
from STOUT import translate_forward

# Define a function to translate SMILES to IUPAC name
def smiles_to_iupac(smiles):
    try:
        iupac_name = translate_forward(smiles)
        return iupac_name
    except Exception as e:
        return str(e)  # Return an error message if translation fails

# Path to the input CSV file
input_csv_path = '/root/miniconda3/envs/eos4f95/bin/eml_canonical.csv'

# Path to the output CSV file
output_csv_path = '/root/miniconda3/envs/eos4f95/bin/translated_results.csv'

# Open the input CSV file for reading and the output CSV file for writing
with open(input_csv_path, 'r') as input_csvfile, open(output_csv_path, 'w', newline='') as output_csvfile:
    csvreader = csv.reader(input_csvfile)

    # Skip the header row if it exists
    header = next(csvreader, None)

    # Create a CSV writer for the output file
    csvwriter = csv.writer(output_csvfile)

    # Write the header to the output CSV file
    if header:
        csvwriter.writerow(header + ["IUPAC Name"])  # Add a new column header

    # Iterate through each row of the input CSV file
    for row in csvreader:
        # Assuming the SMILES strings are in the second column (index 1)
        smiles = row[1]

        # Translate the SMILES to IUPAC name
        iupac_name = smiles_to_iupac(smiles)

        # Write the row to the output CSV file, including the new IUPAC name
        csvwriter.writerow(row + [iupac_name])

print(f"Results have been written to {output_csv_path}.")

The error is detailed here hs_err_pid16667.log

I made various attempts to debug, including searching for the error on Stack Overflow and asking for help in the Slack group.

leilayesufu commented 10 months ago

https://github.com/ersilia-os/ersilia/issues/823#issuecomment-1751671814

Try importing translate_reverse too

Kadeniyi23 commented 10 months ago

Thank you. The Python file you shared also gave the same error. I tried it in a couple of ways:

All shared the same error 😞

leilayesufu commented 10 months ago

https://github.com/ersilia-os/ersilia/issues/823#issuecomment-1751694319

But you could run predictions earlier with it, when testing? I think it might have been an issue with the jdk you installed

Kadeniyi23 commented 10 months ago

When I looked into it, I saw that the fix involved downloading an earlier version of Java (13.0), as JPype was only tested with versions 1-13.0. The version I used was 17.1, and installing an earlier version was not recommended for production on the Java website. I figured this was because every update comes with a lot of bug fixes.
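A quick way to confirm which Java version is actually on PATH before debugging further. This is only a sketch: the parsing helper is hypothetical and handles just the common version "X.Y.Z" banner printed by java -version.

```python
import re
import shutil
import subprocess


def parse_java_major(banner):
    """Extract the major version from `java -version` output.

    Handles the legacy scheme ("1.8.0_292" means Java 8) and the current
    one ("17.0.1" means Java 17). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    if first == 1 and second is not None:
        return int(second)
    return first


if shutil.which("java"):
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("Java major version:", parse_java_major(result.stderr))
else:
    print("java was not found on PATH")
```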

Kadeniyi23 commented 10 months ago

Week 2

First Task: Selecting a model from the suggested list.

After many tries to debug the STOUT (SMILES to IUPAC) model I picked, I have made the decision to switch to the NCATS Rat Liver Microsomal Stability model. According to the documentation, NCATS-ADME contains several models that would be valuable to pharmacy and pharmacology as a whole. The different models have different capabilities. One example is the RLM Stability model, which helps in predicting the stability of a compound. This would enable researchers to estimate the potential stability and lifespan of a compound in the body. Another example is the PAMPA pH 7.4 model, which gauges the permeability of drugs across cellular membranes. With this, researchers are able to predict the likelihood of a drug being easily absorbed in the body.

But the main reason I chose this model is that it comprises more than one AI/ML model, which gives me a front-row look at the different machine learning models implemented. In the PAMPA pH 7.4 model, Chemprop, a model built by MIT, is used.

Kadeniyi23 commented 10 months ago

Second Task: Installing the model in system.

I followed these steps to install the NCATS Rat Liver Microsomal Stability model on my system.

Kadeniyi23 commented 10 months ago

Third Task : Run predictions for the EML

I downloaded the Essential Medicines List file from here and extracted its second column with the following script:

import csv

# Input CSV file name
input_file = 'eml_canonical.csv'

# Output CSV file name
output_file = 'SMILES.csv'

# Function to extract the second column from the input CSV and save it to the output CSV
def extract_second_column(input_file, output_file):
    try:
        with open(input_file, 'r', newline='') as infile, open(output_file, 'w', newline='') as outfile:
            # Create CSV reader and writer objects
            reader = csv.reader(infile)
            writer = csv.writer(outfile)

            for row in reader:
                if len(row) >= 2:  # Check if the row has at least two columns
                    second_column = row[1]  # Index 1 is the second column (0-based index)
                    writer.writerow([second_column])

        print(f"Second column extracted from '{input_file}' and saved to '{output_file}'.")

    except FileNotFoundError:
        print(f"File '{input_file}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Call the function to extract the second column
extract_second_column(input_file, output_file)


Kadeniyi23 commented 10 months ago

Task 4: Run predictions for the EML

After starting the app on my system, I opened it in Chrome here and ran the CSV file with the SMILES notation through it. I got the following results:

RLM (Rat Liver Microsomal Stability) - ADME_Predictions_2023-10-11-132525.csv

Pion's patented µSOL assay (Solubility) - ADME_Predictions_2023-10-11-132606.csv

Parallel artificial membrane permeability assay (PAMPA) (Assay pH=7.4) - ADME_Predictions_2023-10-11-132710.csv

Parallel artificial membrane permeability assay (PAMPA) (Assay pH=5.0) - ADME_Predictions_2023-10-11-132657.csv

Human Liver Cytosolic Stability - ADME_Predictions_2023-10-11-132911.csv

DhanshreeA commented 10 months ago

Hi @Kadeniyi23 It is unfortunate that JRE kept giving you issues while trying to run STOUT, and it is good that you could get NCATS to run on your system. As a bonus task, could you try and get the NCATS model to run not as a server but as a simple python script? Let me know if you need any clarifications.

Kadeniyi23 commented 10 months ago

Hi @DhanshreeA. Thank you for your feedback. I need some further clarification. Do you mean running the app.py Python script independently, in a separate environment created just to run the model?

Kadeniyi23 commented 10 months ago

Task 4: Compare results with the Ersilia Model Hub implementation!

To compare the results from the Parallel artificial membrane permeability assay (PAMPA) (Assay pH=7.4) in the CSV file here with the model implemented in the Ersilia Model Hub:

Comparing the two models.

Parallel artificial membrane permeability is an in vitro surrogate for determining the permeability of drugs across cellular membranes. The parallel artificial membrane permeability assay (PAMPA) measures how easily substances pass through synthetic membranes that mimic the lining of the human gastrointestinal tract. The original model provided by NCATS-ADME predicts whether a compound has low or high permeability. If the predicted class is '1', the compound is predicted to have 'low or moderate permeability' (i.e., log Peff < 2.0), and if the predicted class is '0', the compound is predicted to have 'high permeability' (i.e., log Peff > 2.5). In the interpretation of the eos9tyg model given here, the output is a float denoting the probability of the compound being poorly permeable. The higher the number, the more likely the compound is poorly permeable.

Taking the first ten values and comparing the two predictions:

| Compound | Ersilia model eos9tyg prediction | Permeability | NCATS model | Permeability |
| --- | --- | --- | --- | --- |
| Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 | 1 | poor permeability | 1 (0.9) | low permeability |
| C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | 0.156 | medium to high permeability | 0 (1.0) | moderate or high permeability |
| CC(=O)Nc1sc(nn1)S(=O) | 1 | poor permeability | 1 (0.97); 1 (1.0) | low permeability; low permeability |
| CC(O)=O | 1 | poor permeability | 1 (0.99) | low permeability |
| CC(=O)NC@@HC(O)=O | 1 | poor permeability | 0 (0.96) | moderate or high permeability |
| CC(=O)Oc1ccccc1C(O)=O | 1 | poor permeability | 1 (0.99) | low permeability |
| NC1=NC(=O)c2ncn(COCCO)c2N1 | 1 | poor permeability | 1 (0.99) | low permeability |
| OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | 0.034 | medium to high permeability | 0 (0.99) | moderate or high permeability |
| CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 | 0.248 | medium to high permeability | 0 (0.99) | moderate or high permeability |
| CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 | 0.268 | medium to high permeability | 0 (0.99) | moderate or high permeability |

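The two implementations agreed on nine of the ten compounds. The exception is the fifth compound listed, which eos9tyg scores as poorly permeable while the NCATS model predicts class 0 (moderate or high permeability).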
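The comparison above can be reproduced with a short script. This is a sketch under one loud assumption: the eos9tyg probability is turned into a class with a 0.5 cutoff, which neither model's documentation specifies. The values are copied from the ten rows listed above, and the function names are hypothetical.

```python
# eos9tyg outputs (probability of poor permeability), in row order.
ersilia_scores = [1, 0.156, 1, 1, 1, 1, 1, 0.034, 0.248, 0.268]
# NCATS predicted classes (1 = low, 0 = moderate or high), in row order.
ncats_classes = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0]

THRESHOLD = 0.5  # assumption: cutoff chosen for illustration only


def to_class(score, threshold=THRESHOLD):
    """Map a probability of poor permeability to class 1 or 0."""
    return 1 if score >= threshold else 0


def compare(scores, classes, threshold=THRESHOLD):
    """Return the agreement fraction and the row indices that disagree."""
    predicted = [to_class(score, threshold) for score in scores]
    mismatches = [
        index
        for index, (ours, theirs) in enumerate(zip(predicted, classes))
        if ours != theirs
    ]
    agreement = (len(classes) - len(mismatches)) / len(classes)
    return agreement, mismatches


agreement, mismatches = compare(ersilia_scores, ncats_classes)
print(f"Agreement: {agreement:.0%}")
print("Disagreeing rows (0-based):", mismatches)
```

Since both columns are model predictions, the figure is agreement between two implementations rather than accuracy against measured values.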

Kadeniyi23 commented 10 months ago

Task 5: Install and run Docker!

I was able to install and run Docker. I was also able to run the model eos3b5e from Docker Desktop 👍

Kadeniyi23 commented 10 months ago

Week 3: Suggest a new model and document it (1)

CalcAMP Model

Model Name

CalcAMP

Description

In this model, the authors seek to predict the activity of antimicrobial peptides. Antimicrobial peptides (AMPs) can be quite effective in fighting multi-drug resistance worldwide. Finding effective and potent AMPs is an arduous process, so a machine learning model that can accurately predict whether a peptide possesses antimicrobial properties would be useful and time-saving. The machine learning model predicts the antimicrobial activity of peptides by analyzing various features, including general physicochemical properties and sequence composition.
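To make "sequence composition" and "physicochemical properties" concrete, here is a minimal sketch of two such features computed in plain Python. It is not CalcAMP's actual feature set; the function names, the crude charge rule, and the example peptide are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(peptide):
    """Fraction of each standard amino acid in the sequence."""
    seq = peptide.upper()
    if not seq:
        raise ValueError("peptide sequence is empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def approximate_net_charge(peptide):
    """Crude net charge: +1 per K or R, -1 per D or E.

    Ignores histidine, the termini, and pH, so it is only a rough proxy.
    """
    seq = peptide.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


example = "KKLLKWLKKLL"  # hypothetical peptide, not taken from the dataset
features = composition(example)
print("Fraction of K:", round(features["K"], 3))
print("Approximate net charge:", approximate_net_charge(example))
```

Feature vectors like these are what a classifier would take as input in place of the raw sequence.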

Publication details

Model Overview

The dataset of peptides was collated from publicly available data in five different databases. Different ML algorithms were compared using the package PyCaret 2.3.6 to develop a classification model distinguishing AMPs from non-AMPs. Additionally, a multi-layer perceptron model created with Scikit-Learn 0.23.2 was used for the comparison. The final models created for the retained algorithms are LightGBM, XGBoost, CatBoost, Random Forest (RF), and Extra Trees (ET) classifiers.

Relevance to Ersilia

Given Ersilia's goal of democratising access to AI/ML models for biomedical research, the CalcAMP model, which predicts the antimicrobial activity of peptides, is a boon for research on multi-drug resistance. It enables us to assess the qualities of different AMPs, as well as detect which ones would be active against a plethora of Gram-positive and Gram-negative bacteria.

Model Implementation

The link to the model: CalcAMP. Although the model has not been published and released yet, an example is given in Simple prediction.ipynb, where a sample prediction is shown. The different models are also saved in the models folder. The dataset used is linked here.

Kadeniyi23 commented 10 months ago

Week 3: Suggest a new model and document it (2)

AquaPred Model

Model Name

AquaPred

Description

This model seeks to accurately predict the molecular solubility of compounds using an attention-based graph neural network. This machine learning model plays a significant role in predicting the aqueous solubility of compounds in drug discovery. During drug discovery, active pharmaceutical ingredients (APIs) are key to high drug efficacy. With this model, the authors aim to predict the aqueous solubility of compounds, which is a key physicochemical attribute required for API characterization.

Publication Details

Model Overview

The model uses the dataset contained here, as compiled by separate research referenced here. The data was fitted to four different graph neural networks, namely SGConv, GIN, GAT, and AttentiveFP, to identify the most effective model for predicting solubility. The study shows that AttentiveFP was the best model; it uses SMILES as the input for molecular representation and captures both intermolecular and intramolecular properties through information propagation and gated recurrent units (GRU).

Relevance to Ersilia

In-silico prediction of water solubility could alternatively lead to higher efficacy for drugs while speeding up drug development timeline. One of Ersilia's goals is to support research in many Low and Middle Income countries. With machine learning models like this, we get to bypass weeks or maybe months of research, tapping into the power of Artificial Intelligence to accelerate the drug discovery process.

Model Implementation

The code for the model can be found here. No recent releases have been published, but the code looks ready to go. Having assessed the AttentiveFP model here, found in the models folder, I think further testing could be done to scan for bugs.

Kadeniyi23 commented 10 months ago

Week 3: Suggest a new model and document it (3)

PrankWeb 3

Model Name

P2Rank

Description

This model seeks to predict the ligand binding sites (LBSs) of proteins. Identifying these sites and the interactions that ensue is needed for elucidating the molecular mechanisms of enzymes, understanding the regulation of protein oligomerization, or designing new drugs in cases where drug resistance has occurred, and it can be a time-consuming process when performed experimentally. With this model, a protein's ligand binding sites are predicted from its 3-dimensional structure. The model comprises not only the CLI app (P2Rank) but also a web app, PrankWeb3. PrankWeb accepts a protein structure as its input, computes evolutionary conservation, and predicts binding sites, which are then mapped onto the structure and can be viewed.

Publication Details

Model Overview

The model has two implementations: the CLI app, P2Rank, and the web app, PrankWeb3. P2Rank uses not only machine-learning-based knowledge but also a combination of geometric, energetic, and evolution-based knowledge, a combination also seen in the experimental method used for ligand binding site prediction of proteins. It applies different characteristics (the protein's structure, physico-chemical properties, and evolutionary information) to a mesh and then constructs a machine learning model using this representation. The ML model is then used to identify points on the protein's surface that can potentially bind to ligands, and the identified points are grouped into a list of surface patches that correspond to the predicted ligand binding sites (LBSs).
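The grouping step described above can be pictured as clustering the high-scoring surface points by proximity. The sketch below is not P2Rank's implementation: it assumes points are (x, y, z, score) tuples, and the score threshold, distance cutoff, and ranking by summed score are hypothetical choices made only to illustrate the idea.

```python
import math


def cluster_points(points, score_threshold=0.5, distance_cutoff=3.0):
    """Group high-scoring points into patches by single-linkage clustering.

    Two kept points share a patch when a chain of kept points connects
    them and every hop is no longer than distance_cutoff.
    """
    kept = [p for p in points if p[3] >= score_threshold]
    unvisited = set(range(len(kept)))
    patches = []
    while unvisited:
        seed = unvisited.pop()
        patch, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect neighbours first, then remove them from the set.
            near = [
                j for j in unvisited
                if math.dist(kept[current][:3], kept[j][:3]) <= distance_cutoff
            ]
            for j in near:
                unvisited.remove(j)
                patch.append(j)
                frontier.append(j)
        patches.append([kept[i] for i in sorted(patch)])
    # Rank patches by summed score, highest first (an assumed ranking rule).
    patches.sort(key=lambda patch: sum(p[3] for p in patch), reverse=True)
    return patches


surface = [(0, 0, 0, 0.9), (1, 0, 0, 0.8), (10, 0, 0, 0.7), (0, 1, 0, 0.1)]
for rank, patch in enumerate(cluster_points(surface), start=1):
    print(f"Patch {rank}: {len(patch)} point(s)")
```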

Relevance to Ersilia

One of the core reasons for implementing this model is the design of new drugs when drug resistance suddenly arises. In low- and middle-income countries, where drug-resistant strains may arise, rapid drug design and production may save millions of lives.

Model Implementation

According to the installation steps, the requirements for P2Rank are Java and PyMOL, which is used to view visualizations. It is recommended to run it in bash, as the model is a command-line program. The model looks implementable, with the link to the code found here. No installation is required, as the package is downloaded from GitHub releases. The latest version (version 2.4.1) can be downloaded as a compressed file. With various commands, the input is entered as a PDB file and predicted values will be generated as follows:

The web app is available here. This system can be implemented in three modes

Kadeniyi23 commented 10 months ago

Side Task: Running the NCATS model as a single python script

Hi @Kadeniyi23 It is unfortunate that JRE kept giving you issues while trying to run STOUT, and it is good that you could get NCATS to run on your system. As a bonus task, could you try and get the NCATS model to run not as a server but as a simple python script? Let me know if you need any clarifications.

DhanshreeA commented 10 months ago

Hi @Kadeniyi23 many thanks for the updates and sincerest apologies for responding late. Please look at this comment for further clarification. https://github.com/ersilia-os/ersilia/issues/849#issuecomment-1768229150 Also it is a bonus task, please don't feel pressured.

GemmaTuron commented 10 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!