ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Nalita Hinds #831

Closed NaliH closed 8 months ago

NaliH commented 9 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

carcablop commented 9 months ago

Hello @NaliH, welcome to Ersilia. Be sure to complete the installation steps, run a model, and report your progress here. The complete guide can be found here: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/installation Thanks.

NaliH commented 9 months ago

Hello, @carcablop! I have successfully installed Ersilia Model Hub and all prerequisites within WSL. The installation process went smoothly - I was able to successfully test the eos3b5e model.

NaliH commented 9 months ago

Motivation Statement

My name is Nalita Hinds, and I am a recent computer science graduate with a keen interest in AI/ML.

I decided to join Outreachy due to:

When I was first notified about being selected for the current contribution period, I saw that there was a long list of available projects – it was honestly a bit overwhelming to see them all! However, when I came across Ersilia, I immediately knew that I wanted to be a part of this project. This was solidified upon further reading of what Ersilia stands to achieve - supporting science, research, and medicine in low-income countries.

On top of that, I am highly interested in AI/ML and its applications in healthcare and medicine. From surgery, to diagnostics, to drug discovery, to administrative workflows - there are so many applications for AI/ML! As multi-faceted as AI/ML is, I believe that much of its potential in the field of medicine remains untapped, and that the applications that already exist should be cultivated. I aspire to become an ML engineer with a focus on healthcare, and contributing to Ersilia is a huge step in that direction for me.

I am extremely excited to be able to learn more about Ersilia, AI/ML, and contribute to this project – to make a difference.

DhanshreeA commented 9 months ago

Thank you for the updates, @NaliH. If you'd like, you can get started with the tasks from week 2.

HellenNamulinda commented 8 months ago

Hello @NaliH, you have yet to start on the week 2 tasks. Is there any way we can support you?

NaliH commented 8 months ago

Hello, @HellenNamulinda! I am currently in the process of completing the week 2 tasks. I will definitely reach out if I encounter any problems! I decided to work with the STOUT (SMILES to IUPAC) model and I'm setting up my environment now to be able to run predictions.

NaliH commented 8 months ago

Week 2

Task 1: Select a model

The model that I decided to work with is the STOUT (SMILES to IUPAC) model.


Why I chose this model

My reasons for choosing the STOUT model are:

NaliH commented 8 months ago

Week 2

Task 2: Installing the model

While the STOUT repo details how to install it, I still ran into some issues with WSL and installing STOUT.


Issues with WSL

I encountered an issue where WSL would not start. Whenever I entered the wsl command into my terminal, I'd get a message that my WSL instance had been terminated. I attempted to:

How I fixed it

To resolve this issue I had to (in order):

  1. Run PowerShell as admin
  2. Run the wsl --update command
  3. Run the wsl --shutdown command
  4. Run the wsl command

This updated and force restarted WSL, which was a simple fix!


Issues installing STOUT

I created a conda environment to install STOUT along with some other tools to run my predictions, such as Jupyter Lab. However, when attempting to use the conda command (conda install -c decimer stout-pypi) to install STOUT, it would always fail.

Image of the error received (the exact conflicts varied based on what I tried):

image

I tried to get it to work via conda by:

How I fixed it

Eventually, I used pip to install the STOUT package. It installed this way without any issues.

DhanshreeA commented 8 months ago

Hi @NaliH, thank you for the updates. Let us know how it goes with trying out the STOUT model with the EML file, and then testing Ersilia's implementation on the same file. Let us know here if you get stuck and need help.

NaliH commented 8 months ago

Week 2

Task 3: Run predictions


Issue encountered:

When I first attempted to import STOUT into my notebook, I received an error regarding the JVM.

How I fixed it

After some troubleshooting, I realized that I needed to install Java to successfully run the model and its functions.
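A quick pre-flight check can catch this kind of problem before the import fails. The sketch below is a minimal example, assuming that a `java` executable on the PATH is what the model needs and that the package is importable under the name `STOUT`. `check_environment` is a hypothetical helper written for illustration; it is not part of STOUT.

```python
import importlib.util
import shutil


def check_environment(module_name="STOUT", executable="java"):
    """Report whether a Python module and a command-line tool can be found.

    Both default names are assumptions for this setup; adjust as needed.
    """
    return {
        # find_spec returns None when a top-level module cannot be located
        "module_found": importlib.util.find_spec(module_name) is not None,
        # shutil.which returns None when the executable is not on the PATH
        "executable_found": shutil.which(executable) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `executable_found` comes back as `False`, installing Java (or adding it to the PATH) is the first thing to try.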


Running the predictions

Tools used: • WSL • Conda environment • STOUT • Jupyter Lab • Pandas

I created a Jupyter Notebook for using STOUT with the EML data. The full notebook is available here.

In order to run the model on the EML data, I did the following:

Imported STOUT and Pandas to perform the translations

image


Saved the EML data as a CSV file and imported it into my notebook with Pandas as a DataFrame.

image


Created a function that runs the translation on each DataFrame cell containing a SMILES label and saves the result in a new column for the IUPAC names.

image


The DataFrame, now including the translated names, was then saved to another CSV file.

image


I then read this data back into a DataFrame with Pandas and created a function for quick retrieval of the IUPAC names, rather than running STOUT repeatedly.

image

image
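The steps above can be sketched roughly as follows. To keep the example free of dependencies, it uses Python's standard `csv` module in place of Pandas, and an in-memory buffer in place of a file on disk. The column names `smiles` and `iupac` are assumptions, and `fake_translate` is a hypothetical stand-in for the real model call, so this shows the shape of the workflow rather than the notebook's actual code.

```python
import csv
import io


def add_iupac_column(rows, translate, smiles_key="smiles", iupac_key="iupac"):
    """Return new row dicts with a translated name added to each row."""
    return [{**row, iupac_key: translate(row[smiles_key])} for row in rows]


def write_rows(rows, handle):
    """Write row dicts as CSV to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def build_lookup(handle, smiles_key="smiles", iupac_key="iupac"):
    """Read the stored CSV back into a SMILES -> IUPAC dictionary."""
    return {row[smiles_key]: row[iupac_key] for row in csv.DictReader(handle)}


def fake_translate(smiles):
    """Hypothetical stand-in; swap in the real model call here."""
    return f"name-of-{smiles}"


if __name__ == "__main__":
    rows = [{"smiles": "CCO"}, {"smiles": "CC(=O)O"}]
    translated = add_iupac_column(rows, fake_translate)

    buffer = io.StringIO()  # stands in for a CSV file on disk
    write_rows(translated, buffer)
    buffer.seek(0)

    lookup = build_lookup(buffer)
    print(lookup["CCO"])
```

Storing the results once and reading them back means each SMILES label only has to pass through the model a single time.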


Additionally, out of curiosity, I tested the runtime of using STOUT vs using the stored data. Running the model on all of the data took longer than I expected, so I was glad that I only needed to run STOUT once on a given string and then store it for later retrieval.
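A comparison like this can be made with a small timing helper. The sketch below is illustrative only: `slow_translate` is a hypothetical stand-in that simulates a slow model call with a short sleep, and the stored data is a plain dictionary rather than a DataFrame, so the numbers it prints say nothing about STOUT's real runtime.

```python
import time


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def slow_translate(smiles):
    """Hypothetical stand-in that simulates a slow model call."""
    time.sleep(0.05)
    return f"name-of-{smiles}"


if __name__ == "__main__":
    smiles = "CCO"
    stored = {smiles: slow_translate(smiles)}  # translate once, then keep the result

    _, model_seconds = time_call(slow_translate, smiles)
    _, lookup_seconds = time_call(stored.get, smiles)
    print(f"model call: {model_seconds:.4f} s, stored lookup: {lookup_seconds:.6f} s")
```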

image

The STOUT method translate_forward vs retrieving from the DataFrame

NaliH commented 8 months ago

Week 2

Task 4: Understand Ersilia's backend | Compare results with the Ersilia models


Issue encountered:

I had difficulty getting the Ersilia model to consistently perform predictions. With each method I used, I ended up receiving null for various predictions.

Methods used to use the model:

As a Python Package

Imports and functions used:

I imported Ersilia into my program and was able to successfully fetch and serve the model. I initially ran the model on a single SMILES label and this ran without issue. I increased the amount of data to 5 labels and this also ran without issue. Seeing this, I attempted the full dataset.

However, when I ran the model on the full EML data (all 400+ entries), I received null as the output for every prediction after 7 hours.

Via Shell Commands

Commands used:

I attempted to fetch and serve the Ersilia model in two ways - from GitHub and from Docker. I used the slug each time. Fetching and serving succeeded with both options. However, each had its own issues with running the model.

From Docker (pulled from container)

As mentioned, the model ran inconsistently. As with the Python package, I initially ran it with a single SMILES label and then increased the amount of input data. I was unable to get results with larger inputs. Curiously, after some time, I also began receiving errors on small inputs of 5 SMILES labels.

From GitHub

I received a connection refused error (Errno 111) when running the model. I attempted this method briefly before going back to using Docker.
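Errno 111 is the Linux code for ECONNREFUSED, which generally means that nothing was listening at the address the client tried to reach. A small probe like the one below can confirm whether a local service is accepting connections before a run is attempted. This is only a sketch: `is_port_open` is a hypothetical helper, and the host and port shown are placeholders, since the address of the served model is not given here.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts
        return False


if __name__ == "__main__":
    # Placeholder address; replace with the one reported when the model is served
    print(is_port_open("127.0.0.1", 8080))
```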


Running predictions

Tools used:

I created a Jupyter Notebook for using Ersilia with the EML data. The notebook is available here.

I was able to successfully run the model on 10 SMILES labels (broken into chunks of 5) and compared the results to the STOUT model.

In order to run the model on the EML data for comparisons, I did the following:

Imported Pandas to read the data and create DataFrames to perform the translations and comparisons

image

Saved the EML data as a CSV file and imported it into my notebook with Pandas as a DataFrame

image

Fetched and served Ersilia

image

Created sets of data to be run through the model

image

Ran the model on the sets of data

image

Created a DataFrame for comparisons and compared the prediction results

image

Created a function to retrieve the comparison results

image

Examples of comparing the predictions

image

image
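The batching approach described above (10 SMILES labels in chunks of 5) can be sketched as follows. `predict_batch` is a hypothetical stand-in for whichever call sends a list of inputs to the served model; it is not Ersilia's API. The sketch also records which inputs came back as null (`None` in Python), since that was the failure seen on larger runs.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_chunks(smiles_list, predict_batch, size=5):
    """Run predict_batch on each chunk; return (results, inputs that gave None)."""
    results, failed = {}, []
    for chunk in chunked(smiles_list, size):
        outputs = predict_batch(chunk)
        for smiles, output in zip(chunk, outputs):
            results[smiles] = output
            if output is None:
                failed.append(smiles)
    return results, failed


def fake_predict_batch(chunk):
    """Hypothetical stand-in: returns None for one input to mimic a null result."""
    return [None if s == "BAD" else f"name-of-{s}" for s in chunk]


if __name__ == "__main__":
    data = ["C", "CC", "CCC", "CCO", "BAD", "CCN", "CO"]
    results, failed = run_in_chunks(data, fake_predict_batch, size=5)
    print(len(results), failed)
```

Keeping a list of failed inputs makes it possible to retry only those, instead of rerunning the whole dataset.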


Results

From the amount of data compared (10 SMILES labels), the results were split down the middle - 5 results were similar and 5 were different.

image

For example, the SMILES label CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 had the following results:

STOUT: (E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide

Ersilia: (E)-N-[6-[[(3-chloro-4-fluorocyclohexa-1,4-dien-1-yl)amino]methylidene]-3-[(3S)-oxolan-3-yl]oxycyclopenta[d]pyrimidin-2-yl]-4-(dimethylamino)but-2-enamide

The full comparison results, with the SMILES labels, IUPAC names, and comparison status, are available here.
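How two names were judged "similar" is not spelled out here, so the sketch below makes its own assumption: names count as matching only when they are identical after trimming whitespace and lowercasing, and `difflib.SequenceMatcher` gives a rough similarity ratio for the rest. `compare_names` is a hypothetical helper, and the rule is illustrative rather than the one used in the notebook.

```python
from difflib import SequenceMatcher


def compare_names(name_a, name_b):
    """Return (status, ratio) for two IUPAC name strings.

    Assumption: 'similar' means identical after stripping and lowercasing.
    """
    a, b = name_a.strip().lower(), name_b.strip().lower()
    ratio = SequenceMatcher(None, a, b).ratio()
    return ("similar" if a == b else "different"), ratio


if __name__ == "__main__":
    stout = ("(E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]"
             "oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide")
    ersilia = ("(E)-N-[6-[[(3-chloro-4-fluorocyclohexa-1,4-dien-1-yl)amino]"
               "methylidene]-3-[(3S)-oxolan-3-yl]oxycyclopenta[d]pyrimidin-2-yl]"
               "-4-(dimethylamino)but-2-enamide")
    status, ratio = compare_names(stout, ersilia)
    print(status, round(ratio, 2))
```

A string ratio only measures how alike the text is. Two names with a high ratio can still describe different structures, as the example pair above shows.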

NaliH commented 8 months ago

Week 3

Task 1: First Model Suggestion


Model: elEmBERT - element Embeddings and Bidirectional Encoder Representations from Transformers

Paper is available here.

What the model does

The elEmBERT model is a neural network (NN) model that predicts chemical properties from structural information. The atomic positions of chemical compounds are taken as input and converted into tokens to be run through the model for chemical analysis.

Why is it relevant to Ersilia

The model is relevant to Ersilia because it can be used to predict chemical properties, which aids in drug discovery. It can be adapted and applied to various datasets, which makes it flexible. For instance, it has been benchmarked on datasets such as:

How would you implement it?

The code is available on GitHub here. The model is written in Python and uses TensorFlow. Pre-trained dictionaries are available for the model. Example notebooks on using the model are provided, as are the datasets used for the benchmarks.

NaliH commented 8 months ago

Week 3

Task 2: Second Model Suggestion


helmpy - individual-based simulation of helminth transmission in a human population

Paper is available here

What the model does

The helmpy model uses a stochastic individual-based approach to forecast the transmission and control of helminth infections in a human population. The model considers the control of soil-transmitted helminths by mass drug administration, but it can be applied to other helminth species.

Why is it relevant to Ersilia

According to WHO, soil-transmitted helminth infections (such as hookworm) are among the most common infections in the world. It's estimated that 1.5 billion people, or 24% of the world's population, have been affected. These infections are among the many neglected tropical diseases (NTDs) that are common in low-income tropical/subtropical regions. As part of Ersilia's mission to support research on NTDs, this model can serve as a means to help combat helminth infections.

How would you implement it

The code is available to be forked on GitHub here. The model provides interactive notebooks for running the code, with documentation on how to use the notebook/model.

NaliH commented 8 months ago

Week 3

Task 3: Third Model Suggestion


PaddleHelix

GitHub:

Paper is available here

Docs are available here

What the model does

The PaddleHelix model uses deep neural networks (such as GNNs) to search the chemical space for drug discovery. A scoring system is applied to a molecule based on properties such as bioactivity against a target protein, druggability, and synthetic accessibility. A generative method is then used on the score to predict similarities between the original and potential molecules.

Why is it relevant to Ersilia

The model is relevant to Ersilia because it uses an ML approach to drug discovery, vaccine design, and precision medicine. Drug discovery often entails lab experiments that can be expensive and time-consuming. According to WHO, the average cost to develop a new drug ranges from US$43.4 million to US$4.2 billion. As one of Ersilia's goals is to facilitate drug discovery in low- and middle-income countries, incorporating a model that aids the development process would be highly beneficial, as it would reduce some of the costs of drug discovery.

How would you implement it

The code is available on GitHub here. It is written primarily in Python, but requires some C++ for development. The model provides:

GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!