✍️ Contribution period: Nalita Hinds

NaliH commented 9 months ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[X] Install the Ersilia Model Hub and test the simplest model
[X] Write a motivation statement to work at Ersilia
[X] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[X] Select a model from the suggested list
[X] Install the model in your system
[X] Run predictions for the EML
[X] Compare results with the Ersilia Model Hub implementation!
[X] Install and run Docker!

Week 3 - Propose new models

[X] Suggest a new model and document it (1)
[X] Suggest a new model and document it (2)
[X] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[X] Submit the final application in the Outreachy website

carcablop commented 9 months ago

Hello @NaliH, welcome to Ersilia. Be sure to complete the installation steps, run a model, and report your progress here. The complete guide can be found here: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/installation Thanks.

NaliH commented 9 months ago

Hello, @carcablop! I have successfully installed Ersilia Model Hub and all prerequisites within WSL. The installation process went smoothly - I was able to successfully test the eos3b5e model.

NaliH commented 9 months ago

Motivation Statement

My name is Nalita Hinds, and I am a recent computer science graduate with a keen interest in AI/ML.

I decided to join Outreachy due to:

The organization’s mission - creating opportunities for those that are underrepresented in technology
Being able to meet new people and join a community
Desiring to advance my skills and push myself while learning new skills
Contributing and learning more about open source

When I was first notified about being selected for the current contribution period, I saw that there was a long list of available projects – it was honestly a bit overwhelming to see them all! However, when I came across Ersilia, I immediately knew that I wanted to be a part of this project. This was solidified upon further reading of what Ersilia stands to achieve - supporting science, research, and medicine in low-income countries.

On top of that, I am highly interested in AI/ML and its implementations in healthcare and medicine. From surgery, to diagnostics, to drug discovery, to administrative workflows - there are so many applications for AI/ML! As multi-faceted as AI/ML is, I believe that there’s so much potential that still remains untapped in the field of medicine, and the applications that already exist should be cultivated. I have aspirations to become an ML engineer with healthcare as a focus, and contributing to Ersilia is a huge step in that direction for me.

I am extremely excited to be able to learn more about Ersilia, AI/ML, and contribute to this project – to make a difference.

DhanshreeA commented 9 months ago

Thank you for the updates @NaliH. If you'd like, you can get started with the tasks from week 2.

HellenNamulinda commented 8 months ago

Hello @NaliH, You are yet to start on week 2 tasks. Is there any way we can support you?

NaliH commented 8 months ago

Hello, @HellenNamulinda I am currently in the process of completing the week 2 tasks. I will definitely reach out if I encounter any problems! I decided to work with the STOUT (SMILES to IUPAC) model and I'm setting up my environment now to be able to run predictions.

NaliH commented 8 months ago

Week 2

Task 1: Select a model

The model that I decided to work with is the STOUT (SMILES to IUPAC) model.

Why I chose this model

My reasons for choosing the STOUT model are:

More documentation is available than the other presented models
Being able to learn more about SMILES and IUPAC names via their cited research paper
Finding out about neural machine translation (NMT) and being able to learn more about it

NaliH commented 8 months ago

Week 2

Task 2: Installing the model

While the STOUT repo details how to install it, I still ran into some issues with WSL and installing STOUT.

Issues with WSL

I encountered an issue where WSL would not start. Whenever I entered the wsl command into my terminal, I'd get a message that my WSL instance had been terminated. I attempted to:

Restart my console
Launch the "Ubuntu on Windows" console
Use PowerShell to start WSL
Run the wsl.exe directly
Run wsl --shutdown

How I fixed it

To resolve this issue I had to (in order):

Run PowerShell as admin
Run the wsl --update command
Run the wsl --shutdown command
Run the wsl command

This updated and force restarted WSL, which was a simple fix!

Issues installing STOUT

I created a conda environment to install STOUT along with some other tools to run my predictions, such as Jupyter Lab. However, when attempting to use the conda command (conda install -c decimer stout-pypi) to install STOUT, it would always fail.

Image of the error received (the exact conflicts varied based on what I tried):

I tried to get it to work via conda by:

Restarting the console
Creating a new environment
Updating conda
Updating all conda packages
Googling the error and trying various things

How I fixed it

Eventually, I used pip to install the STOUT package. It installed this way without any issues.

DhanshreeA commented 8 months ago

Hi @NaliH, thank you for the updates. Let us know how it goes with trying out the STOUT model with the EML file, and then testing Ersilia's implementation on the same file. Let us know here if you get stuck and need help.

NaliH commented 8 months ago

Week 2

Task 3: Run predictions

Issue encountered:

Upon first attempting to import STOUT into my notebook, I received an error regarding JVM.

How I fixed it

After some troubleshooting, I realized that I needed to install Java to successfully run the model and its functions.
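A rough pre-check for this kind of JVM error can be sketched in Python. This is only an illustration: it assumes that a `java` executable on PATH is what the import needs, whereas the actual JVM lookup may rely on other mechanisms such as JAVA_HOME. The helper name `java_available` is hypothetical.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is found on PATH and runs."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        # `java -version` exits with code 0 on a working installation.
        result = subprocess.run(
            [java_path, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH.")
    else:
        print("Java not found on PATH; install Java before importing STOUT.")
```

Running this in the same conda environment as the notebook shows whether Java is visible from that environment.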

Running the predictions

Tools used:

WSL
Conda environment
STOUT
Jupyter Lab
Pandas

I created a Jupyter Notebook for using STOUT with the EML data. The full notebook is available here.

In order to run the model on the EML data, I did the following:

Imported STOUT and Pandas to perform the translations

Saved the EML data as a CSV file and imported it into my notebook with Pandas as a DataFrame.

Created a function to run the translations on the cells of the DataFrame that contain the SMILES labels and saved that into a cell in a column for the IUPAC names.

This data was then stored into another CSV with the newly translated data.

I again read this data into a DataFrame with Pandas. I created a function to assist in the quick retrieval of the IUPAC names, as opposed to repeatedly running STOUT.
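The steps above can be sketched roughly as follows. This is not the notebook's code: it uses the standard-library `csv` module as a stand-in for Pandas so that it runs without extra packages, and the column names `smiles` and `iupac`, the file paths, and the function names are all hypothetical. The translator is passed in as a function so the sketch does not depend on STOUT being installed.

```python
import csv


def translate_and_cache(in_path, out_path, translate,
                        smiles_col="smiles", iupac_col="iupac"):
    """Translate every SMILES string in in_path once and save the results."""
    with open(in_path, newline="") as src:
        rows = list(csv.DictReader(src))
    for row in rows:
        # One model call per row; the result is stored next to the input.
        row[iupac_col] = translate(row[smiles_col])
    fieldnames = list(rows[0].keys()) if rows else [smiles_col, iupac_col]
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


def load_cache(path, smiles_col="smiles", iupac_col="iupac"):
    """Read the saved translations back as a SMILES -> IUPAC name mapping."""
    with open(path, newline="") as src:
        return {row[smiles_col]: row[iupac_col] for row in csv.DictReader(src)}
```

With STOUT installed, `translate_forward` (assumed here to be importable with `from STOUT import translate_forward`) would be passed as the `translate` argument, and later lookups would go through the mapping returned by `load_cache` instead of calling the model again.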

Additionally, out of curiosity, I tested the runtime of using STOUT vs using the stored data. Running the model on all of the data took longer than I expected, so I was glad that I only needed to run STOUT once on a given string and then store it for later retrieval.

The STOUT method `translate_forward` vs retrieving from the DataFrame
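A runtime comparison of this kind can be sketched with `time.perf_counter`. The example below is an illustration under stated assumptions: `slow_translate` is a hypothetical stand-in that sleeps to imitate a model call, and a plain dictionary stands in for the stored DataFrame, so the printed numbers say nothing about STOUT's real runtime.

```python
import time


def time_call(func, *args, repeats=1):
    """Return (last_result, average_seconds) for calling func(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed / repeats


def slow_translate(smiles):
    # Stand-in for a model call; sleeps to imitate inference time.
    time.sleep(0.01)
    return "name-of-" + smiles


# Stand-in for the stored translations.
cache = {"CCO": "name-of-CCO"}

model_result, model_seconds = time_call(slow_translate, "CCO")
cached_result, cached_seconds = time_call(cache.get, "CCO", repeats=1000)
print(f"model: {model_seconds:.6f}s  cache: {cached_seconds:.6f}s")
```

Averaging the lookup over many repeats keeps the measurement from being dominated by timer resolution.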

NaliH commented 8 months ago

Week 2

Task 4: Understand Ersilia's backend | Compare results with the Ersilia models

Issue encountered:

I had difficulty getting the Ersilia model to consistently perform predictions. With each method I used, I ended up receiving null for various predictions.

Methods used to use the model:

Ersilia as a Python package
Ersilia via shell commands

As a Python Package

Imports and functions used:

from ersilia import ErsiliaModel
model = ErsiliaModel("smiles2iupac")
model.serve()
model.run(input=SMILES_Data, output="json")
model.close()

I imported Ersilia into my program and was able to successfully fetch and serve the model. I initially ran the model on a single SMILES label and this ran without issue. I increased the amount of data to 5 labels and this also ran without issue. Seeing this, I attempted the full dataset.

However, when I ran the model on the full EML data (all 400+ SMILES labels), it ran for 7 hours and then returned null as the output for every prediction.
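One way to narrow down where null outputs begin is to feed the model small batches and record which inputs come back empty. The sketch below assumes a hypothetical `run_batch` callable that takes a list of SMILES strings and returns one output per input, in order, with `None` for a failed prediction. How such a callable would wrap the Ersilia model and its output format is not shown here.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(smiles_list, run_batch, size=5):
    """Run run_batch on each chunk; return (results, failed_inputs)."""
    results = {}
    failed = []
    for batch in chunked(smiles_list, size):
        outputs = run_batch(batch)
        for smiles, output in zip(batch, outputs):
            if output is None:
                # Keep track of inputs that produced no prediction.
                failed.append(smiles)
            else:
                results[smiles] = output
    return results, failed
```

The `failed` list can then be retried separately or reported, rather than losing a long run to an all-null output.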

Via Shell Commands

Commands used:

ersilia fetch smiles2iupac
ersilia fetch smiles2iupac --from_github
ersilia serve smiles2iupac
ersilia run -i 'smile_data_here' -o 'save_file_here'
ersilia close

I attempted to fetch and serve the model in two ways - from GitHub and from Docker. I used the slug each time. I was able to do this successfully with both options. However, each had its own issues with running the model.

From Docker (pulled from container): As mentioned, the model ran inconsistently. As with using Ersilia as a Python package, I initially ran it with a single SMILES label and then increased the amount of input data. I was unable to get results with larger inputs. Curiously, after some time, I also started running into the issue on smaller inputs of 5 SMILES labels.

From GitHub: I received a connection refused error (Errno 111) when running the model. I attempted this method briefly before going back to using Docker.

Running predictions

Tools used:

WSL
Conda environment
Ersilia
Jupyter Lab
Pandas

I created a Jupyter Notebook for using Ersilia with the EML data. The notebook is available here.

I was able to successfully run the model on 10 SMILES labels (broken into chunks of 5) and compared the results to the STOUT model.

In order to run the model on the EML data for comparisons, I did the following:

Imported Pandas to read the data and create DataFrames to perform the translations and comparisons

Saved the EML data as a CSV file and imported it into my notebook with Pandas as a DataFrame

Fetched and served Ersilia

Created sets of data to be run through the model

Ran the model on the sets of data

Created a DataFrame for comparisons and compared the prediction results

Created a function to retrieve the comparison results
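The comparison step can be sketched as below. This is an illustration only: it uses plain dictionaries instead of DataFrames, the function names are hypothetical, and it treats two names as the same when they are identical after lowercasing and whitespace cleanup. The criterion used in the notebook is not stated, so that rule is an assumption.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.split()).lower() if name else ""


def compare_predictions(stout_names, ersilia_names):
    """Build one comparison row per SMILES string present in both mappings."""
    rows = []
    for smiles, stout_name in stout_names.items():
        if smiles not in ersilia_names:
            continue
        ersilia_name = ersilia_names[smiles]
        left = normalize(stout_name)
        right = normalize(ersilia_name)
        rows.append({
            "smiles": smiles,
            "stout": stout_name,
            "ersilia": ersilia_name,
            # A missing (null) prediction never counts as a match.
            "match": bool(left) and left == right,
        })
    return rows


def summarize(rows):
    """Count how many rows matched and how many did not."""
    matches = sum(row["match"] for row in rows)
    return {"same": matches, "different": len(rows) - matches}
```

A stricter or looser rule (for example, comparing structures instead of strings) would change the counts, so the summary is only as meaningful as the matching rule behind it.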

Examples of comparing the predictions

Results

From the amount of data compared (10 SMILES labels), the results were split down the middle - 5 results were similar and 5 were different.

For example, the SMILES label CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 had the following results:

STOUT: (E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide

Ersilia: (E)-N-[6-[[(3-chloro-4-fluorocyclohexa-1,4-dien-1-yl)amino]methylidene]-3-[(3S)-oxolan-3-yl]oxycyclopenta[d]pyrimidin-2-yl]-4-(dimethylamino)but-2-enamide

The full comparison results, with the SMILES labels, IUPAC names, and comparison status, are here.

NaliH commented 8 months ago

Week 3

Task 1: First Model Suggestion

Model: elEmBERT - element Embeddings and Bidirectional Encoder Representations from Transformers

Paper is available here.

What the model does

The elEmBERT model is a neural network (NN) model that predicts chemical properties using structural information. The atomic positions of chemical compounds are input and converted into tokens to be run through the model for chemical analysis.

Why is it relevant to Ersilia

The model is relevant to Ersilia because it can be used to predict chemical properties, which aids in drug discovery. The model can be adapted and applied to various datasets, which makes it flexible. For instance, it's been benchmarked on datasets such as:

BBBP: A dataset that contains annotated data for the ability of a chemical compound to penetrate the blood-brain barrier
Clintox: A dataset that provides the toxicity profile of chemical compounds
SIDER: A dataset that contains structured information on drug-associated side effects

How would you implement it?

The code is available on GitHub here. The model is written in Python and uses TensorFlow. Pre-trained dictionaries are available for the model. There are example notebooks provided on usage of the model. The datasets used for the benchmarks are also provided.

NaliH commented 8 months ago

Week 3

Task 2: Second Model Suggestion

helmpy - individual-based simulation of helminth transmission in a human population

Paper is available here

What the model does

The helmpy model uses a stochastic individual-based approach to forecast the transmission and control of helminth infections in infected humans. The model considers control of soil-transmitted helminths by mass drug administration, but it can be applied to other helminth species.

Why is it relevant to Ersilia

According to WHO, soil-transmitted helminth infections (such as hookworm) are among the most common infections in the world. It's estimated that 1.5 billion people, or 24% of the world's population, have been affected. These infections are among the many neglected tropical diseases (NTDs) that are common in low-income tropical/subtropical regions. As part of Ersilia's mission to support research on NTDs, this model can serve as a means to help combat helminth infections.

How would you implement it

The code is available to be forked on GitHub here. The model provides interactive notebooks for implementation of the code, with documentation on how to use the notebook/model.

NaliH commented 8 months ago

Week 3

Task 3: Third Model Suggestion

PaddleHelix

GitHub: Paper is available here Docs are available here

What the model does

The PaddleHelix model uses deep neural networks (such as GNNs) to search the chemical space for drug discovery. A scoring system is applied to a molecule based on its properties, such as bio-activity to a target protein, druggability, and synthetic accessibility. A generative method is then used on the score to predict similarities between the original and potential molecules.

Why is it relevant to Ersilia

The model is relevant to Ersilia because it uses an ML approach to drug discovery, vaccine design, and precision medicine. Drug discovery often entails lab experiments that can be expensive and time-consuming. According to WHO, the average cost to develop a new drug ranges from US$43.4 million to US$4.2 billion. As one of Ersilia's goals is to facilitate drug discovery in low- and middle-income countries, incorporating a model to aid in the development process would be highly beneficial, as it would reduce some of the costs of drug discovery.

How would you implement it

The code is available on GitHub here. It is written primarily in Python, but requires some C++ for development. The model provides:

An installation guide
A developer guide
Multiple tutorials
Interactive examples
Datasets

GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
