Closed AlphonseBrandon closed 10 months ago
I am enthusiastic to contribute as an Outreachy intern to Ersilia's mission: equipping laboratories in low- and middle-income countries with state-of-the-art AI/ML tools for infectious and neglected disease research.
Ersilia's goals will have a direct impact on my country, Cameroon, and my immediate community in Buea, where a pilot program is currently running at the Centre for Drug Discovery at the University of Buea (of which I am an alumnus and where I have a good relationship with the professors).
Additionally, Ersilia's open-source project is a unique opportunity for me to apply my data science and machine learning knowledge to solve real-world problems that will have a major impact in my community and the Global South (sub-Saharan Africa included).
Furthermore, I am impressed with the Ersilia team and their commitment to mentoring interns, a perfect environment for me to thrive.
Thank you for considering my application. I look forward to the possibility of joining the Ersilia team as an intern for the Outreachy winter 2023 round.
One of my first tasks was to join Ersilia's communication channels. This step allowed me to connect with the project's community, understand its values, and engage in discussions with other applicants in the contribution phase.
I opened an issue (this contribution-tracking template provided by the Ersilia team). This issue serves as a starting point for collaboration and discussion within the community, highlighting my commitment to actively contributing to the project's growth.
To better understand the platform and its functionality, I installed the Ersilia Model Hub locally. This hands-on experience allowed me to gain insights into the current system and identify potential areas for enhancement.
As part of my application to the Outreachy program with Ersilia, I wrote a motivation statement explaining why I am eager to work with the organization. This statement reflects my genuine passion for Ersilia's mission and my desire to contribute my skills to support its goals.
This step demonstrates my commitment to becoming an integral part of the Ersilia community and contributing to the success of the Ersilia Model Hub.
This was the icing on the cake for week 1: I got to learn from the Ersilia team about the impact of this project in my immediate community through the work going on at the University of Buea's Centre for Drug Discovery.
These tasks represent my initial efforts to become an active participant in the Ersilia open-source project. I am excited about the journey ahead and look forward to collaborating with the Ersilia team to enhance the Ersilia Model Hub, ultimately contributing to the advancement of AI/ML models for biomedical research and furthering Ersilia's mission.
I am grateful for the opportunity to be a part of this project, and I am excited to learn, grow, and make a positive impact within the Ersilia community.
This week looks promising, and I look forward to completing most of the contribution-phase tasks, pumped up by how interesting the project has been so far.
This week I have the following tasks on my to-do list:
I am definitely going to have a lot of fun doing these tasks; let's see how it goes.
I was tasked with selecting one of the four models listed in the Ersilia Book here.
I selected the STOUT (SMILES to IUPAC) model. GitHub Link
I also found the paper for the model, which I can use to explore further. Link to the paper
Transformers are common in large language models, an area in which I have developed a special interest over the last few months.
The next step is to install the model.
I followed the installation instructions in the model's GitHub repository.
🔴 I encountered errors installing from conda using the command:

```shell
conda install -c decimer stout-pypi
```

🟢 So I installed through pip instead:

```shell
pip install STOUT-pypi
```
To test whether the installation was successful, I ran this starter code from the model's GitHub repository:

```python
from STOUT import translate_forward, translate_reverse

# SMILES to IUPAC name translation
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of " + SMILES + " is: " + IUPAC_name)

# IUPAC name to SMILES translation
IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of " + IUPAC_name + " is: " + SMILES)
```

Output:

```
SMILES of 1,3,7-trimethylpurine-2,6-dione is: CN1C=NC2=C1C(=O)N(C)C(=O)N2C
```
In this task, I ran predictions for the Essential Medicines List (EML).
The code below reads the EML data from a CSV file, extracts the canonical SMILES strings, translates them to IUPAC names using the STOUT library, and then prints the IUPAC names to the console and writes them to a CSV file.
I put some effort into documenting the code with docstrings and using a modular style of programming, splitting the logic into small functions so that the code becomes self-explanatory.
```python
import csv

from STOUT import translate_forward


def read_eml_csv(file_path):
    """
    Reads a CSV file containing EML data and returns its contents as a list of lists.

    Args:
        file_path (str): The path to the CSV file to read.

    Returns:
        list: A list of lists containing the contents of the CSV file.
    """
    with open(file_path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        return list(reader)


def get_canonical_smiles_list(eml_list):
    """
    Extracts the canonical SMILES strings from a list of EML data.

    Args:
        eml_list (list): A list of lists containing EML data.

    Returns:
        list: A list of canonical SMILES strings.
    """
    # Column 3 holds the canonical SMILES; the slice skips the header row and,
    # for this first run, limits the sample to the first two molecules.
    return [row[2] for row in eml_list[1:3]]


def translate_smiles_to_iupac(can_smiles_list):
    """
    Translates a list of SMILES strings to IUPAC names using the STOUT model.

    Args:
        can_smiles_list (list): A list of canonical SMILES strings.

    Returns:
        list: A list of IUPAC names.
    """
    iupac_list = []
    for smiles in can_smiles_list:
        iupac = translate_forward(smiles)
        iupac_list.append(iupac)
    return iupac_list


def print_iupac_list(iupac_list):
    """
    Prints a list of IUPAC names to the console.

    Args:
        iupac_list (list): A list of IUPAC names.
    """
    for iupac in iupac_list:
        print(iupac)


def write_iupac_csv(file_path, iupac_list):
    """
    Writes a list of IUPAC names to a CSV file.

    Args:
        file_path (str): The path to the CSV file to write.
        iupac_list (list): A list of IUPAC names.
    """
    with open(file_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerows([[iupac] for iupac in iupac_list])


# Read the EML data from a CSV file
eml_list = read_eml_csv("../data/smiles/eml_canonical.csv")

# Extract the canonical SMILES strings from the EML data
can_smiles_list = get_canonical_smiles_list(eml_list)

# Translate the SMILES strings to IUPAC names
iupac_list = translate_smiles_to_iupac(can_smiles_list)

# Print the IUPAC names to the console
print_iupac_list(iupac_list)

# Write the IUPAC names to a CSV file
write_iupac_csv("../data/iupac/predicted_iupac.csv", iupac_list)
```
The output of the code is a list of IUPAC names printed to the console and written to a CSV file. Example:
13-dimethyl-17-pyridin-3-yl
Ersilia can run by downloading models from GitHub (using Git-LFS), from S3 buckets (Ersilia's AWS backend), or as Docker containers.
```shell
curl -fsSL https://get.docker.com/ | sh
docker run hello-world
```

```
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b901d36b6f2f: Pull complete
0a6ba66e537a: Pull complete
Digest: sha256:8be990ef2aeb16dbcb9271ddfe2610fa6658d13f6dfb8bc72074cc1ca36966a7
Status: Downloaded newer image for hello-world:latest

Hello from Docker.
This message shows that your installation appears to be working correctly.
```
- I proceeded to search for the STOUT model in [Ersilia's model hub](https://www.ersilia.io/model-hub)
- The STOUT model has the identifier `eos4se9` and the slug name `smiles2iupac`
- I then fetched and served the model using `ersilia fetch eos4se9` and `ersilia serve eos4se9` respectively
- Finally, I ran the model using `ersilia -v api run -i input.csv -o result3.csv`
🔴 On running the model from an input file, I encountered an error that read
`TypeError: object of type 'NoneType' has no len()`
🟢 On running the model with an input string using the command `ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"`, it ran successfully and displayed the IUPAC name for the SMILES input
🟢 The error I previously encountered (`TypeError: object of type 'NoneType' has no len()` when using the command `ersilia api run -i input.csv`) has been resolved. I resolved the issue by simply restarting my system; I guess some conflicting processes were running in the background, causing the error.
Here is a summary of the tasks carried out:
I can say with confidence that I have:
Up next: Week 3.
Week 3 | Task 3 - A third model suggestion
The paper describes how the authors used recurrent neural networks (RNNs) to generate novel molecules with good affinity to the desired biological target. The authors trained the RNNs on large datasets of drug-like molecules and fine-tuned them on small sets of molecules that are known to be active against specific pathogens, such as Staphylococcus aureus and Plasmodium falciparum. The paper has 736 citations.
The model does the following:
This model would be relevant to Ersilia because it could help us to generate new drug candidates for various infectious and neglected diseases. It could also help us to explore the chemical space and discover new scaffolds and motifs that have high affinity and specificity for the target pathogens.
The model can be installed through the model's repository on GitHub. The model's code is written in Python and uses libraries such as TensorFlow, Keras, RDKit, Scikit-learn, and Pandas. The data consists of several datasets that contain the SMILES strings and the activities of different molecules.
* Link to publication: click
* Link to model: click
* Link to code: click
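The core generative loop of such RNNs is autoregressive sampling: at each step the network emits a probability distribution over the next SMILES token. In the sketch below, a fixed, made-up transition table stands in for the trained network's softmax output (purely illustrative — the real model learns these probabilities from large datasets of drug-like molecules, and the token set here is tiny):

```python
import random

# Made-up next-token probabilities standing in for the RNN's softmax output.
# "^" marks start-of-sequence, "$" end-of-sequence.
transitions = {
    "^": (["C", "c"], [0.7, 0.3]),
    "C": (["C", "O", "$"], [0.5, 0.3, 0.2]),
    "c": (["c", "$"], [0.6, 0.4]),
    "O": (["C", "$"], [0.4, 0.6]),
}

def sample_smiles(rng, max_len=20):
    """Autoregressively sample tokens until end-of-sequence (or max_len)."""
    tokens, current = [], "^"
    for _ in range(max_len):
        choices, weights = transitions[current]
        current = rng.choices(choices, weights=weights)[0]
        if current == "$":
            break
        tokens.append(current)
    return "".join(tokens)

rng = random.Random(0)
print([sample_smiles(rng) for _ in range(3)])
```

Fine-tuning on pathogen-active molecules would, in this picture, simply shift the probability tables toward the substructures seen in the fine-tuning set.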
Week 2 | Task 4 - compare results with the Ersilia models
Overview
Find the selected model from task 2 in the Ersilia model hub, install it on my system, run predictions, and compare the output of the two.
Steps
* Search for the STOUT model in the [Ersilia model hub](https://www.ersilia.io/model-hub)
* Fetch and serve the model using `ersilia fetch eos4se9` and `ersilia serve eos4se9` respectively
* Run predictions with `ersilia api run -i input.csv`
Output comparison
Similarity
The original model and the model found in the Ersilia model hub yielded the same IUPAC names converted from SMILES nomenclature.
Difference
I also noticed that, unlike the original model, the output from the same model in the Ersilia hub is printed to the console in JSON format.
Result
Output from STOUT model in Ersilia's model hub
```json
{
  "input": {
    "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
    "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
    "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
  },
  "output": {
    "outcome": [
      "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
    ]
  }
}
{
  "input": {
    "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
    "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
    "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
  },
  "output": {
    "outcome": [
      "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol"
    ]
  }
}
{
  "input": {
    "key": "BZKPWHYZMXOIDC-UHFFFAOYSA-N",
    "input": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O",
    "text": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O"
  },
  "output": {
    "outcome": [
      "N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide"
    ]
  }
}
{
  "input": {
    "key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
    "input": "CC(O)=O",
    "text": "CC(O)=O"
  },
  "output": {
    "outcome": [
      "aceticacid"
    ]
  }
}
```
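Out of curiosity, JSON records with this shape can be flattened into CSV rows using only the standard library. This is a sketch: the record shape follows the output above, but `records_to_csv` and the column names are my own choices, and only one hypothetical record is shown:

```python
import csv
import io

# Prediction records shaped like the JSON output above (subset for illustration)
records = [
    {
        "input": {"key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "input": "CC(O)=O", "text": "CC(O)=O"},
        "output": {"outcome": ["aceticacid"]},
    },
]

def records_to_csv(records, fh):
    """Write one CSV row per prediction: key, SMILES, predicted IUPAC name."""
    writer = csv.writer(fh)
    writer.writerow(["key", "smiles", "iupac"])
    for rec in records:
        writer.writerow(
            [rec["input"]["key"], rec["input"]["input"], rec["output"]["outcome"][0]]
        )

buffer = io.StringIO()
records_to_csv(records, buffer)
print(buffer.getvalue())
```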
Hi @AlphonseBrandon thanks for the detailed updates. Could you confirm the version of STOUT-pypi that you installed on your system? I ask this because a lot of your peers are seeing different values between the latest version of STOUT and Ersilia's implementation.
Additionally, Ersilia allows you to specify which output format you want, and as you may have noticed, we lean strongly towards CSV files because JSON as a format becomes very 'technical', which we don't expect most of our users (mainly scientists) to be very comfortable with. You could try printing the output from Ersilia's STOUT implementation in CSV format if you'd like.
Week 3 | Task 1 - A first model suggestion
Overview
For readability, I will use a template containing 5 sections for all suggested models. The sections answer the questions for this task as stated in the Ersilia cookbook here and include the following:
1. Paper
2. Model Description
3. Relevance to Ersilia
4. Model Implementation availability
5. Relevant links
Paper | Analyzing Learned Molecular Representations for Property Prediction
The paper describes how the authors compared different molecular representations for property prediction, including expert-crafted descriptors, computed fingerprints, and graph convolutional neural networks. This paper has 548 citations.
Model Description | The graph convolutional neural network (GCN) model.
This model uses a graph-based representation of the molecular structure, where each atom is a node and each bond is an edge. The model learns a vector representation for each node by aggregating information from its neighbors. The model then combines the node vectors to obtain a molecular vector that can be used for property prediction.
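The aggregate-then-combine idea can be sketched in a few lines of plain Python. This is a toy, hand-rolled illustration with made-up two-dimensional features — not the authors' implementation, which learns these transformations from data:

```python
# Toy molecular graph: 3 atoms (nodes) with 2-d feature vectors,
# bonds given as an adjacency list.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adjacency = {0: [1], 1: [0, 2], 2: [1]}

def aggregate(features, adjacency):
    """One round of mean-aggregation: each node averages itself with its neighbors."""
    updated = {}
    for node, feat in features.items():
        neighborhood = [feat] + [features[n] for n in adjacency[node]]
        updated[node] = [sum(dim) / len(neighborhood) for dim in zip(*neighborhood)]
    return updated

def readout(features):
    """Combine node vectors into a single molecular vector by summing."""
    return [sum(dim) for dim in zip(*features.values())]

h1 = aggregate(features, adjacency)
mol_vec = readout(h1)
print(mol_vec)
```

In the real GCN, the averaging step is replaced by learned weight matrices and non-linearities, and the molecular vector feeds a property-prediction head.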
Relevance to Ersilia
This model would be relevant to Ersilia because it could help us predict various drug candidate properties and optimize them for drug discovery. It could also help us to understand the relationship between molecular structure and function and identify important substructures and motifs that influence the properties.
Model Implementation availability
To implement this model, we would need to access the code and the data that are provided by the authors. The code contains instructions on how to run the model and reproduce the results. We would need to install the required packages and dependencies and run the scripts in a suitable environment.
Relevant Links:
* Link to publication: click
* Link to code: click
* Link to dataset: click
Interesting choice @AlphonseBrandon. I see this is a representation learning framework. Could you also comment on which assays the authors have used to validate their graph representation? And which other benchmark representations/fingerprints have they carried out comparisons with? Feel free to post graphs/results from the paper with your understanding.
Week 3 | Task 2 - A second model suggestion
Paper | Neural Message Passing for Quantum Chemistry
The paper describes a general framework for supervised learning on graphs called Message Passing Neural Networks (MPNNs) that can learn their own features from molecular graphs directly and are invariant to graph isomorphism. The paper applies this framework to predict the quantum mechanical properties of small organic molecules using the QM9 dataset.
Model Description | The MPNN-EN model does the following:
* It takes as input a molecular graph with atom features (such as atomic number, degree, hybridization, etc.) and bond features (such as bond type, conjugation, etc.).
* It applies several layers of message passing to update the node and edge features by passing messages between neighboring nodes and edges. The messages are computed by a neural network that takes the sender and receiver node features and the edge feature as input and outputs a message vector.
* It applies a global pooling operation to combine the node and edge features into a molecular feature vector. The pooling operation is a sum over all nodes and edges weighted by learnable parameters.
* It applies a fully connected layer and an output layer to predict the molecular property of interest (such as solubility, toxicity, activity, etc.).
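The message-passing and pooling steps can be sketched as follows. This is illustrative only: the toy `message` function (scaling by a scalar bond feature) stands in for the learned message network, and the made-up numbers replace real atom/bond descriptors:

```python
# Toy message-passing sketch (the real MPNN uses learned neural networks
# for the message and update functions).
nodes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
# edges: (sender, receiver) -> scalar bond feature (e.g. bond order)
edges = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}

def message(sender_feat, edge_feat):
    # Stand-in for a learned message network: scale sender features by the bond feature
    return [edge_feat * x for x in sender_feat]

def message_passing(nodes, edges):
    """One step: each node adds the sum of its incoming messages to its own features."""
    updated = {n: list(f) for n, f in nodes.items()}
    for (sender, receiver), ef in edges.items():
        m = message(nodes[sender], ef)
        updated[receiver] = [a + b for a, b in zip(updated[receiver], m)]
    return updated

def global_pool(nodes):
    """Sum over all node features to get a molecular feature vector."""
    return [sum(dim) for dim in zip(*nodes.values())]

h1 = message_passing(nodes, edges)
print(global_pool(h1))
```

In the paper's framework, the pooled vector would then pass through fully connected layers to predict the property of interest.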
Relevance To Ersilia:
Ersilia's mission often involves an interdisciplinary approach where machine learning, chemistry, and biology intersect. Models like MPNN-EN bridge these disciplines, enabling more informed and data-driven decision-making in the context of infectious and neglected diseases.
Model Implementation availability
The code is available on GitHub. The code is written in Python and uses libraries such as PyTorch, PyTorch Geometric, RDKit, Scikit-learn, and Pandas. The data consists of several datasets that contain the SMILES strings and the properties of different molecules.
Relevant Links:
* Link to publication: click
* Link to model: click
* Link to code: click
* Link to dataset: click
@AlphonseBrandon interesting paper again! Could you also link the journal where you found this paper?
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application