Closed AlphonseBrandon closed 10 months ago
I am enthusiastic to contribute as an Outreachy intern to Ersilia's mission: equipping laboratories in low- and middle-income countries with state-of-the-art AI/ML tools for infectious and neglected disease research.
Ersilia's goals will have a direct impact on my country, Cameroon, and my immediate community in Buea, where a pilot program is currently running at the Centre for Drug Discovery at the University of Buea (of which I am an alumnus and where I have a good relationship with the professors).
Additionally, Ersilia's open-source project is a unique opportunity for me to apply my data science and machine learning knowledge to solve real-world problems that will have a major impact in my community and the Global South (sub-Saharan Africa included).
Furthermore, I am impressed with the Ersilia team and their commitment to mentoring interns, a perfect environment for me to thrive.
Thank you for considering my application. I look forward to the possibility of joining the Ersilia team as an intern for the Outreachy winter 2023 round.
One of my first tasks was to join Ersilia's communication channels. This step allowed me to connect with the project's community, understand its values, and engage in discussions with other applicants in the contribution phase.
I opened an issue (this contribution-tracking template provided by the Ersilia team). This issue serves as a starting point for collaboration and discussion within the community, highlighting my commitment to actively contributing to the project's growth.
To better understand the platform and its functionality, I installed the Ersilia Model Hub locally. This hands-on experience allowed me to gain insights into the current system and identify potential areas for enhancement.
As part of my application to the Outreachy program with Ersilia, I wrote a motivation statement explaining why I am eager to work with the organization. This statement reflects my genuine passion for Ersilia's mission and my desire to contribute my skills to support its goals.
This step demonstrates my commitment to becoming an integral part of the Ersilia community and contributing to the success of the Ersilia Model Hub.
This was the icing on the cake for week 1: I got to learn from the Ersilia team about the impact of this project in my immediate community through the work going on at the University of Buea's Centre for Drug Discovery.
These tasks represent my initial efforts to become an active participant in the Ersilia open-source project. I am excited about the journey ahead and look forward to collaborating with the Ersilia team to enhance the Ersilia Model Hub, ultimately contributing to the advancement of AI/ML models for biomedical research and furthering Ersilia's mission.
I am grateful for the opportunity to be a part of this project, and I am excited to learn, grow, and make a positive impact within the Ersilia community.
This week looks promising, and I look forward to completing most of the contribution-phase tasks, pumped up by how interesting the project has been so far.
This week I have the following tasks on my to-do list:
I am definitely going to have a lot of fun doing these tasks; let's see how it goes.
I was tasked with selecting one of the four models listed in the Ersilia Book here.
I selected the STOUT (SMILES to IUPAC) model. GitHub Link
I also found the paper for the model, which I can use to explore further. Link to the paper
Transformers are common in large language models, an area in which I have developed a special interest over the last few months.
The next step is to install the model.
I followed the installation instructions in the model's GitHub repository.
🔴 I encountered errors installing from conda using the command:

```shell
conda install -c decimer stout-pypi
```

🟢 So I installed through pip instead:

```shell
pip install STOUT-pypi
```
To test whether the installation was successful, I ran this starter code from the model's GitHub repository:

```python
from STOUT import translate_forward, translate_reverse

# SMILES to IUPAC name translation
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of " + SMILES + " is: " + IUPAC_name)

# IUPAC name to SMILES translation
IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of " + IUPAC_name + " is: " + SMILES)
```

Output:

```
SMILES of 1,3,7-trimethylpurine-2,6-dione is: CN1C=NC2=C1C(=O)N(C)C(=O)N2C
```
In this task, I ran predictions for the Essential Medicines List (EML).
The code below reads the EML data from a CSV file, extracts the canonical SMILES strings, translates them to IUPAC names using the STOUT library, and then prints the IUPAC names to the console and writes them to a CSV file.
I put some effort into documenting the code with docstrings and using a modular style of programming, splitting the logic into small functions so that the code becomes self-explanatory.
```python
import csv

from STOUT import translate_forward


def read_eml_csv(file_path):
    """
    Reads a CSV file containing EML data and returns its contents as a list of lists.

    Args:
        file_path (str): The path to the CSV file to read.

    Returns:
        list: A list of lists containing the contents of the CSV file.
    """
    with open(file_path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        return list(reader)


def get_canonical_smiles_list(eml_list):
    """
    Extracts the canonical SMILES strings from a list of EML data.

    Args:
        eml_list (list): A list of lists containing EML data.

    Returns:
        list: A list of canonical SMILES strings.
    """
    # Column 3 holds the canonical SMILES; the slice skips the header row and,
    # for this first run, limits the sample to the first two molecules.
    return [row[2] for row in eml_list[1:3]]


def translate_smiles_to_iupac(can_smiles_list):
    """
    Translates a list of SMILES strings to IUPAC names using the STOUT model.

    Args:
        can_smiles_list (list): A list of canonical SMILES strings.

    Returns:
        list: A list of IUPAC names.
    """
    iupac_list = []
    for smiles in can_smiles_list:
        iupac = translate_forward(smiles)
        iupac_list.append(iupac)
    return iupac_list


def print_iupac_list(iupac_list):
    """
    Prints a list of IUPAC names to the console.

    Args:
        iupac_list (list): A list of IUPAC names.
    """
    for iupac in iupac_list:
        print(iupac)


def write_iupac_csv(file_path, iupac_list):
    """
    Writes a list of IUPAC names to a CSV file.

    Args:
        file_path (str): The path to the CSV file to write.
        iupac_list (list): A list of IUPAC names.
    """
    with open(file_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerows([[iupac] for iupac in iupac_list])


# Read the EML data from a CSV file
eml_list = read_eml_csv("../data/smiles/eml_canonical.csv")

# Extract the canonical SMILES strings from the EML data
can_smiles_list = get_canonical_smiles_list(eml_list)

# Translate the SMILES strings to IUPAC names
iupac_list = translate_smiles_to_iupac(can_smiles_list)

# Print the IUPAC names to the console
print_iupac_list(iupac_list)

# Write the IUPAC names to a CSV file
write_iupac_csv("../data/iupac/predicted_iupac.csv", iupac_list)
```
The output of the code is a list of IUPAC names printed to the console and written to a CSV file. Example:
13-dimethyl-17-pyridin-3-yl
Ersilia can run by downloading models from GitHub (using Git-LFS), from S3 buckets (Ersilia's AWS backend), or as Docker containers.
```shell
curl -fsSL https://get.docker.com/ | sh
docker run hello-world
```

```
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b901d36b6f2f: Pull complete
0a6ba66e537a: Pull complete
Digest: sha256:8be990ef2aeb16dbcb9271ddfe2610fa6658d13f6dfb8bc72074cc1ca36966a7
Status: Downloaded newer image for hello-world:latest

Hello from Docker.
This message shows that your installation appears to be working correctly.
```
- I proceeded to search for the STOUT model in [Ersilia's model hub](https://www.ersilia.io/model-hub)
- The STOUT model has the identifier `eos4se9` and the slug name `smiles2iupac`
- I then fetched and served the model using `ersilia fetch eos4se9` and `ersilia serve eos4se9` respectively
- Finally, I ran the model using `ersilia -v api run -i input.csv -o result3.csv`
🔴 On running the model from an input file, I encountered an error that read
`TypeError: object of type 'NoneType' has no len()`
🟢 On running the model with an input string using the command `ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"`, it ran successfully and displayed the IUPAC name for the SMILES input
🟢 The error I previously encountered (`TypeError: object of type 'NoneType' has no len()` when using the command `ersilia api run -i input.csv`) has been resolved. I resolved the issue by simply restarting my system; I guess some conflicting processes were running in the background, causing the error.
Here is a summary of the tasks carried out:
I can say with confidence that I have:
Up next: Week 3.
Week 3 | Task 3 - A third model suggestion
The paper describes how the authors used recurrent neural networks (RNNs) to generate novel molecules with good affinity to the desired biological target. The authors trained the RNNs on large datasets of drug-like molecules and fine-tuned them on small sets of molecules that are known to be active against specific pathogens, such as Staphylococcus aureus and Plasmodium falciparum. The paper has 736 citations.
The model does the following:
This model would be relevant to Ersilia because it could help us to generate new drug candidates for various infectious and neglected diseases. It could also help us to explore the chemical space and discover new scaffolds and motifs that have high affinity and specificity for the target pathogens.
The model can be installed through the model's repository on GitHub. The model's code is written in Python and uses libraries such as TensorFlow, Keras, RDKit, Scikit-learn, and Pandas. The data consists of several datasets that contain the SMILES strings and the activities of different molecules.
* Link to publication: click
* Link to model: click
* Link to code: click
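The core generative loop of such RNNs is autoregressive sampling: at each step the network emits a probability distribution over the next SMILES token. In the sketch below, a fixed, made-up transition table stands in for the trained network's softmax output (purely illustrative — the real model learns these probabilities from large datasets of drug-like molecules, and the token set here is tiny):

```python
import random

# Made-up next-token probabilities standing in for the RNN's softmax output.
# "^" marks start-of-sequence, "$" end-of-sequence.
transitions = {
    "^": (["C", "c"], [0.7, 0.3]),
    "C": (["C", "O", "$"], [0.5, 0.3, 0.2]),
    "c": (["c", "$"], [0.6, 0.4]),
    "O": (["C", "$"], [0.4, 0.6]),
}

def sample_smiles(rng, max_len=20):
    """Autoregressively sample tokens until end-of-sequence (or max_len)."""
    tokens, current = [], "^"
    for _ in range(max_len):
        choices, weights = transitions[current]
        current = rng.choices(choices, weights=weights)[0]
        if current == "$":
            break
        tokens.append(current)
    return "".join(tokens)

rng = random.Random(0)
print([sample_smiles(rng) for _ in range(3)])
```

Fine-tuning on pathogen-active molecules would, in this picture, simply shift the probability tables toward the substructures seen in the fine-tuning set.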
Week 2 | Task 4 - compare results with the Ersilia models
Overview
Find the selected model from task 2 in the Ersilia model hub, install it on my system, run predictions, and compare the output of the two.
Steps
* Search for the STOUT model in the [Ersilia model hub](https://www.ersilia.io/model-hub)
* Fetch and serve the model using `ersilia fetch eos4se9` and `ersilia serve eos4se9` respectively
* Run predictions with `ersilia api run -i input.csv`
Output comparison
Similarity
The original model and the model found in the Ersilia model hub yielded the same IUPAC names converted from SMILES nomenclature.
Difference
I also noticed that, unlike the original model, the output from the same model in the Ersilia hub is printed to the console in JSON format.
Result
Output from STOUT model in Ersilia's model hub
```json
{
  "input": {
    "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
    "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
    "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
  },
  "output": {
    "outcome": [
      "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
    ]
  }
}
{
  "input": {
    "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
    "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
    "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
  },
  "output": {
    "outcome": [
      "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol"
    ]
  }
}
{
  "input": {
    "key": "BZKPWHYZMXOIDC-UHFFFAOYSA-N",
    "input": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O",
    "text": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O"
  },
  "output": {
    "outcome": [
      "N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide"
    ]
  }
}
{
  "input": {
    "key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
    "input": "CC(O)=O",
    "text": "CC(O)=O"
  },
  "output": {
    "outcome": [
      "aceticacid"
    ]
  }
}
```
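Out of curiosity, JSON records with this shape can be flattened into CSV rows using only the standard library. This is a sketch: the record shape follows the output above, but `records_to_csv` and the column names are my own choices, and only one hypothetical record is shown:

```python
import csv
import io

# Prediction records shaped like the JSON output above (subset for illustration)
records = [
    {
        "input": {"key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "input": "CC(O)=O", "text": "CC(O)=O"},
        "output": {"outcome": ["aceticacid"]},
    },
]

def records_to_csv(records, fh):
    """Write one CSV row per prediction: key, SMILES, predicted IUPAC name."""
    writer = csv.writer(fh)
    writer.writerow(["key", "smiles", "iupac"])
    for rec in records:
        writer.writerow(
            [rec["input"]["key"], rec["input"]["input"], rec["output"]["outcome"][0]]
        )

buffer = io.StringIO()
records_to_csv(records, buffer)
print(buffer.getvalue())
```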
Hi @AlphonseBrandon thanks for the detailed updates. Could you confirm the version of STOUT-pypi that you installed on your system? I ask this because a lot of your peers are seeing different values between the latest version of STOUT and Ersilia's implementation.
Additionally, Ersilia allows you to specify which output format you want, and as you may have noticed, we lean strongly towards CSV files because JSON as a format becomes very 'technical', which we don't expect most of our users (mainly scientists) to be very comfortable with. You could try printing the output from Ersilia's STOUT implementation in CSV format if you'd like.
Week 3 | Task 1 - A first model suggestion
Overview
For readability, I will use a template containing 5 sections for all suggested models. The sections answer the questions for this task as stated in the Ersilia cookbook here and include the following:
1. Paper
2. Model Description
3. Relevance to Ersilia
4. Model Implementation availability
5. Relevant links
Paper | Analyzing Learned Molecular Representations for Property Prediction
The paper describes how the authors compared different molecular representations for property prediction, including expert-crafted descriptors, computed fingerprints, and graph convolutional neural networks. This paper has 548 citations.
Model Description | The graph convolutional neural network (GCN) model.
This model uses a graph-based representation of the molecular structure, where each atom is a node and each bond is an edge. The model learns a vector representation for each node by aggregating information from its neighbors. The model then combines the node vectors to obtain a molecular vector that can be used for property prediction.
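The aggregate-then-combine idea can be sketched in a few lines of plain Python. This is a toy, hand-rolled illustration with made-up two-dimensional features — not the authors' implementation, which learns these transformations from data:

```python
# Toy molecular graph: 3 atoms (nodes) with 2-d feature vectors,
# bonds given as an adjacency list.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adjacency = {0: [1], 1: [0, 2], 2: [1]}

def aggregate(features, adjacency):
    """One round of mean-aggregation: each node averages itself with its neighbors."""
    updated = {}
    for node, feat in features.items():
        neighborhood = [feat] + [features[n] for n in adjacency[node]]
        updated[node] = [sum(dim) / len(neighborhood) for dim in zip(*neighborhood)]
    return updated

def readout(features):
    """Combine node vectors into a single molecular vector by summing."""
    return [sum(dim) for dim in zip(*features.values())]

h1 = aggregate(features, adjacency)
mol_vec = readout(h1)
print(mol_vec)
```

In the real GCN, the averaging step is replaced by learned weight matrices and non-linearities, and the molecular vector feeds a property-prediction head.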
Relevance to Ersilia
This model would be relevant to Ersilia because it could help us predict various drug candidate properties and optimize them for drug discovery. It could also help us to understand the relationship between molecular structure and function and identify important substructures and motifs that influence the properties.
Model Implementation availability
To implement this model, we would need to access the code and the data that are provided by the authors. The code contains instructions on how to run the model and reproduce the results. We would need to install the required packages and dependencies and run the scripts in a suitable environment.
Relevant Links:
* Link to publication: click
* Link to code: click
* Link to dataset: click
Interesting choice @AlphonseBrandon. I see this is a representation learning framework. Could you also comment on which assays the authors have used to validate their graph representation? And which other benchmark representations/fingerprints have they carried out comparisons with? Feel free to post graphs/results from the paper with your understanding.
Week 3 | Task 2 - A second model suggestion
Paper | Neural Message Passing for Quantum Chemistry
The paper describes a general framework for supervised learning on graphs called Message Passing Neural Networks (MPNNs) that can learn their own features from molecular graphs directly and are invariant to graph isomorphism. The paper applies this framework to predict the quantum mechanical properties of small organic molecules using the QM9 dataset.
Model Description | The MPNN-EN model does the following:
* It takes as input a molecular graph with atom features (such as atomic number, degree, hybridization, etc.) and bond features (such as bond type, conjugation, etc.).
* It applies several layers of message passing to update the node and edge features by passing messages between neighboring nodes and edges. The messages are computed by a neural network that takes the sender and receiver node features and the edge feature as input and outputs a message vector.
* It applies a global pooling operation to combine the node and edge features into a molecular feature vector. The pooling operation is a sum over all nodes and edges weighted by learnable parameters.
* It applies a fully connected layer and an output layer to predict the molecular property of interest (such as solubility, toxicity, activity, etc.).
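The message-passing and pooling steps can be sketched as follows. This is illustrative only: the toy `message` function (scaling by a scalar bond feature) stands in for the learned message network, and the made-up numbers replace real atom/bond descriptors:

```python
# Toy message-passing sketch (the real MPNN uses learned neural networks
# for the message and update functions).
nodes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
# edges: (sender, receiver) -> scalar bond feature (e.g. bond order)
edges = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}

def message(sender_feat, edge_feat):
    # Stand-in for a learned message network: scale sender features by the bond feature
    return [edge_feat * x for x in sender_feat]

def message_passing(nodes, edges):
    """One step: each node adds the sum of its incoming messages to its own features."""
    updated = {n: list(f) for n, f in nodes.items()}
    for (sender, receiver), ef in edges.items():
        m = message(nodes[sender], ef)
        updated[receiver] = [a + b for a, b in zip(updated[receiver], m)]
    return updated

def global_pool(nodes):
    """Sum over all node features to get a molecular feature vector."""
    return [sum(dim) for dim in zip(*nodes.values())]

h1 = message_passing(nodes, edges)
print(global_pool(h1))
```

In the paper's framework, the pooled vector would then pass through fully connected layers to predict the property of interest.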
Relevance To Ersilia:
Ersilia's mission often involves an interdisciplinary approach where machine learning, chemistry, and biology intersect. Models like MPNN-EN bridge these disciplines, enabling more informed and data-driven decision-making in the context of infectious and neglected diseases.
Model Implementation availability
The code is available on GitHub. The code is written in Python and uses libraries such as PyTorch, PyTorch Geometric, RDKit, Scikit-learn, and Pandas. The data consists of several datasets that contain the SMILES strings and the properties of different molecules.
Relevant Links:
* Link to publication: click
* Link to model: click
* Link to code: click
* Link to dataset: click
@AlphonseBrandon interesting paper again! Could you also link the journal where you found this paper?
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application