ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Maureen Mugo #822

Closed maureen-mugo closed 10 months ago

maureen-mugo commented 11 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

maureen-mugo commented 11 months ago

Task 1: Join the Communication Channels
I joined the official communication channel, Slack, through the invitation link provided and introduced myself.

Task 2: Open a GitHub Issue (this one!)
I opened the issue (this one).

DhanshreeA commented 11 months ago

Hi @maureen-mugo please go through the rest of the steps for week 1 and report your progress, or any issues you run into here. Thanks.

maureen-mugo commented 11 months ago

Task 3: Install the Ersilia Model Hub and test the simplest model
I followed the instructions provided here.

  1. Installing Pre-requisites
    I'm working on a Linux computer running Ubuntu 22.04.
    I also have mamba instead of conda, and I had already installed Git LFS.
    So my first step was creating the ersilia environment with Python version 3.7 using mamba create -n ersilia python=3.7

Next, I activated the environment using
mamba activate ersilia

Next, I installed the Isaura data lake using python -m pip install isaura==0.1 in the ersilia environment

Lastly, I installed Docker and confirmed it's working

  2. Installing Ersilia
    I installed the Ersilia Python package by cloning it from GitHub using git clone https://github.com/ersilia-os/ersilia.git

I then changed my directory to ersilia using cd ersilia and installed it in developer mode using pip install -e .

Next, I checked that everything was working. I listed the ersilia CLI options using ersilia --help

I also checked ersilia's model catalog using ersilia catalog

All worked well.

  3. Testing a simple model
    I used ersilia -v fetch eos3b5e to get the model

then ersilia serve eos3b5e

I then attempted to calculate the molecular weight of the molecules using ersilia -v api calculate -i "CCCC" but got an error

which was discussed here. Hellen suggested using ersilia -v run -i "CCCC" instead. This gave me a TypeError in the file.py file

I changed the suggested line 321 of the file.py file

I then re-ran ersilia -v run -i "CCCC", which gave me the output.
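As a quick sanity check on what eos3b5e returns, the molecular weight of "CCCC" (butane, C4H10) can be worked out by hand from standard atomic weights. This small sketch is illustrative only and does not call Ersilia:

```python
# Sanity check for the eos3b5e output: compute the molecular weight of
# butane (SMILES "CCCC", molecular formula C4H10) from atomic masses.
ATOMIC_MASS = {"C": 12.011, "H": 1.008}  # standard atomic weights

def molecular_weight(formula: dict) -> float:
    """Sum atomic masses over a {element: count} formula."""
    return sum(ATOMIC_MASS[el] * n for el, n in formula.items())

butane = {"C": 4, "H": 10}  # "CCCC" with its implicit hydrogens
print(round(molecular_weight(butane), 2))  # 58.12
```

The model's output for "CCCC" should be close to this hand-computed 58.12 g/mol.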

maureen-mugo commented 11 months ago

Task 4: Write a motivation statement to work at Ersilia
My name is Maureen Mugo, and I am an aspiring Machine Learning Engineer and Data Scientist from Kenya with a strong skillset in Python, Machine Learning, PyTorch, and CUDA. After thoroughly researching your organization and learning about your mission, goals, and roadmap, I am convinced that this internship aligns perfectly with my aspirations and career path. Ersilia's dedication to promoting open science and leveraging artificial intelligence and machine learning (AI/ML) for biomedical research is both inspiring and commendable. Your commitment to addressing neglected diseases and improving healthcare access in Low and Middle-Income countries resonates deeply with me.

I have always been passionate about the use of Artificial Intelligence in health and medicine. Even though I have little knowledge in biochemistry and drug discovery, I would appreciate the opportunity to intern at Ersilia as I will sharpen my skills and expand my knowledge in the medical field. Equally, I believe working with the great minds and talent at Ersilia will help me to not only improve my skills, but also contribute to Ersilia's success.

I am excited about the opportunity to advance my skills in Machine Learning and Docker through this internship. The knowledge and experience I gain will help me achieve my long-term goal: applying my skills in the health sector, whether through research or the development of models.

I look forward to the possibility of contributing to the important work of Ersilia and furthering my career in a field that I am deeply passionate about.

maureen-mugo commented 11 months ago

Task 5: Submit your first contribution to the Outreachy site
I've just finished my first contribution on the Outreachy site.

carcablop commented 11 months ago

Hello @maureen-mugo Thank you for your contribution and interest in Ersilia.

Some suggestions to keep in mind:

maureen-mugo commented 11 months ago

Hi @carcablop. Thank you for your input on my contribution; I really appreciate the feedback. I'll try reinstalling Ersilia and running it again, and I'll report any issues I encounter or post an update once I'm able to run it successfully.

carcablop commented 11 months ago

@maureen-mugo I'm glad it's already working. Could you please share the output? For example, you can run this command: ersilia -v run -i "CCC" > eos3b5e.log 2>&1. Thanks

maureen-mugo commented 10 months ago

Hey @carcablop. I was able to reinstall it with Python version 3.10 but still got an error. However, after going through this discussion, I reinstalled everything from scratch without the Isaura data lake. Everything is working well. This is the output.

Once again, thank you for your feedback and guidance.

maureen-mugo commented 10 months ago

Can I proceed to week 2 tasks?

maureen-mugo commented 10 months ago

Week 2 - Install and run an ML model

Task 6: Select a model from the suggested list
After going through the list of models provided here, I have selected the SARS-CoV2 activity (ImageMol) model. After reading its publication, Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework, I was specifically drawn to explore the model because it supports drug discovery and molecular target prediction.

I'm looking forward to learning more about the model as I interact with it.

maureen-mugo commented 10 months ago

Task 7: Install the model in your system
I followed the instructions provided here to install the model.

I already have a CUDA 10.1 machine, so I proceeded with the other steps.

  1. Creating a new environment
    First, I created an imagemol environment with Python version 3.7.3 using mamba create -n imagemol python=3.7.3. I then activated the environment using mamba activate imagemol.

  2. Installing packages
    The next step was to install the following packages:

    • rdkit: installed using mamba install -c rdkit rdkit
    • PyTorch: installed using pip install torch==1.4.0 torchvision==0.5.0
    • torch cluster: installed using mamba install pytorch-cluster -c pyg
    • torch-scatter: installed using mamba install pytorch-scatter -c pyg
    • torch-sparse: installed using mamba install pytorch-sparse -c pyg
    • torch-spline-conv: installed using mamba install pytorch-spline-conv -c pyg

Lastly, I cloned into their repository using git clone git@github.com:HongxinXiang/ImageMol.git. I then changed my directory using cd ImageMol and downloaded the requirements using pip install -r requirements.txt.

maureen-mugo commented 10 months ago

Switching to STOUT

I have decided to switch to the STOUT model for several reasons. First, when training the model using the instructions provided here, my GPU took 16 hours to preprocess the data because the dataset is very large. I used the following command: python ./data_process/smiles2img_pretrain.py --dataroot ./datasets/pretraining/ --dataset data. This wasn't sustainable, as training and fine-tuning the model would take me a long time. Also, I didn't quite understand the documentation on how to use their pre-trained models to make predictions. Even so, I was able to run the Ersilia equivalent model eos4cxk and got the required output from it.
As I am not able to compare it with ImageMol, I couldn't proceed with the model, hence switching to STOUT. In my free time, I will try to figure it out, as I was really interested in it.

The reason I chose STOUT is its ability to help scientists and researchers minimize human error when dealing with molecule names. After reading its publication (SMILES-TO-IUPAC-name translator), I understand that STOUT is a deep-learning neural machine translation model that generates the IUPAC name for a given molecule from its SMILES string, as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name.

maureen-mugo commented 10 months ago

Task 7: Install the model in your system
I followed the instructions provided here.
First, I created a stout environment using mamba create --name STOUT python=3.8 and then activated the environment using mamba activate STOUT.
I then installed STOUT from the decimer channel using mamba install -c decimer stout-pypi. Lastly, I installed its setup using pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git as instructed.

I ran a test to check if it's working using this:


from STOUT import translate_forward, translate_reverse

# SMILES to IUPAC name translation

SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)

# IUPAC name to SMILES translation

IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)

and the output was:

IUPAC name of CN1C=NC2=C1C(=O)N(C(=O)N2C)C is: 1,3,7-trimethylpurine-2,6-dione
SMILES of 1,3,7-trimethylpurine-2,6-dione is: CN1C=NC2=C1C(=O)N(C)C(=O)N2C

for the translate forward and translate reverse respectively.

maureen-mugo commented 10 months ago

Task 8: Run predictions for the EML
First, I downloaded the Essential Medicines List (EML) provided by Ersilia and copied it to my working directory. I then used the following code to get translate-forward and translate-reverse outputs for the first 7 SMILES in the eml_canonical.csv file:


import pandas as pd
from STOUT import translate_forward, translate_reverse

df = pd.read_csv('eml_canonical.csv')
smiles = df.smiles.to_list()[:7] # taking 7 sample SMILES for our example
iupac_names = [] # to store our converted IUPAC Names

print('Performing Translate Forward on our 7 sample SMILES')
print('*'*20)
# Translate Forward
for item in smiles:
    IUPAC_name = translate_forward(item)
    iupac_names.append(IUPAC_name) # add to our list
    print("IUPAC name of "+item+" is: "+IUPAC_name)

print('\n')
print('Performing Translate Reverse on our 7 sample IUPAC NAMES')
print('*'*20)
# Translate Reverse
for item in iupac_names:
    SMILES = translate_reverse(item)
    print("SMILES of "+ item +" is: "+SMILES)

The output is as follows:

Performing Translate Forward on our 7 sample SMILES
********************
IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 is: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
IUPAC name of C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 is: (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
IUPAC name of CC(=O)Nc1sc(nn1)[S](N)(=O)=O is: N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
IUPAC name of CC(O)=O is: aceticacid
IUPAC name of CC(=O)N[C@@H](CS)C(O)=O is: (2R)-2-acetamido-3-sulfanylpropanoicacid
IUPAC name of CC(=O)Oc1ccccc1C(O)=O is: 2-acetyloxybenzoicacid
IUPAC name of NC1=NC(=O)c2ncn(COCCO)c2N1 is: 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one

Performing Translate Reverse on our 7 sample IUPAC NAMES
********************
SMILES of [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol is: C1=C[C@@H](C[C@@H]1CO)N2C=NC3=C2N=C(N)N=C3NC4CC4
SMILES of (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol is: C[C@@]12CC[C@@H](CC2=CC[C@H]3[C@@H]4CC=C(C5=CN=CC=C5)[C@@]4(C)CC[C@@H]31)O
SMILES of N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide is: CC(=O)NC1=NN=C(S1)S(=O)(=O)N
SMILES of aceticacid is: CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC
SMILES of (2R)-2-acetamido-3-sulfanylpropanoicacid is: CC(=O)N[C@@H](CS)C(=O)O
SMILES of 2-acetyloxybenzoicacid is: CC(=O)OC1=C(C=CC=C1)C(=O)O.CC(=O)OC1=C(C=CC=C1)C(=O)O
SMILES of 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one is: C(COCN1C=NC2=C1NC(=NC2=O)N)O

For this, I used STOUT 2.0.6

maureen-mugo commented 10 months ago

Task 9: Compare results with the Ersilia Model Hub implementation
I looked for the STOUT-equivalent model in the Ersilia Model Hub and found the SMILES to IUPAC name translator model (eos4se9). I used the following code (in a Jupyter notebook, hence the ! shell escapes) to fetch, serve and run the model:

import pandas as pd
from ersilia import ErsiliaModel

df = pd.read_csv('eml_canonical.csv')
smiles = df.smiles.to_list()[:7] # taking 7 sample SMILES for our example

model_name = "eos4se9"

!ersilia fetch $model_name #fetching the smiles to IUPAC name translator model

!ersilia serve $model_name #serving the model

model = ErsiliaModel(model_name)

output = model.api(input=smiles, output="pandas")

I then used the following code to get the SMILES and IUPAC names from the output file:

ersilia_iupac_names = output.iupacs_names.to_list()

print('IUPAC NAMES converted using Ersilia Model')
print('*'*20)
# Translate Forward
for item in zip(smiles, ersilia_iupac_names):
    print("IUPAC name of "+item[0]+" is: "+item[1])

The output is as follows:

IUPAC NAMES converted using Ersilia Model
********************
IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 is: [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol
IUPAC name of C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 is: (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol
IUPAC name of CC(=O)Nc1sc(nn1)[S](N)(=O)=O is: N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide
IUPAC name of CC(O)=O is: aceticacid
IUPAC name of CC(=O)N[C@@H](CS)C(O)=O is: (2R)-2-acetamido-3-sulfanylpropanoicacid
IUPAC name of CC(=O)Oc1ccccc1C(O)=O is: 2-acetyloxybenzoicacid
IUPAC name of NC1=NC(=O)c2ncn(COCCO)c2N1 is: 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one
maureen-mugo commented 10 months ago

For better visualization of the output, we can use the following table:

| SMILES | STOUT IUPAC NAME | ERSILIA IUPAC NAME |
| --- | --- | --- |
| `Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1` | [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol | [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol |
| `C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5` | (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol | (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol |
| `CC(=O)Nc1sc(nn1)[S](N)(=O)=O` | N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide | N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide |
| `CC(O)=O` | aceticacid | aceticacid |
| `CC(=O)N[C@@H](CS)C(O)=O` | (2R)-2-acetamido-3-sulfanylpropanoicacid | (2R)-2-acetamido-3-sulfanylpropanoicacid |
| `CC(=O)Oc1ccccc1C(O)=O` | 2-acetyloxybenzoicacid | 2-acetyloxybenzoicacid |
| `NC1=NC(=O)c2ncn(COCCO)c2N1` | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one |

From the table above, we can see that the two models agreed on some IUPAC names and differed on others: the first three SMILES got different IUPAC names from the two models, while the other four got the same name.
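The agreement between the two models can also be tallied with a short script; the name lists below are copied verbatim from the outputs above:

```python
# Count how many of the 7 EML sample molecules get the same IUPAC name
# from STOUT 2.0.6 and from the Ersilia model eos4se9.
stout = [
    "[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol",
    "(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol",
    "N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide",
    "aceticacid",
    "(2R)-2-acetamido-3-sulfanylpropanoicacid",
    "2-acetyloxybenzoicacid",
    "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one",
]
ersilia = [
    "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol",
    "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol",
    "N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide",
    "aceticacid",
    "(2R)-2-acetamido-3-sulfanylpropanoicacid",
    "2-acetyloxybenzoicacid",
    "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one",
]
matches = sum(a == b for a, b in zip(stout, ersilia))
print(f"{matches}/{len(stout)} names agree")  # 4/7 names agree
```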

DhanshreeA commented 10 months ago

Thank you for the updates @maureen-mugo. Could you also try using Stout version 2.0.1 as used here and then compare the results between direct STOUT output and the Ersilia model output?

maureen-mugo commented 10 months ago

Hi @DhanshreeA , I will certainly try STOUT version 2.0.1 and compare it with the Ersilia model. Then proceed with the other steps. Thanks for the feedback so far.

maureen-mugo commented 10 months ago

Installing STOUT version 2.0.1

I took the following steps to install STOUT version 2.0.1:

  1. I removed the mamba environment that I previously created with STOUT version 2.0.6.
  2. I created a new environment using mamba create --name stout python=3.8 and then activated it using mamba activate stout.
  3. I then installed STOUT using pip install STOUT-pypi==2.0.1.
  4. Lastly, I installed its setup using pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git.

Running STOUT version 2.0.1
I used the following code to test the model against the Essential Medicines List:


import pandas as pd
from STOUT import translate_forward, translate_reverse
import STOUT

df = pd.read_csv('eml_canonical.csv')
smiles = df.smiles.to_list()[:7] # taking 7 sample SMILES for our example
iupac_names = [] # to store our converted IUPAC Names

print(f'Using STOUT version: {STOUT.__version__}')
print('Performing Translate Forward on our 7 sample SMILES')
print('*'*20)
# Translate Forward
for item in smiles:
    IUPAC_name = translate_forward(item)
    iupac_names.append(IUPAC_name) # add to our list
    print("IUPAC name of "+item+" is: "+IUPAC_name)

print('\n')
print('Performing Translate Reverse on our 7 sample IUPAC NAMES')
print('*'*20)
# Translate Reverse
for item in iupac_names:
    SMILES = translate_reverse(item)
    print("SMILES of "+ item +" is: "+SMILES)

I got the following output:

Using STOUT version: 2.0.0
Performing Translate Forward on our 7 sample SMILES
********************
IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 is: [(1S,4R)-4-[6-(cyclopropylamino)-2-(methylamino)-6H-purin-9-yl]cyclopent-2-en-1-yl]methanol
IUPAC name of C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 is: (3S,6aS,6bR,10aR,10bS)-6a,10a-dimethyl-7-pyridin-3-yl-1,2,3,4,6,7,10,10b-octahydrobenzo[a]azulen-3-ol
IUPAC name of CC(=O)Nc1sc(nn1)[S](N)(=O)=O is: N-[5-[amino(dioxo)-λ6-sulfanyl]-1,3,4-thiadiazol-2-yl]acetamide
IUPAC name of CC(O)=O is: aceticacid
IUPAC name of CC(=O)N[C@@H](CS)C(O)=O is: (2R)-2-acetamido-3-sulfanylpropanoicacid
IUPAC name of CC(=O)Oc1ccccc1C(O)=O is: 2-acetyloxybenzoicacid
IUPAC name of NC1=NC(=O)c2ncn(COCCO)c2N1 is: 9-(2-hydroxyethoxymethyl)-2-(methylamino)-3H-purin-6-one

Performing Translate Reverse on our 7 sample IUPAC NAMES
********************
SMILES of [(1S,4R)-4-[6-(cyclopropylamino)-2-(methylamino)-6H-purin-9-yl]cyclopent-2-en-1-yl]methanol is: CNC1=NC(C2=C(N1)N(C=N2)[C@H]3C=C[C@H](C3)CO)NC4CC4
SMILES of (3S,6aS,6bR,10aR,10bS)-6a,10a-dimethyl-7-pyridin-3-yl-1,2,3,4,6,7,10,10b-octahydrobenzo[a]azulen-3-ol is: C[C@@]12CC=CC(C3=CC=CN=C3)[C@H]1[C@@]4(C)CC[C@@H](CC4=C2C[C@H](CC=C5)O)O
SMILES of N-[5-[amino(dioxo)-λ6-sulfanyl]-1,3,4-thiadiazol-2-yl]acetamide is: CC(=O)NC1=NN=C(S1)S(=O)(=O)N
SMILES of aceticacid is: CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC(=O)O.CC
SMILES of (2R)-2-acetamido-3-sulfanylpropanoicacid is: CC(=O)N[C@@H](CS)C(=O)O
SMILES of 2-acetyloxybenzoicacid is: CC(=O)OC1=C(C=CC=C1)C(=O)O.CC(=O)OC1=C(C=CC=C1)C(=O)O
SMILES of 9-(2-hydroxyethoxymethyl)-2-(methylamino)-3H-purin-6-one is: CNC1=NC(=O)C2=C(N1)N(C=N2)COCCO

@DhanshreeA I installed STOUT version 2.0.1, as seen in this log file (packages.log), which lists the packages in my stout environment. According to the log file, STOUT is at version 2.0.1, but printing the version at runtime shows that my code is running version 2.0.0.

I proceeded to compare my results with the Ersilia model eos4se9:

| SMILES | STOUT IUPAC NAME | ERSILIA IUPAC NAME |
| --- | --- | --- |
| `Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1` | [(1S,4R)-4-[6-(cyclopropylamino)-2-(methylamino)-6H-purin-9-yl]cyclopent-2-en-1-yl]methanol | [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol |
| `C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5` | (3S,6aS,6bR,10aR,10bS)-6a,10a-dimethyl-7-pyridin-3-yl-1,2,3,4,6,7,10,10b-octahydrobenzo[a]azulen-3-ol | (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol |
| `CC(=O)Nc1sc(nn1)[S](N)(=O)=O` | N-[5-[amino(dioxo)-λ6-sulfanyl]-1,3,4-thiadiazol-2-yl]acetamide | N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide |
| `CC(O)=O` | aceticacid | aceticacid |
| `CC(=O)N[C@@H](CS)C(O)=O` | (2R)-2-acetamido-3-sulfanylpropanoicacid | (2R)-2-acetamido-3-sulfanylpropanoicacid |
| `CC(=O)Oc1ccccc1C(O)=O` | 2-acetyloxybenzoicacid | 2-acetyloxybenzoicacid |
| `NC1=NC(=O)c2ncn(COCCO)c2N1` | 9-(2-hydroxyethoxymethyl)-2-(methylamino)-3H-purin-6-one | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one |

From the table above, only three IUPAC names from the STOUT model matched the Ersilia model's results: the first three SMILES and the last one got different IUPAC names from the two models, while the middle three got the same name.

maureen-mugo commented 10 months ago

WEEK 3

Task 11: Suggest a new model and document it

Model Name
LIMO: Latent Inceptionism for Targeted Molecule Generation

Model Publication
https://arxiv.org/abs/2206.09010
pdf: https://proceedings.mlr.press/v162/eckmann22a/eckmann22a.pdf

Source Code
https://github.com/Rose-STL-Lab/LIMO

Dataset
ZINC250k dataset

Licence
None

Model Description

LIMO, or Latent Inceptionism for Targeted Molecule Generation, is a generative model that can be used to generate drug-like molecules with high binding affinity to target proteins. It is based on the idea of using inceptionism to optimize the latent space of a variational autoencoder (VAE).

Inceptionism is a technique that is commonly used in image processing to generate images that are similar to a given input image, but with enhanced features. LIMO uses a similar approach to optimize the latent space of a VAE, but instead of generating images, it generates molecules.

The LIMO model works as follows:

  1. It is trained on a dataset of known drug-like molecules.
  2. It uses a neural network to predict the binding affinity of a molecule to a target protein.
  3. It uses inceptionism to optimize the latent space of the VAE, so that it is more likely to generate molecules with high binding affinity to the target protein.
  4. It generates new molecules by sampling from the latent space of the VAE.

The LIMO model has been shown to outperform state-of-the-art methods on the task of generating drug-like compounds with high binding affinity to target proteins. It can also be used to generate molecules with specific properties, such as high water solubility or low toxicity.
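To make the inceptionism step concrete, here is a toy sketch of gradient ascent in a latent space against a differentiable property predictor. The quadratic "predictor", the 2-D latent space, the step size, and the target point are all invented for illustration; they are stand-ins for LIMO's real VAE and neural networks:

```python
# Toy version of LIMO's core idea: treat the property predictor as a
# differentiable function of a latent vector z and climb its gradient.
def property_score(z):
    # Hypothetical surrogate predictor: peaks at z = (1.0, -2.0)
    target = (1.0, -2.0)
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))

def grad(z):
    # Analytic gradient of the toy surrogate above
    target = (1.0, -2.0)
    return [-2.0 * (zi - ti) for zi, ti in zip(z, target)]

z = [0.0, 0.0]          # start from a point in the latent prior
for _ in range(200):    # inceptionism-style optimization loop
    g = grad(z)
    z = [zi + 0.05 * gi for zi, gi in zip(z, g)]

# In LIMO, the optimized latent would then be decoded into a molecule.
print([round(zi, 3) for zi in z])  # converges to [1.0, -2.0]
```

In the real model, the gradient flows through the trained binding-affinity network back into the VAE latent space rather than through a hand-written quadratic.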

Relevance to Ersilia

Generating drug-like molecules with high binding affinity to target proteins is challenging in drug discovery. Existing methods, such as reinforcement learning and deep generative models, can be slow and computationally expensive. The authors propose a method called Latent Inceptionism on Molecules (LIMO) that accelerates molecule generation using a variational autoencoder-generated latent space and neural networks for property prediction. As demonstrated by experimental results and molecular dynamics-based calculations, LIMO outperforms existing techniques in generating drug-like compounds with high binding affinity.

The paper LIMO would be interesting to an Open-Source Initiative with the goal of generating low-cost drugs for global health because it presents a novel method for accelerating the generation of drug-like molecules with high binding affinity. By using Latent Inceptionism on Molecules (LIMO), researchers can generate new drug candidates more efficiently and at lower costs, which aligns with the objective of developing affordable medications. Additionally, LIMO outperforms existing techniques in generating compounds with high binding affinity, making it a promising tool for discovering effective drugs that could benefit underserved populations.

Ersilia's mission involves supporting research related to infectious and neglected diseases, and one of the key aspects of this research is drug discovery. LIMO is a generative model designed to generate drug-like molecules with high binding affinity to target proteins. This aligns perfectly with Ersilia's goal of accelerating research into new treatments and drugs for these diseases.

Language
Python 3.9

Python packages:
torch
pytorch-lightning==1.9.0
selfies
scipy
tqdm

Also install RDKit and Open Babel

DhanshreeA commented 10 months ago

Thank you for all your efforts and the updates @maureen-mugo well done!

maureen-mugo commented 10 months ago

Task 12: Suggest a new model and document it

Model Name
ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation

Model Publication
https://pubmed.ncbi.nlm.nih.gov/37744464/

Source Code
https://github.com/gregory-kyro/ChemSpaceAL

Dataset
ChEMBL 33
GuacaMol v1
MOSES
BindingDB (08–2023)

Licence
MIT licence

Model Description
ChemSpaceAL is an active learning methodology applied to the task of generating protein-specific molecules. Active learning is a machine learning technique that allows a model to learn with fewer training examples by querying the user for labels on the most informative examples. ChemSpaceAL works by:

  1. Pretraining of a Generative Model: The first step in ChemSpaceAL is to pretrain a generative model, such as a GPT-based model, on a large dataset of known protein-specific molecules. During this pretraining, the model learns the rules and structures of these molecules represented as SMILES strings.

  2. Generating Candidate Molecules: Once the generative model is trained, it is used to generate a large number of unique SMILES strings that represent candidate molecules. These candidates are generated based on the learned distribution of protein-specific molecules in chemical space. The goal is to create a diverse set of molecules that are likely to interact with the target protein.

  3. Molecular Descriptor Calculation: Molecular descriptors, which contain information about molecular topology, physical properties, and the presence of functional groups, are calculated for each of the generated molecules. These descriptors provide quantitative information about the chemical properties of the candidates.

  4. Chemical Space Proxy and Clustering: The generated SMILES strings are projected into a chemical space proxy, a representation of the chemical space where molecules exist. K-means clustering is applied to group the generated molecules that share similar properties. This clustering helps organize the candidates into groups based on their chemical characteristics.

  5. Docking and Scoring: A small number of molecules from each cluster are selected and docked with the target protein, such as the HNH domain of Cas9. The docking process evaluates how well the generated molecules interact with the protein target. A heuristic attractive interaction-based scoring function is used to score the protein-ligand complexes.

  6. Integration of Scores and Active Learning Training Set: The scores obtained from the docking and scoring process are then mapped back to the original clusters. Molecules that meet or exceed a specified threshold are included in the active learning (AL) training set. This training set is a subset of molecules selected based on their predicted ability to interact with the protein target.

  7. Model Refinement: The generative model is refined by fine-tuning it using the molecules in the AL training set. This process helps the model improve its ability to generate high-quality protein-specific molecules based on the user's feedback.

  8. Iterative Process: The entire process, from generating candidate molecules to refining the model, is repeated for multiple iterations. Each iteration guides the generation toward regions of chemical space that contain molecules with higher scores, ultimately leading to the production of high-quality protein-specific molecules.
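The iterative loop above can be sketched in a few lines of toy code. Everything here is a stand-in of my own invention, not a real ChemSpaceAL component: the "molecules" are scalars, "clustering" is coarse binning, "docking" is a made-up scoring function, and "fine-tuning" just shifts the generator's sampling mean toward the active-learning set:

```python
# Toy active-learning loop in the spirit of ChemSpaceAL's steps 2-8.
import random

random.seed(0)

def dock_score(x):
    # Hypothetical docking proxy: scores best near x = 2.0
    return -(x - 2.0) ** 2

mean = 0.0                   # the "pretrained generator's" sampling mean
for iteration in range(5):   # active-learning iterations (step 8)
    # Step 2: sample candidate "molecules" from the generator
    candidates = [random.gauss(mean, 1.0) for _ in range(200)]
    # Steps 3-4: "cluster" by coarse binning of the scalar descriptor
    clusters = {}
    for x in candidates:
        clusters.setdefault(round(x), []).append(x)
    # Step 5: dock one representative per cluster
    scored = [(dock_score(members[0]), members[0])
              for members in clusters.values()]
    # Step 6: candidates at or above the median score form the AL set
    threshold = sorted(score for score, _ in scored)[len(scored) // 2]
    al_set = [x for score, x in scored if score >= threshold]
    # Step 7: "fine-tune" the generator toward the AL training set
    mean = sum(al_set) / len(al_set)

print(round(mean, 2))  # the generator drifts toward the high-scoring region
```

Each pass biases sampling toward the region the scoring function favors, which is the qualitative behavior the real method achieves by fine-tuning a GPT-based generator on docked-and-scored molecules.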

ChemSpaceAL can be used for drug discovery by:

  1. Generating new drug candidates that are specifically targeted to different proteins. This can be done by training the model on a dataset of known protein-specific drug targets.
  2. Designing new drugs that are more effective and have fewer side effects. This can be done by training the model on a dataset of known drugs and their side effects.
  3. Identifying new drug targets that are involved in different diseases. This can be done by training the model on a dataset of proteins known to be involved in different diseases.

Relevance to Ersilia

ChemSpaceAL is a promising solution that would capture the interest of an Open-Source Initiative aiming to develop low-cost drugs for global health. This innovative methodology accelerates the generation of drug-like molecules with high binding affinity to target proteins, addressing a critical challenge in drug discovery.

By employing ChemSpaceAL, researchers gain the ability to efficiently produce new drug candidates at reduced costs, aligning perfectly with the goal of creating affordable medications for underserved populations. Furthermore, ChemSpaceAL outperforms existing methods, ensuring that the generated molecules possess high binding affinity, which is crucial for the development of effective drugs. This tool is particularly relevant to Ersilia's mission, which involves supporting research on infectious and neglected diseases, as it offers a powerful means to expedite drug discovery efforts and potentially discover life-saving treatments for these diseases.

Installation
ChemSpaceAL can be installed from PyPI:
pip install ChemSpaceAL

or by cloning its repository:
git clone https://github.com/gregory-kyro/ChemSpaceAL.git

The requirements are here.

maureen-mugo commented 10 months ago

Task 13: Suggest a new model and document it

Model Name

MolGAN: An implicit generative model for small molecular graphs

Model Publication

https://arxiv.org/abs/1805.11973

Source Code

https://github.com/nicola-decao/MolGAN

Dataset

QM9

Licence

MIT licence

Model Description

MolGAN is a model that employs a graph-based approach, directly generating molecular graphs rather than relying on string-based representations like SMILES. MolGAN is based on generative adversarial networks (GANs), a deep learning framework that pits two neural networks against each other: a generator and a discriminator. The generator attempts to create realistic molecular graphs, while the discriminator tries to distinguish between real and generated graphs. Additionally, MolGAN is an implicit generative model, which means that it does not explicitly define the distribution of the data. Instead, it learns a mapping from a latent space to the data space. This mapping allows the model to generate new data samples by sampling from the latent space and then transforming the samples to the data space. MolGAN also uses a reinforcement learning objective to encourage the generation of molecules with specific desired properties. This is done by training a reward network to guide the generator towards producing molecules with specific properties, such as desired chemical characteristics.

MolGAN was evaluated on the QM9 chemical database, and it was shown to outperform other state-of-the-art generative models for molecular graphs such as ORGAN, OR(W)GAN and Naive RL. MolGAN was also able to generate molecules with specific desired properties, such as high solubility and low toxicity.
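As a minimal illustration of MolGAN's combined objective, the sketch below mixes a critic (discriminator) score with a property reward under a weight λ, which is how the paper balances its GAN and reinforcement-learning terms. The critic, reward function, feature dictionary, and all numbers here are invented for this example and bear no relation to the real MolGAN networks:

```python
# Toy combined generator objective in the spirit of MolGAN:
# maximize lam * critic(graph) + (1 - lam) * reward(graph).
def critic(graph):
    # Stand-in discriminator: higher means "looks like a real molecule"
    return 1.0 - abs(graph["ring_count"] - 1) * 0.3

def reward(graph):
    # Stand-in reward network: prefers soluble, non-toxic candidates
    return 0.7 * graph["solubility"] + 0.3 * (1.0 - graph["toxicity"])

def generator_objective(graph, lam=0.5):
    # Weighted mix of adversarial realism and desired properties
    return lam * critic(graph) + (1.0 - lam) * reward(graph)

candidate = {"ring_count": 1, "solubility": 0.8, "toxicity": 0.2}
print(round(generator_objective(candidate), 2))  # 0.5*1.0 + 0.5*0.8 = 0.9
```

In the real model, both the critic and the reward network are learned, and the generator outputs molecular graphs (atom and bond tensors) rather than a feature dictionary.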

Relevance to Ersilia

MolGAN's ability to create molecular graphs with specific desired properties, such as low toxicity and high solubility, enhances the efficiency of drug discovery processes. This aligns closely with Ersilia's mission to advance research into treatments for infectious and neglected diseases. Also, MolGAN's implicit generative model for molecular graphs can significantly empower Ersilia's mission by providing a user-friendly, open-source tool for researchers in low-resourced areas, enabling them to accelerate their work in biochemistry, drug discovery, and infectious disease research.

Installation

The requirements to run MolGAN include:
python>=3.6
tensorflow>=1.7.0
rdkit
numpy
scikit-learn

Instructions on how to install and run the model have been provided here

maureen-mugo commented 10 months ago

@DhanshreeA I would appreciate some feedback so far on my submission.

GemmaTuron commented 10 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!