Week 1 - Get to know the community

[x] Join the communication channels
[x] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!
[x] Install and run Docker!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

Installing Ersilia Model Hub

I installed the Ersilia Model Hub without any issues. To check that the installation worked, I ran ersilia --help and got the following output:

Usage: ersilia [OPTIONS] COMMAND [ARGS]...

  🦠 Welcome to Ersilia! 💊

Options:
  --version      Show the version and exit.
  -v, --verbose  Show logging on terminal when running commands.
  -s, --silent   Do not echo any progress message.
  --help         Show this message and exit.

Commands:
  api      Run API on a served model
  auth     Log in to ersilia to enter contributor mode.
  card     Get model info card
  catalog  List a catalog of models
  clear    Clear ersilia
  close    Close model
  current  Get identifier of current model
  delete   Delete model from local computer
  example  Generate input examples for the model of interest
  fetch    Fetch model from Ersilia Model Hub
  info     Get model information
  run      Run a served model
  sample   Sample inputs and model identifiers
  serve    Serve model
  test     Test a model

Hi @Richiio thanks for the updates. Please proceed to test the simplest model (eos3b5e) as mentioned in the instructions and report your progress/any issues you run into here.

I fetched and served the model eos3b5e, then ran a prediction for the SMILES string "CCCC", using the following commands:

ersilia -v fetch eos3b5e
ersilia serve eos3b5e
ersilia -v api run -i "CCCC"

and got the following output:

{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "mw": 58.123999999999995
    }
}
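As a quick sanity check on this result, the reported molecular weight can be compared with a hand calculation. The sketch below parses the JSON shown above and recomputes the weight of butane (C4H10, the molecule written as "CCCC"). The atomic weights and the helper name molecular_weight are my own assumptions for illustration; they are not part of Ersilia.

```python
import json

# Output reported above for the input "CCCC" (butane, C4H10).
raw = (
    '{"input": {"key": "IJDNQMDRQITEOD-UHFFFAOYSA-N", "input": "CCCC", '
    '"text": "CCCC"}, "output": {"mw": 58.123999999999995}}'
)

# Standard atomic weights rounded to three decimals (assumed values).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}


def molecular_weight(atom_counts):
    """Sum atomic weights for a dict such as {"C": 4, "H": 10}."""
    return sum(ATOMIC_WEIGHT[element] * n for element, n in atom_counts.items())


result = json.loads(raw)
reported_mw = result["output"]["mw"]
expected_mw = molecular_weight({"C": 4, "H": 10})

# The two values should agree to within rounding of the atomic weights.
print(f"reported: {reported_mw:.3f}  hand-computed: {expected_mw:.3f}")
```

If the two numbers agree, the model is at least being served and queried correctly for this input.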

My motivation

Hello everyone, I'm Sarima Chiorlu. I am a machine learning engineer hoping to make contributions in medicine and drug discovery. Ersilia's goal of making drug research easier for researchers in different regions was my main reason for choosing to work with them as an intern. Using AI/ML, I can easily get information on things like protein binding, discover the elements present in a drug molecule and make research faster.

I come from Nigeria, and apart from the debilitating state of our healthcare, we are often faced with disease challenges, whether global, as with COVID-19, or affecting only our country, like the Ebola virus. Our government doesn't invest in the healthcare sector. However, through the work that you do, researchers can now use machine learning to identify proteins and other components in drug substances more easily.

To further explain why I am interested in this project: Nigeria has a high number of people living with sickle cell disease, and what kills them is often the simplest of diseases. Caring for a relative with sickle cell is hard work, because their white blood count is quite low and they are prone to the simplest of diseases. Your work, which invests not only in drug discovery but in drug discovery within these regions, ensures that these people stand a higher chance. I admire how Ersilia ensures that knowledge is not just transferred but also cultivated within these regions, fostering lasting impact. I wish to learn more about drug discovery while working on your project and tasks. I firmly believe that the democratization of scientific tools and data-driven insights is necessary for addressing some of the world's most pressing challenges.

I just wish to use my little knowledge to contribute to this vision and goal. I look forward to contributing to this project during my internship and also outside of the internship.

First contribution just made

I'm open to helping test some models where necessary.

Hi @Richiio, thanks for the detailed updates. I do not see your contribution on the Outreachy website. Please update that as well. Afterwards, you can go ahead and start looking at Week 2 tasks.

Hi @DhanshreeA, it's been updated. I was still working on putting it together. For week 2, I have decided to work on the SMILES-to-IUPAC translator, which can be found here

Thanks for the updates @Richiio

Smiles to IUPAC Translator

Brief Background

SMILES stands for Simplified Molecular Input Line Entry System. It is widely used in drug discovery and chemistry to represent chemical structures and molecules as text strings.

IUPAC names are our good old way of naming organic compounds (think back to organic chemistry in secondary school; most of us hated the naming, we know). The name takes into account the bonds present in the compound. More generally, it helps researchers categorize and analyze drugs within specific chemical classes, which can be important for understanding their properties and potential applications.

To begin, I worked in Google Colab: the setup is straightforward, and many dependencies are handled for you. I first ran the command:

!pip install STOUT-pypi

This installs the STOUT-pypi package into the Colab runtime. Note that after running this command you have to restart the runtime, so that the old versions of any dependencies that were replaced are no longer loaded; otherwise you will run into errors later on. Next, to test my installation, I ran the following code:

from STOUT import translate_forward, translate_reverse

# SMILES to IUPAC name translation

SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)

# IUPAC name to SMILES translation

IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)

And got the following Output:

Downloading trained model to /root/.data/STOUT-V2/models
/root/.data/STOUT-V2/models.zip
... done downloading trained model!
IUPAC name of CN1C=NC2=C1C(=O)N(C(=O)N2C)C is: 1,3,7-trimethylpurine-2,6-dione
SMILES of 1,3,7-trimethylpurine-2,6-dione is: CN1C=NC2=C1C(=O)N(C)C(=O)N2C

This showed that everything was working as it should.

Next, I used the dataset provided in the internship handbook, the Essential Medicines List, which can be found here

To get started, I saved the dataset to my local machine. This can be done in two ways:

Click the link to open the file, then right-click on it and choose Save As. Provide a name and click Save (this is the easiest approach), or
Download it directly from the ersilia repo: navigate to ersilia-os, then ersilia, then notebooks. Open the dataset there and click the download button in the top-right area.

I began by running predictions with the original model. I had wanted to add new columns holding the smiles_to_IUPAC and can_smiles_to_IUPAC values for every entry in the dataset. However, because the dataset is large, this ran for a very long time (more than 6 hours) without finishing. I stopped the process and took the first 20 rows of the data instead. This time, I saved the smiles_to_IUPAC values in a new dataset called smiles and the can_smiles_to_IUPAC values in a new dataset called can_smiles. The code I used is below:

import pandas as pd
from STOUT import translate_forward

# Load the Essential Medicines List from the CSV file
df = pd.read_csv('/content/eml_canonical.csv')

# Keep only the first 20 rows (the full dataset took too long to process)
df = df.head(20)

# Translate SMILES to IUPAC names, collecting one record per drug.
# DataFrame.append was removed in pandas 2.0, so rows are gathered in a list
# and turned into a DataFrame once at the end.
smiles_rows = []
for _, row in df.iterrows():
    iupac_name = translate_forward(row['smiles'])
    smiles_rows.append({'drugs': row['drugs'], 'smiles': row['smiles'], 'iupac_name': iupac_name})
smiles_IUPAC_name = pd.DataFrame(smiles_rows, columns=['drugs', 'smiles', 'iupac_name'])

# Translate canonical SMILES to IUPAC names in the same way
can_smiles_rows = []
for _, row in df.iterrows():
    iupac_name = translate_forward(row['can_smiles'])
    can_smiles_rows.append({'drugs': row['drugs'], 'can_smiles': row['can_smiles'], 'iupac_name': iupac_name})
can_smiles_IUPAC_name = pd.DataFrame(can_smiles_rows, columns=['drugs', 'can_smiles', 'iupac_name'])

# Save the DataFrames to Excel files
smiles_IUPAC_name.to_excel('smiles.xlsx', index=False)
can_smiles_IUPAC_name.to_excel('can_smiles.xlsx', index=False)
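
Since the full dataset ran for more than 6 hours, one option is to avoid translating the same string twice and to keep partial results so that an interrupted run is not lost. The sketch below is only an illustration of that idea: translate_unique and fake_translate are hypothetical names of my own, and the stand-in translator is there so the example runs without STOUT installed. In a real run you would pass translate_forward instead.

```python
def translate_unique(smiles_list, translate, cache=None):
    """Translate each SMILES once, reusing results stored in `cache`.

    `translate` is any callable taking a SMILES string and returning a name.
    `cache` is a dict that can be kept between calls to resume a long run.
    """
    cache = {} if cache is None else cache
    names = []
    for smi in smiles_list:
        if smi not in cache:
            try:
                cache[smi] = translate(smi)
            except Exception:
                # Record the failure and keep going instead of losing the run.
                cache[smi] = None
        names.append(cache[smi])
    return names


# Stand-in translator used only for this demonstration (not STOUT).
def fake_translate(smiles):
    return "name-of-" + smiles


cache = {}
first = translate_unique(["CCCC", "CC(O)=O", "CCCC"], fake_translate, cache)
print(first)       # the repeated "CCCC" is translated only once
print(len(cache))  # number of distinct strings actually translated
```

The cache dictionary could also be written to disk between sessions (for example with json.dump), so that a restarted notebook picks up where it left off.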

The output files generated can be found below: can_smiles.xlsx smiles.xlsx

To compare the results with the Ersilia Model Hub implementation, I first had to identify the model's identifier in Ersilia's repo. Initially I tried to fetch model eos5ecc, but the fetch failed midway with errors. The log can be found below:

first_model_log.txt

After some time reading through the log and the error message, I realised that the error occurred because the metadata.json file was empty; this file needs to have entries. I considered raising an issue for this but decided to first check whether there was a second implementation of the model with its metadata filled in. I found the eos4se9 model, which shows more activity history than eos5ecc and has its metadata filled in. I tried fetching it and it completed without any issue.

@DhanshreeA would it be better to close model eos5ecc to prevent further confusion in the future? Just my suggestion though :)

After I had successfully fetched and served the model, I began running predictions on my Essential Medicines List dataset. Before that, I created a folder named data inside my cloned ersilia repository and copied in the two datasets generated earlier, can_smiles and smiles. These were saved in Microsoft Excel format, which had to be changed because Ersilia only supports comma-separated (.csv), .json, .tsv or .hdf5 files.

I ran these commands to generate the output:

ersilia api run -i smiles.csv -o output.csv
This produced the output below: output.csv - Ersilia's prediction for smiles_to_IUPAC

ersilia api run -i can_smiles.csv -o output2.csv
This produced the output below: output2.csv - Ersilia's prediction for can_smiles_to_IUPAC

Comparing the outputs of the original code and Ersilia's implementation, some names are identical, some are close, and some are different. Let's look at the first 4 SMILES from the Essential Medicines List to see this.

For the smiles, we have:

Drug: abacavir
Original code: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
Ersilia:

{
    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
        "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
    },
    "output": {
        "outcome": [
            "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
        ]
    }
}

Drug: abiraterone
Original code: (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
Ersilia:

{
    "input": {
        "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
        "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
        "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
    },
    "output": {
        "outcome": [
            "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol"
        ]
    }
}

Drug: acetazolamide
Original code: N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
Ersilia:

{
    "input": {
        "key": "BZKPWHYZMXOIDC-UHFFFAOYSA-N",
        "input": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O",
        "text": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O"
    },
    "output": {
        "outcome": [
            "N-[5-[amino(dioxo)-\u03bb6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide"
        ]
    }
}

Drug: acetic acid
Original code: aceticacid
Ersilia:

{
    "input": {
        "key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
        "input": "CC(O)=O",
        "text": "CC(O)=O"
    },
    "output": {
        "outcome": [
            "aceticacid"
        ]
    }
}
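
To put rough numbers on "same", "close" and "different", the two names for each drug can be compared as plain strings. This is a minimal sketch using Python's difflib; the function name compare_names and the 0.8 threshold are arbitrary choices of mine. String similarity is only a proxy here: two names that look alike can still describe different molecules.

```python
from difflib import SequenceMatcher


def compare_names(original, ersilia, threshold=0.8):
    """Label a pair of names as identical, similar or different.

    The ratio comes from difflib.SequenceMatcher and lies between 0 and 1.
    The threshold separating "similar" from "different" is arbitrary.
    """
    ratio = SequenceMatcher(None, original, ersilia).ratio()
    if original == ersilia:
        label = "identical"
    elif ratio >= threshold:
        label = "similar"
    else:
        label = "different"
    return label, ratio


# Two of the pairs reported above.
pairs = {
    "abacavir": (
        "[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol",
        "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol",
    ),
    "acetic acid": ("aceticacid", "aceticacid"),
}

for drug, (original, ersilia) in pairs.items():
    label, ratio = compare_names(original, ersilia)
    print(f"{drug}: {label} (ratio {ratio:.2f})")
```

A stricter comparison would convert both names back to structures and compare those, which is outside the scope of this sketch.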

@DhanshreeA Are the results not meant to be the same consistently?

Install and run docker

This was my first time working with Docker, so I began with the installation.

# Updating package lists
sudo apt update

Next, I installed the packages required to set up the Docker repository

sudo apt install -y apt-transport-https ca-certificates curl software-properties-common

Next, I added Docker's official GPG key

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

Next, I added the Docker repository to APT sources

echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Then I installed Docker Engine

sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io

Start the Docker service

sudo service docker start

I'm on Windows, running Ubuntu 22.04.

To confirm my Docker installation, I ran the command docker ps and got the result shown below:

(base) root@Richio:~# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

This showed that I had no containers running on Docker at that moment. Next, I decided to try pulling a model from Ersilia's Docker Hub

(base) root@Richio:~# docker pull ersiliaos/eos4se9
Using default tag: latest
latest: Pulling from ersiliaos/eos4se9
8b91b88d5577: Pull complete
824416e23423: Pull complete
bbe2c2981082: Pull complete
7b6b68d15a5c: Pull complete
71f8f4db541d: Pull complete
4f4fb700ef54: Pull complete
278266b40c52: Pull complete
4298588a86ad: Pull complete
dddca77c0f59: Pull complete
a113a2030c72: Pull complete
0c8571d61669: Pull complete
Digest: sha256:3c0b4dab7a313bfb33c74b45ca378f7d69b0b9dbaaf843357780180910af31ab
Status: Downloaded newer image for ersiliaos/eos4se9:latest
docker.io/ersiliaos/eos4se9:latest

Then I ran:

(base) root@Richio:~# docker run ersiliaos/eos4se9

I reran the command docker ps and got the following output

(base) root@Richio:~# docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED          STATUS          PORTS     NAMES
d6e3b2b2b5fd   ersiliaos/eos4se9   "sh /root/docker-ent…"   48 seconds ago   Up 13 seconds   80/tcp    pedantic_antonelli
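
If you want to check for the running container from a script rather than by eye, the docker ps table can be parsed by column position. The sketch below is only an illustration: parse_docker_ps is a hypothetical helper of mine, and it assumes the fixed-width layout shown above, where each value starts under its column header.

```python
import re


def parse_docker_ps(text):
    """Turn the text table printed by `docker ps` into a list of dicts.

    Columns are located by where each header name starts, so an empty
    cell (for example no PORTS) does not shift the other values.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0]
    # Header names are separated by two or more spaces and may contain
    # a single space themselves (e.g. "CONTAINER ID").
    columns = [(m.group(0), m.start()) for m in re.finditer(r"\S+(?: \S+)*", header)]
    rows = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(columns):
            end = columns[i + 1][1] if i + 1 < len(columns) else None
            row[name] = line[start:end].strip()
        rows.append(row)
    return rows


# Output copied from the session above.
sample = (
    "CONTAINER ID   IMAGE               COMMAND                  CREATED          STATUS          PORTS     NAMES\n"
    'd6e3b2b2b5fd   ersiliaos/eos4se9   "sh /root/docker-ent…"   48 seconds ago   Up 13 seconds   80/tcp    pedantic_antonelli\n'
)

for container in parse_docker_ps(sample):
    print(container["IMAGE"], "->", container["STATUS"])
```

With only the header line as input, the function returns an empty list, which matches the "no containers running" case seen earlier.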

Smiles2IUPAC translator model Explanation

Taking a look at the publication, the following was extracted: "Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds."

This describes STOUT as a tool for translating chemical notation (SMILES) into human-readable chemical names and vice versa. Translation quality is measured with BLEU, a score that compares the n-grams (short runs of tokens) in each predicted string with those in the reference string. An average BLEU of about 90% therefore means that the predictions overlap heavily with the correct answers; it does not mean that exactly 90% of the translations are fully correct. The Tanimoto similarity index compares the predicted and true molecules themselves, and a value above 0.9 means that the predicted structures are, on average, very close to the true ones, even when the strings do not match exactly. How does it do this? It breaks the input strings down into smaller pieces (tokens), sort of like words in a sentence. Then, it uses a neural network to translate one sequence of tokens into the other.
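
To make the two scores more concrete, here is a small sketch of what they measure. It is not the evaluation code from the publication: tanimoto works on plain Python sets standing in for molecular fingerprints, and ngram_precision is the clipped n-gram precision that BLEU is built from (full BLEU combines several n-gram sizes and adds a brevity penalty). The fingerprint values are made up for illustration.

```python
from collections import Counter


def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity: shared features / all features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ngram_precision(candidate, reference, n=2):
    """Fraction of the candidate's n-grams that also occur in the reference.

    Counts are clipped, so a repeated n-gram is only credited as often as
    it appears in the reference.
    """
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Made-up fingerprints: the two molecules share 3 of 5 distinct features.
print(tanimoto({1, 4, 7, 9}, {1, 4, 7, 12}))  # 0.6

# Character bigrams of a predicted name against a reference name.
print(ngram_precision("aceticacid", "acetic acid", n=2))
```

In the second example, every bigram of the prediction except "ca" also appears in the reference, which is why the score is high even though the strings are not equal.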

Understanding Ersilia's Backend

Ersilia gathers models from research papers, conferences, etc. and stores them as Docker containers, in GitHub repos, or on AWS. GitHub repos were the only option in use until Docker containers were recently added. With the GitHub route, once Ersilia has been installed on your local machine, you can fetch a model and run predictions with it. With Docker, once you have Docker installed, you can pull the model image from Ersilia's Docker Hub using the docker pull command and run the container on your local machine. You can then make predictions the same way you do with the GitHub route. For AWS, my guess is that the models are stored in S3 buckets: once you connect to a bucket, you can pull the model you want to your local machine and make predictions with it.

ImageMol Installation guide

@DhanshreeA mentioned we can use CPU instead of GPU or CUDA

Creating our ImageMol environment and activating it
Step 1: conda create -n imagemol python=3.7.3
Step 2: conda activate imagemol
Step 3: conda install -c rdkit rdkit

Install Torch and Torchvision. We picked the packages compatible with the CPU
Step 4: pip install https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
Step 5: pip install https://download.pytorch.org/whl/cpu/torchvision-0.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
Step 6: pip install torch torchvision torchaudio

Installing the required packages
Step 7: pip install torch-scatter
Step 8: pip install torch-cluster
Step 9: pip install torch-sparse
Step 10: pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.4.0.html
Step 11: mkdir ckpts
Step 12: wget -P ckpts/ https://github.com/HongxinXiang/ImageMol/blob/master/ckpts/pretraining-toy/checkpoints/ImageMol_10.pth.tar (note: this is a GitHub "blob" page URL, so wget may save the HTML page rather than the checkpoint file itself)
Step 13: sudo apt update
Step 14: sudo apt install libxrender1
Step 15: git clone https://github.com/HongxinXiang/ImageMol.git
Step 16: cd ImageMol, run pip install -r requirements.txt, cd datasets, cd finetuning
Step 17: Pick a dataset you want to fine-tune on and download it into the finetuning folder. We will be working with SARS-CoV-2.
Step 18: pip install gdown
Step 19: gdown --id 1UfROoqR_aU6f5xWwxpLoiJnkwyUzDsui  # the id of the Google Drive file, i.e. the value between d/ and /view in the share link
Step 20: mkdir SARS-CoV-2
Step 21: tar -xzvf SARS-CoV-2.tar.gz -C SARS-CoV-2
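
For Step 19, the file id can be pulled out of a share link programmatically instead of by hand. This is a small sketch: drive_file_id is a hypothetical helper of mine, and the example URL assumes the usual Google Drive share-link shape with the id from Step 19 placed between d/ and /view.

```python
import re
from typing import Optional


def drive_file_id(url: str) -> Optional[str]:
    """Return the part of a share link between 'd/' and the next '/' or '?'."""
    match = re.search(r"/d/([^/?]+)", url)
    return match.group(1) if match else None


# Assumed link shape; only the id itself comes from Step 19.
share_link = "https://drive.google.com/file/d/1UfROoqR_aU6f5xWwxpLoiJnkwyUzDsui/view"
print(drive_file_id(share_link))
```

The returned value is what gets passed to gdown --id in Step 19.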

Fine-tuning the pretrained model on our dataset
Step 22:

python finetune.py --gpu 0 \
                   --save_finetune_ckpt 1 \
                   --log_dir ./logs/toxcast \
                   --dataroot ./datasets/finetuning/SARS-CoV-2/SARS-CoV-2 \
                   --dataset 3CL_enzymatic_activity \
                   --task_type classification \
                   --resume ./ckpts/ImageMol.pth.tar \
                   --image_aug \
                   --lr 0.5 \
                   --batch 64 \
                   --epochs 20

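You should get a final line similar to the one below. Note that in my run the log reports "no checkpoint found at './ckpts/ImageMol.pth.tar'": the --resume path does not match the file name downloaded in Step 12 (ImageMol_10.pth.tar), so the pretrained weights were not loaded for this run.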

final results: highest_valid: 0.565, final_train: 0.521, final_test: 0.271
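
The numbers in this summary line appear to match the epoch with the highest validation score in the training log. The sketch below shows one way to find that epoch from the per-epoch dictionaries the script prints. The three lines are copied from my log (epochs 3 to 5); best_epoch is a hypothetical helper of mine, not part of ImageMol.

```python
import ast

# Per-epoch lines copied from the training log (epochs 3, 4 and 5).
log_lines = [
    "{'epoch': 3, 'patience': 2, 'Loss': 1309.99697265625, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}",
    "{'epoch': 4, 'patience': 3, 'Loss': 351.3359375, 'Train': 0.5211365211365212, 'Validation': 0.5648148148148149, 'Test': 0.2708333333333333}",
    "{'epoch': 5, 'patience': 0, 'Loss': 466.24150390625, 'Train': 0.49774774774774777, 'Validation': 0.5, 'Test': 0.5}",
]


def best_epoch(lines):
    """Parse the dict-style lines and return the record with the highest validation score."""
    records = [ast.literal_eval(line) for line in lines if line.lstrip().startswith("{")]
    return max(records, key=lambda record: record["Validation"])


best = best_epoch(log_lines)
print(
    f"epoch {best['epoch']}: valid {best['Validation']:.3f}, "
    f"train {best['Train']:.3f}, test {best['Test']:.3f}"
)
```

Progress-bar lines are skipped because they do not start with a brace, so a whole log file can be fed in line by line.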

The full output is here:

(imagemol) root@Richio:~/ImageMol# python finetune.py --gpu 0 \
                   --save_finetune_ckpt 1 \
                   --log_dir ./logs/toxcast \
                   --dataroot ./datasets/finetuning/SARS-CoV-2/SARS-CoV-2 \
                   --dataset 3CL_enzymatic_activity\
                   --task_type classification \
                   --resume ./ckpts/ImageMol.pth.tar \
                   --image_aug \
                   --lr 0.5 \
                   --batch 64 \
                   --epochs 20
Warning: There's no GPU available on this machine, training will be performed on CPU.
Architecture: ResNet18
eval_metric: rocauc
/root/miniconda3/envs/imagemol/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and may be removed in the future, "
/root/miniconda3/envs/imagemol/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
=> no checkpoint found at './ckpts/ImageMol.pth.tar'
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1, bias=True)
)
params: {'total_params': 11177025, 'total_trainable_params': 11177025}
[train epoch 0] loss: 10.069: 100%|███████████████████████████████████████████████████████| 5/5 [03:26<00:00, 41.39s/it]
[valid epoch 0] loss: 1774764032.000: 100%|███████████████████████████████████████████████| 5/5 [00:15<00:00,  3.03s/it]
[valid epoch 0] loss: 1882428160.000: 100%|███████████████████████████████████████████████| 1/1 [00:03<00:00,  3.36s/it]
[valid epoch 0] loss: 708053952.000: 100%|████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.81s/it]
{'epoch': 0, 'patience': 0, 'Loss': 1774764032.0, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 1] loss: 14.596: 100%|███████████████████████████████████████████████████████| 5/5 [01:43<00:00, 20.72s/it]
[valid epoch 1] loss: 12880836.800: 100%|█████████████████████████████████████████████████| 5/5 [00:18<00:00,  3.68s/it]
[valid epoch 1] loss: 11979074.000: 100%|█████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.99s/it]
[valid epoch 1] loss: 4480956.000: 100%|██████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.98s/it]
{'epoch': 1, 'patience': 0, 'Loss': 12880836.8, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 2] loss: 1.059: 100%|████████████████████████████████████████████████████████| 5/5 [00:35<00:00,  7.05s/it]
[valid epoch 2] loss: 122073.475: 100%|███████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.29s/it]
[valid epoch 2] loss: 99631.984: 100%|████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.05s/it]
[valid epoch 2] loss: 34290.859: 100%|████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.93s/it]
{'epoch': 2, 'patience': 1, 'Loss': 122073.475, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 3] loss: 0.584: 100%|████████████████████████████████████████████████████████| 5/5 [00:40<00:00,  8.14s/it]
[valid epoch 3] loss: 1309.997: 100%|█████████████████████████████████████████████████████| 5/5 [00:12<00:00,  2.50s/it]
[valid epoch 3] loss: 304.833: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.94s/it]
[valid epoch 3] loss: 94.451: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.82s/it]
{'epoch': 3, 'patience': 2, 'Loss': 1309.99697265625, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 4] loss: 0.478: 100%|████████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.35s/it]
[valid epoch 4] loss: 351.336: 100%|██████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.15s/it]
[valid epoch 4] loss: 2.393: 100%|████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]
[valid epoch 4] loss: 1.401: 100%|████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.68s/it]
{'epoch': 4, 'patience': 3, 'Loss': 351.3359375, 'Train': 0.5211365211365212, 'Validation': 0.5648148148148149, 'Test': 0.2708333333333333}
[train epoch 5] loss: 0.538: 100%|████████████████████████████████████████████████████████| 5/5 [00:39<00:00,  7.86s/it]
[valid epoch 5] loss: 466.242: 100%|██████████████████████████████████████████████████████| 5/5 [00:12<00:00,  2.47s/it]
[valid epoch 5] loss: 835.374: 100%|██████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.07s/it]
[valid epoch 5] loss: 426.533: 100%|██████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.08s/it]
{'epoch': 5, 'patience': 0, 'Loss': 466.24150390625, 'Train': 0.49774774774774777, 'Validation': 0.5, 'Test': 0.5}
[train epoch 6] loss: 0.589: 100%|████████████████████████████████████████████████████████| 5/5 [00:35<00:00,  7.05s/it]
[valid epoch 6] loss: 672.822: 100%|██████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.23s/it]
[valid epoch 6] loss: 1352.755: 100%|█████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.90s/it]
[valid epoch 6] loss: 629.823: 100%|██████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.05s/it]
{'epoch': 6, 'patience': 1, 'Loss': 672.82197265625, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 7] loss: 0.537: 100%|████████████████████████████████████████████████████████| 5/5 [00:33<00:00,  6.79s/it]
[valid epoch 7] loss: 442.793: 100%|██████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.27s/it]
[valid epoch 7] loss: 890.216: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.81s/it]
[valid epoch 7] loss: 417.324: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.62s/it]
{'epoch': 7, 'patience': 2, 'Loss': 442.793310546875, 'Train': 0.49774774774774777, 'Validation': 0.5, 'Test': 0.5}
[train epoch 8] loss: 0.475: 100%|████████████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.43s/it]
[valid epoch 8] loss: 263.574: 100%|██████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.20s/it]
[valid epoch 8] loss: 549.651: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.68s/it]
[valid epoch 8] loss: 269.386: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.75s/it]
{'epoch': 8, 'patience': 3, 'Loss': 263.57353515625, 'Train': 0.48873873873873874, 'Validation': 0.5, 'Test': 0.5}
[train epoch 9] loss: 0.489: 100%|████████████████████████████████████████████████████████| 5/5 [00:33<00:00,  6.64s/it]
[valid epoch 9] loss: 117.936: 100%|██████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.23s/it]
[valid epoch 9] loss: 245.917: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.74s/it]
[valid epoch 9] loss: 132.232: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.80s/it]
{'epoch': 9, 'patience': 4, 'Loss': 117.93641357421875, 'Train': 0.4509702009702009, 'Validation': 0.4814814814814815, 'Test': 0.5}
[train epoch 10] loss: 0.505: 100%|███████████████████████████████████████████████████████| 5/5 [00:34<00:00,  6.96s/it]
[valid epoch 10] loss: 29.749: 100%|██████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.21s/it]
[valid epoch 10] loss: 70.138: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.78s/it]
[valid epoch 10] loss: 45.388: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]
{'epoch': 10, 'patience': 5, 'Loss': 29.749237060546875, 'Train': 0.4222106722106722, 'Validation': 0.48842592592592593, 'Test': 0.359375}
[train epoch 11] loss: 0.497: 100%|███████████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.56s/it]
[valid epoch 11] loss: 3.561: 100%|███████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.15s/it]
[valid epoch 11] loss: 15.333: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]
[valid epoch 11] loss: 11.147: 100%|██████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.84s/it]
{'epoch': 11, 'patience': 6, 'Loss': 3.5606117248535156, 'Train': 0.46197158697158697, 'Validation': 0.4444444444444444, 'Test': 0.21875}
[train epoch 12] loss: 0.497: 100%|███████████████████████████████████████████████████████| 5/5 [00:33<00:00,  6.60s/it]
[valid epoch 12] loss: 0.636: 100%|███████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.19s/it]
[valid epoch 12] loss: 2.595: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.71s/it]
[valid epoch 12] loss: 2.038: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.73s/it]
{'epoch': 12, 'patience': 7, 'Loss': 0.6362054824829102, 'Train': 0.4875693000693001, 'Validation': 0.3819444444444444, 'Test': 0.23958333333333331}
[train epoch 13] loss: 0.457: 100%|███████████████████████████████████████████████████████| 5/5 [00:33<00:00,  6.64s/it]
[valid epoch 13] loss: 0.488: 100%|███████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.07s/it]
[valid epoch 13] loss: 0.670: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.64s/it]
[valid epoch 13] loss: 0.460: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.68s/it]
{'epoch': 13, 'patience': 8, 'Loss': 0.4882702350616455, 'Train': 0.5756670131670132, 'Validation': 0.43287037037037035, 'Test': 0.38541666666666663}
[train epoch 14] loss: 0.502: 100%|███████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.38s/it]
[valid epoch 14] loss: 0.447: 100%|███████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.21s/it]
[valid epoch 14] loss: 0.558: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.72s/it]
[valid epoch 14] loss: 0.317: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.75s/it]
{'epoch': 14, 'patience': 9, 'Loss': 0.44681572914123535, 'Train': 0.4993069993069993, 'Validation': 0.36342592592592593, 'Test': 0.3541666666666667}
[train epoch 15] loss: 0.506: 100%|███████████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.42s/it]
[valid epoch 15] loss: 0.514: 100%|███████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.17s/it]
[valid epoch 15] loss: 0.539: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.98s/it]
[valid epoch 15] loss: 0.344: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.67s/it]
{'epoch': 15, 'patience': 10, 'Loss': 0.5143370151519775, 'Train': 0.5172383922383922, 'Validation': 0.4745370370370371, 'Test': 0.35416666666666663}
[train epoch 16] loss: 0.481: 100%|███████████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.48s/it]
[valid epoch 16] loss: 0.508: 100%|███████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.02s/it]
[valid epoch 16] loss: 0.538: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.76s/it]
[valid epoch 16] loss: 0.372: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.76s/it]
{'epoch': 16, 'patience': 11, 'Loss': 0.507773494720459, 'Train': 0.48128898128898123, 'Validation': 0.4745370370370371, 'Test': 0.35416666666666663}
[train epoch 17] loss: 0.474: 100%|███████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.38s/it]
[valid epoch 17] loss: 0.484: 100%|███████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.05s/it]
[valid epoch 17] loss: 0.548: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.95s/it]
[valid epoch 17] loss: 0.323: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]
{'epoch': 17, 'patience': 12, 'Loss': 0.4842677116394043, 'Train': 0.46664934164934163, 'Validation': 0.45833333333333337, 'Test': 0.35416666666666663}
[train epoch 18] loss: 0.460: 100%|███████████████████████████████████████████████████████| 5/5 [00:30<00:00,  6.12s/it]
[valid epoch 18] loss: 0.472: 100%|███████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.11s/it]
[valid epoch 18] loss: 0.559: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.71s/it]
[valid epoch 18] loss: 0.312: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.68s/it]
{'epoch': 18, 'patience': 13, 'Loss': 0.47157673835754393, 'Train': 0.5103950103950103, 'Validation': 0.5, 'Test': 0.5}
[train epoch 19] loss: 0.499: 100%|███████████████████████████████████████████████████████| 5/5 [00:33<00:00,  6.68s/it]
[valid epoch 19] loss: 0.508: 100%|███████████████████████████████████████████████████████| 5/5 [00:12<00:00,  2.41s/it]
[valid epoch 19] loss: 0.549: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.98s/it]
[valid epoch 19] loss: 0.321: 100%|███████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.80s/it]
{'epoch': 19, 'patience': 14, 'Loss': 0.5075236320495605, 'Train': 0.5150294525294525, 'Validation': 0.5, 'Test': 0.5}
final results: highest_valid: 0.565, final_train: 0.521, final_test: 0.271

849

I forgot to add: you'll have to edit the finetune.py file, changing model=model.cuda() to model=model, because we are not using CUDA.

To compare the model with Ersilia's Hub implementation, I first had to identify the model name in Ersilia's repo. Initially I tried to fetch model eos5ecc. The fetch started but ran into errors midway. The log can be found below:

first_model_log.txt

After some time looking through the log and reading the error message, I realised that the error occurred because the metadata.json file was empty; this file needs to have entries. I considered raising an issue for this but decided to check first whether there was a second implementation of the model with its metadata filled in. I discovered the eos4se9 model, which showed more activity history than eos5ecc and had its metadata filled in. I tried fetching it and it completed without any issue.

@DhanshreeA would it be better to close model eos5ecc to prevent further confusion in the future? Just my suggestion though :)

Hi @Richiio thank you for your suggestion. I should point out that the two models are different in that they do the opposite of each other. eos4se9 takes SMILES inputs and generates IUPAC names (text outputs), whereas eos5ecc takes in IUPAC names (text inputs) and generates SMILES. Incorporating text inputs within Ersilia is still very much WIP hence this model is in the backlog to be implemented.

@DhanshreeA Are the results not meant to be the same consistently?

@Richiio could you report which version of STOUT you are using? Ersilia is currently using 2.0.1. However the latest version of STOUT seems to be 2.0.6. If the base model within this package has been updated (whether in terms of its architecture or the training dataset), it is possible that there are differences across the two versions. We will have to see which is more accurate.

@Richiio I also noticed that you are working with models on Google Colab, could you mention why? Were there issues running them on your machine?

Thanks @DhanshreeA for the updates! I did encounter two challenges when installing the model, but they occurred when I was working outside the conda environment. Everything worked fine within the conda environment apart from the error OSError: [Errno 0] JVM DLL not found: Define/path/or/set/JAVA_HOME/variable/properly, which you would encounter if you do not have Java installed and added to your environment path and if you decide to use conda install -c decimer stout-pypi. I created the STOUT environment and installed STOUT-pypi using pip instead of conda.

I resolved the JVM error by installing Java with sudo apt install openjdk-8-jre and setting the JAVA_HOME environment variable. To do that, I ran nano ~/.bashrc, which opened the shell profile file in a text editor, and added the line export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 at the end of the file. I saved the file and applied the changes with source ~/.bashrc
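Before importing STOUT, it can help to confirm that JAVA_HOME is actually visible to Python. The following is a minimal sketch using only the standard library; the helper name check_java_home is hypothetical and not part of STOUT or Ersilia.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to try out without touching real variables.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not Path(java_home).is_dir():
        return False, f"JAVA_HOME points to a missing directory: {java_home}"
    return True, f"JAVA_HOME = {java_home}"


if __name__ == "__main__":
    ok, message = check_java_home()
    print(message)
```

If the message says the variable is not set even after editing ~/.bashrc, the change has probably not been applied to the current shell yet (source ~/.bashrc or open a new terminal).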

I created a file named sarima.py inside a new folder, downloaded the dataset to the same location as the Python file, and ran nano sarima.py. This opened the editor, and I pasted the following code:

import pandas as pd
from STOUT import translate_forward

# Loading data from the CSV file (using a relative path)
csv_file_path = 'eml_canonical.csv'
df = pd.read_csv(csv_file_path)

# Selecting the first 20 rows
df = df.head(20)

# Create empty DataFrames for SMILES-to-IUPAC and CAN-SMILES-to-IUPAC translations
smiles_IUPAC_name = pd.DataFrame(columns=['drugs', 'smiles', 'iupac_name'])
can_smiles_IUPAC_name = pd.DataFrame(columns=['drugs', 'can_smiles', 'iupac_name'])

# Translate SMILES to IUPAC and concatenate with smiles_IUPAC_name
for index, row in df.iterrows():
    iupac_name = translate_forward(row['smiles'])
    smiles_IUPAC_name = pd.concat([smiles_IUPAC_name, pd.DataFrame({'drugs': [row['drugs']], 'smiles': [row['smiles']], 'iupac_name': [iupac_name]})], ignore_index=True)

# Translate CAN-SMILES to IUPAC and concatenate with can_smiles_IUPAC_name
for index, row in df.iterrows():
    iupac_name = translate_forward(row['can_smiles'])
    can_smiles_IUPAC_name = pd.concat([can_smiles_IUPAC_name, pd.DataFrame({'drugs': [row['drugs']], 'can_smiles': [row['can_smiles']], 'iupac_name': [iupac_name]})], ignore_index=True)

# Save the DataFrames to CSV files
smiles_IUPAC_name.to_csv('smiles.csv', index=False)
can_smiles_IUPAC_name.to_csv('can_smiles.csv', index=False)

I saved the file and exited the editor. I then installed pandas and ran python sarima.py, which produced the following output files: can_smiles.csv smiles.csv

I decided to work with Colab as it was a lot more straightforward and faster.

Even with the above installation, the output was different from that of Ersilia's Model Hub.

Model Version

(STOUT) root@Richio:~/Test#  pip show stout-pypi
Name: STOUT-pypi
Version: 2.0.5
Summary: STOUT V2.0 - Smiles TO iUpac Translator Version 2.0
Home-page: https://github.com/Kohulan/Smiles-TO-iUpac-Translator
Author: Kohulan Rajan
Author-email: kohulan.rajan@uni-jena.de
License: MIT
Location: /root/miniconda3/envs/STOUT/lib/python3.8/site-packages
Requires: jpype1, pystow, tensorflow, unicodedata2
Required-by:

Ersilia's version:

(ersilia) root@Richio:~/Test# conda list
# packages in environment at /root/miniconda3/envs/ersilia:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
alembic                   1.12.0                   pypi_0    pypi
attrs                     21.4.0                   pypi_0    pypi
bentoml                   0.11.0                   pypi_0    pypi
blinker                   1.6.2                    pypi_0    pypi
boto3                     1.28.61                  pypi_0    pypi
botocore                  1.31.61                  pypi_0    pypi
bzip2                     1.0.8                h7b6447c_0
ca-certificates           2023.7.22            hbcca054_0    conda-forge
cerberus                  1.3.5                    pypi_0    pypi
certifi                   2023.7.22                pypi_0    pypi
chardet                   5.2.0                    pypi_0    pypi
charset-normalizer        3.3.0                    pypi_0    pypi
chembl-webresource-client 0.10.8                   pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
docker                    6.1.3                    pypi_0    pypi
dockerfile-parse          2.0.1                    pypi_0    pypi
easydict                  1.10                     pypi_0    pypi
emoji                     2.8.0                    pypi_0    pypi
ersilia                   0.1.27                   pypi_0    pypi

Thanks for the clarification!

@Richiio, thank you for the progress. Kindly note that with your command, you're checking the package versions in the ersilia environment. For model-specific packages, check the Dockerfile of that particular model: navigate to the repository of that model in the ersilia-os organization on GitHub. For example, for model eos4se9, the package versions are specified in the Dockerfile.

You will notice that Ersilia's model uses STOUT version 2.0.1. So for comparison, try installing the same version using the command pip install STOUT-pypi==2.0.1
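The check described above (read the pinned version from the model's Dockerfile, then compare it with what is installed locally) could be sketched as follows. This is only an illustration: the Dockerfile text is hypothetical apart from the STOUT-pypi==2.0.1 pin mentioned in this thread, and the helper names pinned_versions and installed_version are made up for the example.

```python
import re
from importlib import metadata

# Hypothetical Dockerfile content, for illustration only. The real file
# lives in the model's repository; only the STOUT pin is taken from the thread.
EXAMPLE_DOCKERFILE = """\
FROM python:3.8
RUN pip install some-unpinned-package
RUN pip install STOUT-pypi==2.0.1
"""

# Matches "name==version" pairs such as "STOUT-pypi==2.0.1".
PIN_PATTERN = re.compile(r"([A-Za-z0-9_.\-]+)==([A-Za-z0-9_.\-]+)")


def pinned_versions(dockerfile_text):
    """Collect {package_name_lowercased: version} from pip install lines."""
    pins = {}
    for line in dockerfile_text.splitlines():
        if "pip install" not in line:
            continue
        for name, version in PIN_PATTERN.findall(line):
            pins[name.lower()] = version
    return pins


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    wanted = pinned_versions(EXAMPLE_DOCKERFILE).get("stout-pypi")
    found = installed_version("STOUT-pypi")
    print(f"pinned: {wanted}, installed: {found}")
```

Run inside the activated conda environment, this prints the pinned and installed versions side by side, which makes a version mismatch easy to spot before comparing predictions.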

@HellenNamulinda Thanks for the correction! I created a new STOUT environment called STOUT2, activated it, ran pip install STOUT-pypi==2.0.1, downloaded the dataset and ran my Python script. The number of correctly predicted SMILES-to-IUPAC translations improved, but there were still some incorrect predictions. To confirm which predictions were right and which were wrong, I went to PubChem and searched for each drug by its SMILES, which showed the correct IUPAC name, as seen here. The PubChem result corresponded more closely with the output of the original model (version 2.0.5). Version 2.0.1 of the original model was also not 100% correct, although most predictions were correct.

drugs | smiles | iupac_name
-- | -- | --
abacavir | Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 | [(1S,4R)-4-[6-(cyclopropylamino)-2-(methylamino)-6H-purin-9-yl]cyclopent-2-en-1-yl]methanol
abiraterone | C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | (3S,6aS,6bR,10aR,10bS)-6a,10a-dimethyl-7-pyridin-3-yl-1,2,3,4,6,7,10,10b-octahydrobenzo[a]azulen-3-ol
acetazolamide | CC(=O)Nc1sc(nn1)[S](N)(=O)=O | N-[5-[amino(dioxo)-λ6-sulfanyl]-1,3,4-thiadiazol-2-yl]acetamide
acetic acid | CC(O)=O | aceticacid
acetylcysteine | CC(=O)N[C@@H](CS)C(O)=O | (2R)-2-acetamido-3-sulfanylpropanoicacid
acetylsalicylic acid | CC(=O)Oc1ccccc1C(O)=O | 2-acetyloxybenzoicacid
aciclovir | NC1=NC(=O)c2ncn(COCCO)c2N1 | 9-(2-hydroxyethoxymethyl)-2-(methylamino)-3H-purin-6-one
aclidinium | OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | 2-[[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]oxy]-1,1-dithiophen-2-ylethanol
amlodipine | CCOC(=O)C1=C(COCCN)NC(=C(C1c2ccccc2Cl)C(=O)OC)C | 3-O-ethyl5-O-methyl2-(2-aminoethoxymethyl)-4-(6-chlorophenyl)-6-methyl-1,4-dihydropyridine-3,5-dicarboxylate
amodiaquine | CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O | 4-[(7-chloroquinolin-4-yl)amino]-2-(diethylaminomethyl)cyclohexa-1,3,5-trien-1-ol

The output files can_smiles.csv smiles.csv

WEEK 3

Model Suggestions

Model 1

DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

Model Description DiffDock is a model for molecular docking in drug discovery. It takes a fresh approach by treating docking as a generative modeling problem, which helps it achieve better accuracy. DiffDock efficiently explores the possibilities of how small molecules bind to proteins by using a unique diffusion process. DiffDock outperforms traditional docking methods and deep learning techniques, achieving a 38% success rate on the PDBBind dataset, even excelling in computationally folded structures.

Model Identifier Slug: DiffDock

Model Characteristics
Input: Ligand poses
Task: Molecular docking
Tag: Drug discovery, Protein-ligand interaction
Output: Binding structure, Score

References Source Code Publication

License: MIT License

Model 2

SELFormer: Molecular Representation Learning via SELFIES Language Models

Model description So far we have used SMILES as the input format when predicting the aqueous solubility of compounds. SELFormer is a transformer-based architecture that instead uses SELFIES, a 100% valid, compact, and expressive notation, as input to learn molecular representations for drug discovery and development. You start by converting your SMILES input to SELFIES. Compared to the widely used SMILES notation, SELFIES overcomes various issues, such as non-canonical representations and the inability to capture spatial information. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for various molecular property prediction tasks.

Model Identifier Slug: SELFormer

Model Characteristics
Input: SMILES
Task: Molecular property prediction, drug discovery
Tag: Molecular representation, SELFIES, transformer architecture
Output: Predicted molecular properties

References Source code Publication

Model 3

Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery

Model Description Drug discovery is a costly and time-intensive endeavor, hindered by the rapid growth of chemical databases. To tackle this challenge, Deep Docking (DD) is introduced as a deep learning platform capable of rapidly and accurately docking billions of molecular structures. DD employs quantitative structure-activity relationship (QSAR) models trained on subsets of a chemical library to predict docking outcomes, efficiently removing unfavorable molecules iteratively. When combined with the FRED docking program, DD calculates docking scores for 1.36 billion molecules, achieving remarkable data reduction and enrichment of high-scoring molecules without sacrificing favorable results. DD's flexibility allows integration with any docking program, and its resources are publicly available.

Model Identifier Slug: DeepDock

Model Characteristics
Input: SMILES
Task: Molecular docking
Tag: Drug discovery
Output: Docked SMILES

References Source Code Publication

License: MIT License

Model

DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks

Model description This model aims to address the challenge of predicting compound-protein interactions (CPI) with high applicability, accuracy, and interpretability using sequence data alone. Drug discovery relies on understanding how molecules interact with proteins, and computational methods can accelerate this process. However, predicting compound-protein affinity from sequences has been limited in scope and interpretability. DeepAffinity proposes a way of leveraging both labelled and unlabelled data for encoding molecular representations and predicting affinities. The authors use specialized representations of protein sequences, such as secondary structure predictions, and train a deep learning model that unifies recurrent and convolutional neural networks (CNNs). The model achieves impressive accuracy, with a relative error within 5-fold for test cases and 20-fold for new protein classes. It incorporates separate and joint attention mechanisms for interpretability, allowing it to predict and explain selective drug-target interactions.

Model Identifier Slug: DeepAffinity

Model Characteristics
Input: SMILES
Tag: Compound-Protein interaction, drug discovery
Output: Predicted compound-protein affinity

References Source Code Publication

License: GPL-3.0 License

Hi @Richiio many thanks for your efforts, especially for validating these compounds against PubChem, great effort! This is very helpful for us. As a bonus task (please note that it will not have any effect on your application if you cannot complete it), could you run 2.0.1 (the version that Ersilia uses), 2.0.5, and 2.0.6 on the EML file, and report the results in a CSV? You can keep the columns as (smiles, ver_201, ver_205, ver_206), and if time permits, please add an extra column for what PubChem has to say about these molecules. It will be very useful for us. Again, only if time permits. It's a bonus task, and not required for your application.
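One way to assemble the requested CSV is to merge the per-version outputs on the smiles column. The sketch below uses only the standard library and assumes each per-version file has smiles and iupac_name columns, as produced by the script earlier in this thread; the function names and the file names in the usage comment are hypothetical.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def merge_versions(results_by_version):
    """Merge per-version predictions into one row per SMILES.

    results_by_version maps a column name (e.g. "ver_201") to rows that
    each contain "smiles" and "iupac_name".
    """
    merged = {}  # smiles -> combined row, in first-seen order
    for column, rows in results_by_version.items():
        for row in rows:
            entry = merged.setdefault(row["smiles"], {"smiles": row["smiles"]})
            entry[column] = row["iupac_name"]
    return list(merged.values())


def write_merged(path, merged_rows, columns):
    """Write merged rows; versions missing for a SMILES are left empty."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["smiles", *columns], restval="")
        writer.writeheader()
        writer.writerows(merged_rows)


# Usage (file names are assumptions, one output file per STOUT environment):
# merged = merge_versions({
#     "ver_201": read_rows("smiles_201.csv"),
#     "ver_205": read_rows("smiles_205.csv"),
#     "ver_206": read_rows("smiles_206.csv"),
# })
# write_merged("stout_versions.csv", merged, ["ver_201", "ver_205", "ver_206"])
```

Keying on the SMILES string keeps rows aligned even if one version's file is missing a compound; the PubChem column can then be added by hand or as one more entry in the dictionary.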

@Richiio Additionally, thank you for the model suggestions, you can mark week 3 tasks as completed. :)

Thanks so much @DhanshreeA for getting back. For the extra task, I created a third environment called STOUT3 and installed STOUT-pypi using pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git. To confirm I was using the right version, I ran the pip show command and got the following output:

(STOUT3) root@Richio:~# pip show stout-pypi
Name: STOUT-pypi
Version: 2.0.6
Summary: STOUT V2.0 - Smiles TO iUpac Translator Version 2.0
Home-page: https://github.com/Kohulan/Smiles-TO-iUpac-Translator
Author: Kohulan Rajan
Author-email: kohulan.rajan@uni-jena.de
License: MIT
Location: /root/miniconda3/envs/STOUT3/lib/python3.8/site-packages
Requires: jpype1, pystow, tensorflow, unicodedata2
Required-by:

I created the columns for the various versions, retrieved the CSV file, and then used Excel to add a column for the PubChem reference: I copy-pasted each SMILES into PubChem and recorded the IUPAC name it returned. The final CSV compilation can be found here:

Final Result - smiles (1).csv.csv

The version whose output disagreed with PubChem was 2.0.1; versions 2.0.5 and 2.0.6 (original code) corresponded with the results from PubChem.
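A comparison like this can also be quantified per version. The sketch below is an assumption about how one might do it, not what was actually run: it expects rows with one column per version plus a reference column (called pubchem here, a hypothetical name), and it ignores case and whitespace when comparing names, since some STOUT outputs in this thread drop spaces (for example aceticacid).

```python
def normalize(name):
    """Lowercase a name and remove all whitespace before comparing."""
    return "".join(name.split()).lower()


def agreement(rows, version_columns, reference_column="pubchem"):
    """Return {version_column: fraction of names matching the reference}.

    Rows where either the prediction or the reference is empty are skipped.
    A version with nothing to compare gets None instead of a fraction.
    """
    scores = {}
    for column in version_columns:
        compared = 0
        matches = 0
        for row in rows:
            reference = row.get(reference_column, "")
            predicted = row.get(column, "")
            if not reference or not predicted:
                continue
            compared += 1
            if normalize(predicted) == normalize(reference):
                matches += 1
        scores[column] = matches / compared if compared else None
    return scores


# Usage, assuming rows were loaded with csv.DictReader from the final CSV:
# print(agreement(rows, ["ver_201", "ver_205", "ver_206"]))
```

Exact string matching is strict: PubChem may list a different but equally valid IUPAC name, so a low score is a prompt to inspect the rows by hand rather than proof that a version is wrong.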

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!

ersilia-os / ersilia

✍️ Contribution period: Sarima Chiorlu #828
