Closed Richiio closed 11 months ago
Installing Ersilia Model Hub
I installed the Ersilia Model Hub without any issues. To confirm that the installation worked, I ran the following command and got the output below.
The output for ersilia --help
Usage: ersilia [OPTIONS] COMMAND [ARGS]...
Welcome to Ersilia!
Options:
  -v, --verbose  Show logging on terminal when running commands.
  -s, --silent   Do not echo any progress message.
  --version      Show the version and exit.
  --help         Show this message and exit.
Commands:
  api      Run API on a served model
  auth     Log in to ersilia to enter contributor mode.
  card     Get model info card
  catalog  List a catalog of models
  clear    Clear ersilia
  close    Close model
  current  Get identifier of current model
  delete   Delete model from local computer
  example  Generate input examples for the model of interest
  fetch    Fetch model from Ersilia Model Hub
  info     Get model information
  run      Run a served model
  sample   Sample inputs and model identifiers
  serve    Serve model
  test     Test a model
Hi @Richiio thanks for the updates. Please proceed to test the simplest model (eos3b5e) as mentioned in the instructions and report your progress/any issues you run into here.
Alright @DhanshreeA
I fetched the model eos3b5e, served it, and ran a prediction using the following commands:
ersilia -v fetch eos3b5e
ersilia serve eos3b5e
and
ersilia -v api run -i "CCCC"
and got the following output:
{
"input": {
"key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
"input": "CCCC",
"text": "CCCC"
},
"output": {
"mw": 58.123999999999995
}
}
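As a quick sanity check on that number, the molecular weight of butane (the molecule written as "CCCC") can be worked out by hand from standard atomic weights. The sketch below assumes abridged atomic weights of 12.011 for carbon and 1.008 for hydrogen; the helper name is only for illustration and is not part of Ersilia.

```python
# Abridged standard atomic weights in g/mol (assumed values for this check)
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}


def alkane_molecular_weight(n_carbons):
    """Molecular weight of a linear alkane C_n H_(2n+2) in g/mol."""
    n_hydrogens = 2 * n_carbons + 2
    return n_carbons * ATOMIC_WEIGHT["C"] + n_hydrogens * ATOMIC_WEIGHT["H"]


# "CCCC" is butane: 4 carbons and 10 hydrogens
butane_mw = alkane_molecular_weight(4)
print(round(butane_mw, 3))  # 58.124
```

The hand calculation agrees with the "mw" value returned by the model up to floating-point rounding.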
Hello everyone, I'm Sarima Chiorlu. I am a machine learning engineer hoping to contribute to medicine and drug discovery. Ersilia's goal of making drug research easier for researchers in different regions was my main reason for choosing to work with them as an intern. Using AI/ML, I can quickly get information on things like protein binding, identify the elements present in a drug molecule, and make research faster.
I come from Nigeria and, apart from the debilitating state of our healthcare, we often face disease challenges, whether global, as with COVID-19, or affecting only our country, like the Ebola virus. Our government doesn't invest in the healthcare sector. However, through the work that you do, researchers can now use machine learning to easily identify proteins and other components in drug substances.
To further explain why I am interested in this project: Nigeria has a high number of individuals with sickle cell disease, and what kills them is often the simplest of diseases. Caring for a relative with sickle cell disease is hard work, because their white blood cell count is quite low and they are prone to the simplest of diseases. But your work of investing not only in drug discovery, but in drug discovery within these regions, ensures that these people stand a better chance. I admire how Ersilia ensures that knowledge is not just transferred but also cultivated within these regions, fostering lasting impact. I wish to learn more about drug discovery while working on your project and tasks. I firmly believe that the democratization of scientific tools and data-driven insights is necessary for addressing some of the world's most pressing challenges.
I just wish to use my little knowledge to contribute to this vision and goal. I look forward to contributing to this project during my internship and also outside of the internship.
First contribution just made
I'm open to helping test models where necessary.
Hi @Richiio, thanks for the detailed updates. I do not see your contribution on the Outreachy website. Please update that as well. Afterwards, you can go ahead and start looking at Week 2 tasks.
Hi @DhanshreeA, it's been updated. I was still working on putting it together. For week 2, I have decided to work on the SMILES-to-IUPAC translator, which can be found here
Thanks for the updates @Richiio
Brief Background
SMILES stands for Simplified Molecular Input Line Entry System. It is widely used in drug discovery and chemistry for representing chemical structures and molecules.
IUPAC: our good old way of naming organic compounds (think back to organic chemistry in secondary school; most of us hated the naming, we know). The name takes into account the bonds and groups present in the compound. More generally, it helps researchers categorize and analyze drugs within specific chemical classes, which can be important for understanding their properties and potential applications.
To begin, I worked in Google Colab; the setup is straightforward, and many dependencies are handled for you.
I first ran the command: !pip install STOUT-pypi
This installs STOUT-pypi into the Colab runtime.
Note that after running this command you have to restart the runtime so that the old dependencies that were uninstalled are no longer loaded; otherwise you will run into errors later.
Next, to test my installation, I ran the following code:
from STOUT import translate_forward, translate_reverse
# SMILES to IUPAC name translation
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
# IUPAC name to SMILES translation
IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)
And got the following Output:
Downloading trained model to /root/.data/STOUT-V2/models
/root/.data/STOUT-V2/models.zip
... done downloading trained model!
IUPAC name of CN1C=NC2=C1C(=O)N(C(=O)N2C)C is: 1,3,7-trimethylpurine-2,6-dione
SMILES of 1,3,7-trimethylpurine-2,6-dione is: CN1C=NC2=C1C(=O)N(C)C(=O)N2C
This showed that everything was working as it should.
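Note that the SMILES string returned by the reverse translation is not character-for-character identical to the input, even though both describe the same molecule written in a different atom order. A coarse way to check this without extra dependencies is to compare heavy-atom counts. The sketch below is only a sanity check under stated assumptions: it handles plain organic-subset atoms, ignores bracket atoms, and equal counts are necessary but not sufficient for two strings to be the same molecule. The function name is hypothetical.

```python
import re
from collections import Counter

# Two-letter halogens first, then single-letter organic-subset atoms
# (uppercase aliphatic, lowercase aromatic). Bracket atoms such as [Na+]
# are not handled by this simplified pattern.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles):
    """Count heavy atoms per element in a simple SMILES string."""
    return Counter(m.group(0).capitalize() for m in ATOM_PATTERN.finditer(smiles))


original = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
round_trip = "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"

print(heavy_atom_counts(original))
print(heavy_atom_counts(original) == heavy_atom_counts(round_trip))
```

A stricter comparison would canonicalize both strings with a cheminformatics toolkit before comparing them.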
I used the dataset provided in the internship handbook, the Essential Medicines List, which can be found here
To get started, I saved the dataset to my local machine. This can be achieved in two ways:
I began by running predictions with the original model. I had wanted to create new columns holding the can_smiles_to_IUPAC and smiles_to_IUPAC values for every entry in the dataset. However, because the dataset is large, the run took a very long time (more than 6 hours) without finishing. I stopped the process and instead took the first 20 rows of my data. This time, I saved the smiles_to_IUPAC values in a new dataset called smiles and the can_smiles_to_IUPAC values in a new dataset called can_smiles. The code I used is shown below:
import pandas as pd
from STOUT import translate_forward
# Loading data from the CSV file
df = pd.read_csv('/content/eml_canonical.csv')
# Selecting the first 20 rows
df = df.head(20)
# Translate both SMILES columns to IUPAC names, collecting the rows in lists
# (DataFrame.append was removed in pandas 2.0, so the DataFrames are built once at the end)
smiles_rows = []
can_smiles_rows = []
for _, row in df.iterrows():
    smiles_rows.append({'drugs': row['drugs'], 'smiles': row['smiles'], 'iupac_name': translate_forward(row['smiles'])})
    can_smiles_rows.append({'drugs': row['drugs'], 'can_smiles': row['can_smiles'], 'iupac_name': translate_forward(row['can_smiles'])})
smiles_IUPAC_name = pd.DataFrame(smiles_rows, columns=['drugs', 'smiles', 'iupac_name'])
can_smiles_IUPAC_name = pd.DataFrame(can_smiles_rows, columns=['drugs', 'can_smiles', 'iupac_name'])
# Save the DataFrames to Excel files (to_excel needs an Excel writer such as openpyxl installed)
smiles_IUPAC_name.to_excel('smiles.xlsx', index=False)
can_smiles_IUPAC_name.to_excel('can_smiles.xlsx', index=False)
The output files generated can be found below: can_smiles.xlsx smiles.xlsx
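Since translating the whole list took hours and had to be stopped, one option is to write results to disk as they are produced, so that an interrupted run can resume where it left off. The sketch below is an assumption about how that could be organised, not the code used above: `translate_in_batches`, its parameters, and the output file name are hypothetical, and `translate` stands for any callable such as STOUT's `translate_forward`.

```python
import csv
from pathlib import Path


def translate_in_batches(smiles_list, translate, out_path, batch_size=20):
    """Translate SMILES one by one, appending results to a CSV file.

    SMILES already present in the CSV are skipped, so a stopped run can be
    restarted without repeating finished work. Returns the number of SMILES
    translated in this call.
    """
    out_path = Path(out_path)
    done = set()
    if out_path.exists():
        with out_path.open(newline="") as fh:
            done = {row["smiles"] for row in csv.DictReader(fh)}

    # Keep input order, drop duplicates and anything already translated
    pending = [s for s in dict.fromkeys(smiles_list) if s not in done]
    write_header = not out_path.exists()

    with out_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["smiles", "iupac_name"])
        if write_header:
            writer.writeheader()
        for start in range(0, len(pending), batch_size):
            for smi in pending[start:start + batch_size]:
                writer.writerow({"smiles": smi, "iupac_name": translate(smi)})
            fh.flush()  # make each finished batch durable on disk
    return len(pending)


# Example call (requires STOUT and the DataFrame loaded earlier):
# translate_in_batches(df["smiles"], translate_forward, "smiles_iupac.csv")
```

Calling the function again with the same output file only translates the entries that are still missing.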
To compare the original model with the Ersilia Model Hub implementation, I first had to identify the model in Ersilia's repositories. Initially I tried to fetch model eos5ecc. The fetch started but hit errors midway. The log can be found below:
After some time reading through the log and the error message, I realised that the error occurred because the metadata.json file was empty; this file needs to contain entries. I considered raising an issue for this, but first checked whether there was a second implementation of the model with its metadata filled in. I found the eos4se9 model, which showed more activity history than eos5ecc and had its metadata filled in. I tried fetching it and the fetch completed without any issue.
@DhanshreeA would it be better to close model eos5ecc to prevent confusion in the future? Just my suggestion though :)
After I had successfully fetched and served the model, I began running predictions on my Essential Medicines List data. First, I created a folder named data inside my cloned ersilia repository and copied in the two datasets generated earlier: can_smiles and smiles. These were originally saved as Microsoft Excel spreadsheets, so I had to convert them, because Ersilia only supports comma-separated (.csv), .json, .tsv, or .hdf5 files.
I ran these commands to generate the output:
ersilia api run -i smiles.csv -o output.csv
This produced the output below:
output.csv - Ersilia's prediction for smiles_to_IUPAC
ersilia api run -i can_smiles.csv -o output2.csv
This produced the output below:
output2.csv - Ersilia's prediction for can_smiles_to_IUPAC
Comparing the outputs from the original code with those from Ersilia's model implementation, some names are identical, some are close, and some are quite different. Let's look at the first 4 SMILES from the Essential Medicines List to see the differences.
For the smiles, we have:
Drug: abacavir
Original code: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
Ersilia:
{
"input": {
"key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
"input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
"text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
},
"output": {
"outcome": [
"[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
]
}
}
Drug: abiraterone
Original code: (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
Ersilia:
{
"input": {
"key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
"input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
"text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
},
"output": {
"outcome": [
"(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol"
]
}
}
Drug: acetazolamide
Original code: N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
Ersilia:
{
"input": {
"key": "BZKPWHYZMXOIDC-UHFFFAOYSA-N",
"input": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O",
"text": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O"
},
"output": {
"outcome": [
"N-[5-[amino(dioxo)-\u03bb6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide"
]
}
}
Drug: acetic acid
Original code: aceticacid
Ersilia:
{
"input": {
"key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
"input": "CC(O)=O",
"text": "CC(O)=O"
},
"output": {
"outcome": [
"aceticacid"
]
}
}
@DhanshreeA Are the results not meant to be the same consistently?
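To put numbers on this comparison rather than eyeballing the names, the Ersilia JSON records can be parsed and compared with the original names programmatically. The sketch below uses only the standard library; `compare_names` is a hypothetical helper, and the character-level similarity from `difflib` is just a rough indicator of how far apart two names are, not a chemical similarity measure. The two records are copied from the outputs above.

```python
import json
from difflib import SequenceMatcher


def compare_names(original_name, ersilia_record_json):
    """Compare the original STOUT name with the name in one Ersilia JSON record."""
    record = json.loads(ersilia_record_json)
    ersilia_name = record["output"]["outcome"][0]
    ratio = SequenceMatcher(None, original_name, ersilia_name).ratio()
    return {"identical": original_name == ersilia_name, "similarity": round(ratio, 3)}


acetic_acid_record = '''
{"input": {"key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "input": "CC(O)=O", "text": "CC(O)=O"},
 "output": {"outcome": ["aceticacid"]}}
'''
abacavir_record = '''
{"input": {"key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
           "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
           "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"},
 "output": {"outcome": ["[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"]}}
'''

acetic_result = compare_names("aceticacid", acetic_acid_record)
abacavir_result = compare_names(
    "[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol",
    abacavir_record,
)
print(acetic_result)
print(abacavir_result)
```

Running the same comparison over all 20 rows would show how many names match exactly and how many only come close.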
This was my first time working with Docker, so I began with the installation.
# Updating package lists
sudo apt update
Next, I proceeded with installing the required packages to set up the Docker repository
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
Next, I added Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
Next, I added the Docker repository to APT sources
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
Then I installed Docker Engine
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io
Start the Docker service
sudo service docker start
I'm using Ubuntu 22.04 on Windows.
To confirm my Docker installation, I ran the command docker ps
and got the result shown below:
(base) root@Richio:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
This showed that I had no container running presently on docker. Next, I decided to try pulling a model from Ersilia's
(base) root@Richio:~# docker pull ersiliaos/eos4se9
Using default tag: latest
latest: Pulling from ersiliaos/eos4se9
8b91b88d5577: Pull complete
824416e23423: Pull complete
bbe2c2981082: Pull complete
7b6b68d15a5c: Pull complete
71f8f4db541d: Pull complete
4f4fb700ef54: Pull complete
278266b40c52: Pull complete
4298588a86ad: Pull complete
dddca77c0f59: Pull complete
a113a2030c72: Pull complete
0c8571d61669: Pull complete
Digest: sha256:3c0b4dab7a313bfb33c74b45ca378f7d69b0b9dbaaf843357780180910af31ab
Status: Downloaded newer image for ersiliaos/eos4se9:latest
docker.io/ersiliaos/eos4se9:latest
Then I ran:
(base) root@Richio:~# docker run ersiliaos/eos4se9
I reran the command docker ps
and got the following output
(base) root@Richio:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d6e3b2b2b5fd ersiliaos/eos4se9 "sh /root/docker-ent…" 48 seconds ago Up 13 seconds 80/tcp pedantic_antonelli
Taking a look at the publication, the following was extracted: "Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds."
In other words, STOUT is a tool for translating chemical structure strings (SMILES) into human-readable chemical names and vice versa. To check how well it translates, the authors report a BLEU score, which measures how much of a predicted name overlaps, in short token sequences, with the reference name. An average BLEU of about 90% means the predicted names closely match the reference names; it does not mean that exactly 90% of translations are fully correct. The Tanimoto similarity index compares the molecule described by the prediction with the true molecule; a value above 0.9 means that even when a prediction is wrong, the predicted structure is very similar to the correct one. How does it do this? It breaks the input strings into smaller pieces (tokens), much like words in a sentence, and then uses a neural network to translate them.
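To make the idea of splitting a chemical string into pieces concrete, here is a minimal sketch of a regex-based SMILES tokenizer. It is an illustration only: the thread does not show how STOUT tokenizes its input, so the pattern and the `tokenize_smiles` name are assumptions, using a regular expression of the kind commonly applied to SMILES in machine-translation work.

```python
import re

# Bracket atoms, two-letter halogens, single-letter atoms, branches,
# bond symbols, and ring-closure digits each become one token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize all of {smiles!r}")
    return tokens


# Acetazolamide, using the SMILES from the comparison earlier in this thread
tokens = tokenize_smiles("CC(=O)Nc1sc(nn1)[S](N)(=O)=O")
print(tokens)
```

Each token then plays the role of a word in the input sentence that the translation model reads.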
Ersilia gathers models from research papers, conferences, and similar sources and stores them as Docker containers, in GitHub repos, and on AWS. GitHub repos were the main option until the recent addition of Docker containers. With GitHub repos, once ersilia is installed on your local machine, you can fetch a model and run predictions with it. With Docker, once you have Docker installed, you can pull the model image from Ersilia's DockerHub using the docker pull command and run the container on your local machine; you can then make predictions the same way you do with GitHub. For AWS, my guess is that the models are stored in S3 buckets: once you connect to a bucket, you can pull the model you want to your local machine and make predictions with it.
@DhanshreeA mentioned we can use CPU instead of GPU or CUDA
Creating our ImageMol environment and activating it
Step 1: conda create -n imagemol python=3.7.3
Step 2: conda activate imagemol
Step 3: conda install -c rdkit rdkit
Install Torch and Torchvision. We picked the packages built for CPU
Step 4: pip install https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
Step 5: pip install https://download.pytorch.org/whl/cpu/torchvision-0.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
Step 6: pip install torch torchvision torchaudio
Installing the required packages
Step 7: pip install torch-scatter
Step 8: pip install torch-cluster
Step 9: pip install torch-sparse
Step 10: pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.4.0.html
Step 11: mkdir ckpts
Step 12: wget -P ckpts/ https://github.com/HongxinXiang/ImageMol/blob/master/ckpts/pretraining-toy/checkpoints/ImageMol_10.pth.tar
Step 13: sudo apt update
Step 14: sudo apt install libxrender1
Step 15: git clone https://github.com/HongxinXiang/ImageMol.git
Step 16: cd ImageMol, run pip install -r requirements.txt, then cd datasets and cd finetuning
Step 17: Pick the dataset you want to fine-tune on and download it into the finetuning folder. We will be working with SARS-CoV-2.
Step 18: pip install gdown
Step 19: gdown --id 1UfROoqR_aU6f5xWwxpLoiJnkwyUzDsui # (ID of the Google Drive file: the value between d/ and /view in the share link)
Step 20: mkdir SARS-CoV-2
Step 21: tar -xzvf SARS-CoV-2.tar.gz -C SARS-CoV-2
Fine-tuning the pretrained model on our dataset. Step 22:
python finetune.py --gpu 0 \
--save_finetune_ckpt 1 \
--log_dir ./logs/toxcast \
--dataroot ./datasets/finetuning/SARS-CoV-2/SARS-CoV-2 \
--dataset 3CL_enzymatic_activity \
--task_type classification \
--resume ./ckpts/ImageMol.pth.tar \
--image_aug \
--lr 0.5 \
--batch 64 \
--epochs 20
You should get an output similar to this:
final results: highest_valid: 0.565, final_train: 0.521, final_test: 0.271
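The scores in these results are ROC-AUC values (the run prints eval_metric: rocauc). For reading them, it helps to remember that 0.5 corresponds to chance-level ranking, for example when every sample receives the same score, while 1.0 means every positive is ranked above every negative. The sketch below computes ROC-AUC from scratch by pair counting; it is a plain illustration of the metric with made-up labels and scores, not ImageMol's evaluation code.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 1, 0, 1]  # illustrative labels, not taken from the dataset
print(roc_auc(labels, [0.3, 0.3, 0.3, 0.3]))  # constant scores -> 0.5
print(roc_auc(labels, [0.1, 0.9, 0.2, 0.8]))  # perfect ranking -> 1.0
print(roc_auc(labels, [0.9, 0.1, 0.8, 0.2]))  # inverted ranking -> 0.0
```

Seen this way, a test score of 0.271 means the model ranked positives below negatives more often than not on that split.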
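The full output is here. Note that it reports "=> no checkpoint found at './ckpts/ImageMol.pth.tar'": the --resume path does not match the file downloaded in Step 12 (ImageMol_10.pth.tar), so the checkpoint was not loaded in this run.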
(imagemol) root@Richio:~/ImageMol# python finetune.py --gpu 0 \
--save_finetune_ckpt 1 \
--log_dir ./logs/toxcast \
--dataroot ./datasets/finetuning/SARS-CoV-2/SARS-CoV-2 \
--dataset 3CL_enzymatic_activity\
--task_type classification \
--resume ./ckpts/ImageMol.pth.tar \
--image_aug \
--lr 0.5 \
--batch 64 \
--epochs 20
Warning: There's no GPU available on this machine, training will be performed on CPU.
Architecture: ResNet18
eval_metric: rocauc
/root/miniconda3/envs/imagemol/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
f"The parameter '{pretrained_param}' is deprecated since 0.13 and may be removed in the future, "
/root/miniconda3/envs/imagemol/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
=> no checkpoint found at './ckpts/ImageMol.pth.tar'
ResNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=512, out_features=1, bias=True)
)
params: {'total_params': 11177025, 'total_trainable_params': 11177025}
[train epoch 0] loss: 10.069: 100%|██████████| 5/5 [03:26<00:00, 41.39s/it]
[valid epoch 0] loss: 1774764032.000: 100%|██████████| 5/5 [00:15<00:00, 3.03s/it]
[valid epoch 0] loss: 1882428160.000: 100%|██████████| 1/1 [00:03<00:00, 3.36s/it]
[valid epoch 0] loss: 708053952.000: 100%|██████████| 1/1 [00:01<00:00, 1.81s/it]
{'epoch': 0, 'patience': 0, 'Loss': 1774764032.0, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 1] loss: 14.596: 100%|██████████| 5/5 [01:43<00:00, 20.72s/it]
[valid epoch 1] loss: 12880836.800: 100%|██████████| 5/5 [00:18<00:00, 3.68s/it]
[valid epoch 1] loss: 11979074.000: 100%|██████████| 1/1 [00:02<00:00, 2.99s/it]
[valid epoch 1] loss: 4480956.000: 100%|██████████| 1/1 [00:01<00:00, 1.98s/it]
{'epoch': 1, 'patience': 0, 'Loss': 12880836.8, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 2] loss: 1.059: 100%|██████████| 5/5 [00:35<00:00, 7.05s/it]
[valid epoch 2] loss: 122073.475: 100%|██████████| 5/5 [00:11<00:00, 2.29s/it]
[valid epoch 2] loss: 99631.984: 100%|██████████| 1/1 [00:02<00:00, 2.05s/it]
[valid epoch 2] loss: 34290.859: 100%|██████████| 1/1 [00:01<00:00, 1.93s/it]
{'epoch': 2, 'patience': 1, 'Loss': 122073.475, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 3] loss: 0.584: 100%|██████████| 5/5 [00:40<00:00, 8.14s/it]
[valid epoch 3] loss: 1309.997: 100%|██████████| 5/5 [00:12<00:00, 2.50s/it]
[valid epoch 3] loss: 304.833: 100%|██████████| 1/1 [00:01<00:00, 1.94s/it]
[valid epoch 3] loss: 94.451: 100%|██████████| 1/1 [00:01<00:00, 1.82s/it]
{'epoch': 3, 'patience': 2, 'Loss': 1309.99697265625, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 4] loss: 0.478: 100%|██████████| 5/5 [00:31<00:00, 6.35s/it]
[valid epoch 4] loss: 351.336: 100%|██████████| 5/5 [00:10<00:00, 2.15s/it]
[valid epoch 4] loss: 2.393: 100%|██████████| 1/1 [00:01<00:00, 1.77s/it]
[valid epoch 4] loss: 1.401: 100%|██████████| 1/1 [00:01<00:00, 1.68s/it]
{'epoch': 4, 'patience': 3, 'Loss': 351.3359375, 'Train': 0.5211365211365212, 'Validation': 0.5648148148148149, 'Test': 0.2708333333333333}
[train epoch 5] loss: 0.538: 100%|██████████| 5/5 [00:39<00:00, 7.86s/it]
[valid epoch 5] loss: 466.242: 100%|██████████| 5/5 [00:12<00:00, 2.47s/it]
[valid epoch 5] loss: 835.374: 100%|██████████| 1/1 [00:02<00:00, 2.07s/it]
[valid epoch 5] loss: 426.533: 100%|██████████| 1/1 [00:02<00:00, 2.08s/it]
{'epoch': 5, 'patience': 0, 'Loss': 466.24150390625, 'Train': 0.49774774774774777, 'Validation': 0.5, 'Test': 0.5}
[train epoch 6] loss: 0.589: 100%|██████████| 5/5 [00:35<00:00, 7.05s/it]
[valid epoch 6] loss: 672.822: 100%|██████████| 5/5 [00:11<00:00, 2.23s/it]
[valid epoch 6] loss: 1352.755: 100%|██████████| 1/1 [00:01<00:00, 1.90s/it]
[valid epoch 6] loss: 629.823: 100%|██████████| 1/1 [00:02<00:00, 2.05s/it]
{'epoch': 6, 'patience': 1, 'Loss': 672.82197265625, 'Train': 0.5, 'Validation': 0.5, 'Test': 0.5}
[train epoch 7] loss: 0.537: 100%|██████████| 5/5 [00:33<00:00, 6.79s/it]
[valid epoch 7] loss: 442.793: 100%|██████████| 5/5 [00:11<00:00, 2.27s/it]
[valid epoch 7] loss: 890.216: 100%|██████████| 1/1 [00:01<00:00, 1.81s/it]
[valid epoch 7] loss: 417.324: 100%|██████████| 1/1 [00:01<00:00, 1.62s/it]
{'epoch': 7, 'patience': 2, 'Loss': 442.793310546875, 'Train': 0.49774774774774777, 'Validation': 0.5, 'Test': 0.5}
[train epoch 8] loss: 0.475: 100%|██████████| 5/5 [00:32<00:00, 6.43s/it]
[valid epoch 8] loss: 263.574: 100%|██████████| 5/5 [00:11<00:00, 2.20s/it]
[valid epoch 8] loss: 549.651: 100%|██████████| 1/1 [00:01<00:00, 1.68s/it]
[valid epoch 8] loss: 269.386: 100%|██████████| 1/1 [00:01<00:00, 1.75s/it]
{'epoch': 8, 'patience': 3, 'Loss': 263.57353515625, 'Train': 0.48873873873873874, 'Validation': 0.5, 'Test': 0.5}
[train epoch 9] loss: 0.489: 100%|██████████| 5/5 [00:33<00:00, 6.64s/it]
[valid epoch 9] loss: 117.936: 100%|██████████| 5/5 [00:11<00:00, 2.23s/it]
[valid epoch 9] loss: 245.917: 100%|██████████| 1/1 [00:01<00:00, 1.74s/it]
[valid epoch 9] loss: 132.232: 100%|██████████| 1/1 [00:01<00:00, 1.80s/it]
{'epoch': 9, 'patience': 4, 'Loss': 117.93641357421875, 'Train': 0.4509702009702009, 'Validation': 0.4814814814814815, 'Test': 0.5}
[train epoch 10] loss: 0.505: 100%|██████████| 5/5 [00:34<00:00, 6.96s/it]
[valid epoch 10] loss: 29.749: 100%|██████████| 5/5 [00:11<00:00, 2.21s/it]
[valid epoch 10] loss: 70.138: 100%|██████████| 1/1 [00:01<00:00, 1.78s/it]
[valid epoch 10] loss: 45.388: 100%|██████████| 1/1 [00:01<00:00, 1.77s/it]
{'epoch': 10, 'patience': 5, 'Loss': 29.749237060546875, 'Train': 0.4222106722106722, 'Validation': 0.48842592592592593, 'Test': 0.359375}
[train epoch 11] loss: 0.497: 100%|██████████| 5/5 [00:32<00:00, 6.56s/it]
[valid epoch 11] loss: 3.561: 100%|██████████| 5/5 [00:10<00:00, 2.15s/it]
[valid epoch 11] loss: 15.333: 100%|██████████| 1/1 [00:01<00:00, 1.77s/it]
[valid epoch 11] loss: 11.147: 100%|██████████| 1/1 [00:01<00:00, 1.84s/it]
{'epoch': 11, 'patience': 6, 'Loss': 3.5606117248535156, 'Train': 0.46197158697158697, 'Validation': 0.4444444444444444, 'Test': 0.21875}
[train epoch 12] loss: 0.497: 100%|██████████| 5/5 [00:33<00:00, 6.60s/it]
[valid epoch 12] loss: 0.636: 100%|██████████| 5/5 [00:10<00:00, 2.19s/it]
[valid epoch 12] loss: 2.595: 100%|██████████| 1/1 [00:01<00:00, 1.71s/it]
[valid epoch 12] loss: 2.038: 100%|██████████| 1/1 [00:01<00:00, 1.73s/it]
{'epoch': 12, 'patience': 7, 'Loss': 0.6362054824829102, 'Train': 0.4875693000693001, 'Validation': 0.3819444444444444, 'Test': 0.23958333333333331}
[train epoch 13] loss: 0.457: 100%|██████████| 5/5 [00:33<00:00, 6.64s/it]
[valid epoch 13] loss: 0.488: 100%|██████████| 5/5 [00:10<00:00, 2.07s/it]
[valid epoch 13] loss: 0.670: 100%|██████████| 1/1 [00:01<00:00, 1.64s/it]
[valid epoch 13] loss: 0.460: 100%|██████████| 1/1 [00:01<00:00, 1.68s/it]
{'epoch': 13, 'patience': 8, 'Loss': 0.4882702350616455, 'Train': 0.5756670131670132, 'Validation': 0.43287037037037035, 'Test': 0.38541666666666663}
[train epoch 14] loss: 0.502: 100%|██████████| 5/5 [00:31<00:00, 6.38s/it]
[valid epoch 14] loss: 0.447: 100%|██████████| 5/5 [00:11<00:00, 2.21s/it]
[valid epoch 14] loss: 0.558: 100%|██████████| 1/1 [00:01<00:00, 1.72s/it]
[valid epoch 14] loss: 0.317: 100%|██████████| 1/1 [00:01<00:00, 1.75s/it]
{'epoch': 14, 'patience': 9, 'Loss': 0.44681572914123535, 'Train': 0.4993069993069993, 'Validation': 0.36342592592592593, 'Test': 0.3541666666666667}
[train epoch 15] loss: 0.506: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:32<00:00, 6.42s/it]
[valid epoch 15] loss: 0.514: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:10<00:00, 2.17s/it]
[valid epoch 15] loss: 0.539: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.98s/it]
[valid epoch 15] loss: 0.344: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.67s/it]
{'epoch': 15, 'patience': 10, 'Loss': 0.5143370151519775, 'Train': 0.5172383922383922, 'Validation': 0.4745370370370371, 'Test': 0.35416666666666663}
[train epoch 16] loss: 0.481: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:32<00:00, 6.48s/it]
[valid epoch 16] loss: 0.508: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:10<00:00, 2.02s/it]
[valid epoch 16] loss: 0.538: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.76s/it]
[valid epoch 16] loss: 0.372: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.76s/it]
{'epoch': 16, 'patience': 11, 'Loss': 0.507773494720459, 'Train': 0.48128898128898123, 'Validation': 0.4745370370370371, 'Test': 0.35416666666666663}
[train epoch 17] loss: 0.474: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:31<00:00, 6.38s/it]
[valid epoch 17] loss: 0.484: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:10<00:00, 2.05s/it]
[valid epoch 17] loss: 0.548: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.95s/it]
[valid epoch 17] loss: 0.323: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.77s/it]
{'epoch': 17, 'patience': 12, 'Loss': 0.4842677116394043, 'Train': 0.46664934164934163, 'Validation': 0.45833333333333337, 'Test': 0.35416666666666663}
[train epoch 18] loss: 0.460: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:30<00:00, 6.12s/it]
[valid epoch 18] loss: 0.472: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:10<00:00, 2.11s/it]
[valid epoch 18] loss: 0.559: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.71s/it]
[valid epoch 18] loss: 0.312: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.68s/it]
{'epoch': 18, 'patience': 13, 'Loss': 0.47157673835754393, 'Train': 0.5103950103950103, 'Validation': 0.5, 'Test': 0.5}
[train epoch 19] loss: 0.499: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:33<00:00, 6.68s/it]
[valid epoch 19] loss: 0.508: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:12<00:00, 2.41s/it]
[valid epoch 19] loss: 0.549: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.98s/it]
[valid epoch 19] loss: 0.321: 100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:01<00:00, 1.80s/it]
{'epoch': 19, 'patience': 14, 'Loss': 0.5075236320495605, 'Train': 0.5150294525294525, 'Validation': 0.5, 'Test': 0.5}
final results: highest_valid: 0.565, final_train: 0.521, final_test: 0.271
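The per-epoch summary lines in the log above are printed as Python dict literals, so they can be parsed to see which epoch scored best on a given metric. This is only a minimal sketch: the helper names `parse_epoch_lines` and `best_epoch` are hypothetical and not part of the training code, and the three sample lines are copied from the log above.

```python
import ast

def parse_epoch_lines(log_text):
    """Return the dict-literal summary lines of a training log as dicts."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        # Summary lines look like {'epoch': 17, 'patience': 12, ...}
        if line.startswith("{") and line.endswith("}"):
            records.append(ast.literal_eval(line))
    return records

def best_epoch(records, metric="Validation"):
    """Return the first record with the highest value for the metric."""
    return max(records, key=lambda record: record[metric])

# Three summary lines copied from the log above
sample_log = """
{'epoch': 17, 'patience': 12, 'Loss': 0.4842677116394043, 'Train': 0.46664934164934163, 'Validation': 0.45833333333333337, 'Test': 0.35416666666666663}
{'epoch': 18, 'patience': 13, 'Loss': 0.47157673835754393, 'Train': 0.5103950103950103, 'Validation': 0.5, 'Test': 0.5}
{'epoch': 19, 'patience': 14, 'Loss': 0.5075236320495605, 'Train': 0.5150294525294525, 'Validation': 0.5, 'Test': 0.5}
"""

records = parse_epoch_lines(sample_log)
best = best_epoch(records)
print(best["epoch"], best["Validation"])
```

Progress-bar lines are skipped because they do not start with a brace, so the whole log can be passed in unchanged.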
I forgot to add: you'll have to edit the finetune.py file, changing model=model.cuda() to model=model, because we are not using CUDA.
To compare the models with Ersilia's Hub implementation, I first had to identify the model name in Ersilia's repo. Initially I tried to fetch model eos5ecc. The fetch started but ran into errors midway. The log can be found here: first_model_log.txt
After some time reading through the log and the error message, I realised that the error occurred because the metadata.json file was empty; this file needs to contain entries. I considered raising an issue, but first checked whether there was a second implementation of the model with its metadata filled in. I found the eos4se9 model, which showed more activity history than eos5ecc and had its metadata filled in. I tried fetching it and it completed without any issues.
@DhanshreeA would it be better to close model eos5ecc to prevent further confusion in the future? Just my suggestion though :)
Hi @Richiio thank you for your suggestion. I should point out that the two models are different in that they do the opposite of each other. eos4se9 takes SMILES inputs and generates IUPAC names (text outputs), whereas eos5ecc takes in IUPAC names (text inputs) and generates SMILES. Incorporating text inputs within Ersilia is still very much WIP, hence this model is in the backlog to be implemented.
@DhanshreeA Are the results not meant to be the same consistently?
@Richiio could you report which version of STOUT you are using? Ersilia is currently using 2.0.1. However, the latest version of STOUT seems to be 2.0.6. If the base model within this package has been updated (whether in terms of its architecture or the training dataset), it is possible that there are differences across the two versions. We will have to see which is more accurate.
@Richiio I also noticed that you are working with models on Google Colab, could you mention why? Were there issues running them on your machine?
Thanks @DhanshreeA for the updates! I did encounter two challenges when installing the model. However, both occurred when I was working outside the conda environment.
Everything worked fine within the conda environment apart from the error OSError: [Errno 0] JVM DLL not found: Define/path/or/set/JAVA_HOME/variable/properly, which you will encounter if you do not have Java installed and added to your environment path and if you decide to use conda install -c decimer stout-pypi. I created the STOUT environment and installed STOUT-pypi using pip instead of conda.
I resolved the JVM error by installing Java with the command sudo apt install openjdk-8-jre and setting the JAVA_HOME environment variable, which was done using the following commands:
nano ~/.bashrc
This opened the shell profile file in a text editor, and I added the following line at the end of the file: export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64. I saved the file and applied the changes using the command source ~/.bashrc
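Before importing STOUT, it can help to confirm from Python that the JAVA_HOME setting described above is actually visible to the process. The following is a minimal sketch; `check_java_home` is a hypothetical helper written for this comment, not part of STOUT or Ersilia, and it only checks that the variable is set and points to an existing directory.

```python
import os
from pathlib import Path

def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not Path(java_home).is_dir():
        return False, f"JAVA_HOME points to a missing directory: {java_home}"
    return True, f"JAVA_HOME is set to {java_home}"

ok, message = check_java_home()
print(message)
```

If the message says the variable is not set even after editing ~/.bashrc, the shell probably needs source ~/.bashrc or a restart before running the script.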
I created a file named sarima.py inside a new folder, downloaded the dataset to the same location as the Python file, and ran nano sarima.py. This opened the editor, and I pasted the following code:
import pandas as pd
from STOUT import translate_forward

# Load the data from the CSV file (using a relative path)
csv_file_path = 'eml_canonical.csv'
df = pd.read_csv(csv_file_path)

# Select the first 20 rows
df = df.head(20)

# Translate SMILES to IUPAC names, collecting one record per drug
smiles_records = []
for _, row in df.iterrows():
    iupac_name = translate_forward(row['smiles'])
    smiles_records.append({'drugs': row['drugs'], 'smiles': row['smiles'], 'iupac_name': iupac_name})
smiles_IUPAC_name = pd.DataFrame(smiles_records, columns=['drugs', 'smiles', 'iupac_name'])

# Translate canonical SMILES to IUPAC names in the same way
can_smiles_records = []
for _, row in df.iterrows():
    iupac_name = translate_forward(row['can_smiles'])
    can_smiles_records.append({'drugs': row['drugs'], 'can_smiles': row['can_smiles'], 'iupac_name': iupac_name})
can_smiles_IUPAC_name = pd.DataFrame(can_smiles_records, columns=['drugs', 'can_smiles', 'iupac_name'])

# Save the DataFrames to CSV files
smiles_IUPAC_name.to_csv('smiles.csv', index=False)
can_smiles_IUPAC_name.to_csv('can_smiles.csv', index=False)
I saved it and exited the editor. I installed pandas and ran python sarima.py. This ran and produced the following output files:
can_smiles.csv
smiles.csv
I decided to work with Colab as it is a lot more straightforward and faster.
Even with the above installation, the output differed from that of Ersilia's Model Hub.
Model Version
(STOUT) root@Richio:~/Test# pip show stout-pypi
Name: STOUT-pypi
Version: 2.0.5
Summary: STOUT V2.0 - Smiles TO iUpac Translator Version 2.0
Home-page: https://github.com/Kohulan/Smiles-TO-iUpac-Translator
Author: Kohulan Rajan
Author-email: kohulan.rajan@uni-jena.de
License: MIT
Location: /root/miniconda3/envs/STOUT/lib/python3.8/site-packages
Requires: jpype1, pystow, tensorflow, unicodedata2
Required-by:
Ersilia's version:
(ersilia) root@Richio:~/Test# conda list
# packages in environment at /root/miniconda3/envs/ersilia:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
alembic 1.12.0 pypi_0 pypi
attrs 21.4.0 pypi_0 pypi
bentoml 0.11.0 pypi_0 pypi
blinker 1.6.2 pypi_0 pypi
boto3 1.28.61 pypi_0 pypi
botocore 1.31.61 pypi_0 pypi
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.7.22 hbcca054_0 conda-forge
cerberus 1.3.5 pypi_0 pypi
certifi 2023.7.22 pypi_0 pypi
chardet 5.2.0 pypi_0 pypi
charset-normalizer 3.3.0 pypi_0 pypi
chembl-webresource-client 0.10.8 pypi_0 pypi
click 8.1.7 pypi_0 pypi
docker 6.1.3 pypi_0 pypi
dockerfile-parse 2.0.1 pypi_0 pypi
easydict 1.10 pypi_0 pypi
emoji 2.8.0 pypi_0 pypi
ersilia 0.1.27 pypi_0 pypi
Thanks for the clarification!
@Richiio, thank you for the progress. Kindly note that with your command, you're checking the package versions in the ersilia environment. For model-specific packages, you can check the Dockerfile of that particular model. Navigate to the repository of that model in the ersilia-os organization on GitHub. For example, for model eos4se9, the package versions are specified in the Dockerfile.
You will notice that Ersilia's model uses STOUT version 2.0.1. So for comparison, try installing the same version using the command pip install STOUT-pypi==2.0.1
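To make sure the active environment really has the pinned version before comparing outputs, the installed version can be read from Python with the standard library. This is a minimal sketch: `installed_version` and `matches_pin` are hypothetical helper names, and the package name and pin are the ones mentioned in this thread.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

def matches_pin(package_name, pinned_version):
    """True only if the package is installed at exactly the pinned version."""
    return installed_version(package_name) == pinned_version

# Package name and pin as discussed above (Ersilia's eos4se9 uses 2.0.1)
found = installed_version("STOUT-pypi")
if found is None:
    print("STOUT-pypi is not installed in this environment")
elif matches_pin("STOUT-pypi", "2.0.1"):
    print("STOUT-pypi matches the pinned version 2.0.1")
else:
    print(f"STOUT-pypi is at {found}, not 2.0.1")
```

This reports the same version that pip show prints, but it can sit at the top of the comparison script so a mismatch is caught before any predictions are made.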
DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
Model Description DiffDock is a model for molecular docking in drug discovery. It takes a fresh approach by treating docking as a generative modeling problem, which helps it achieve better accuracy. DiffDock efficiently explores the possibilities of how small molecules bind to proteins by using a unique diffusion process. DiffDock outperforms traditional docking methods and deep learning techniques, achieving a 38% success rate on the PDBBind dataset, even excelling in computationally folded structures.
Model Identifier Slug: DiffDock
Model Characteristics Input: Ligand poses Task: Molecular docking Tag: Drug discovery, Protein-ligand interaction Output: Binding structure, Score
References Source Code Publication
License: MIT License
SELFormer: Molecular Representation Learning via SELFIES Language Models
Model description We have been used to SMILES as our form of input when predicting properties such as the aqueous solubility of compounds. SELFormer is a transformer-based architecture that uses SELFIES as input in order to learn the molecular representations of drugs. You start by converting your SMILES input to SELFIES, a 100% valid, compact, and expressive notation, which is then used to learn molecular representations for drug discovery and development. Compared to the widely used SMILES notation, SELFIES overcomes various issues, such as non-canonical representations and the inability to capture spatial information. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for various molecular property prediction tasks.
Model Identifier Slug: SELFormer
Model Characteristics Input: SMILES Task: Molecular property prediction, drug discovery Tag: Molecular representation, SELFIES, transformer architecture Output: Predicted molecular properties
References Source code Publication
Model 3
Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery
Model Description Drug discovery is a costly and time-intensive endeavor, hindered by the rapid growth of chemical databases. To tackle this challenge, Deep Docking (DD) is introduced as a deep learning platform capable of rapidly and accurately docking billions of molecular structures. DD employs quantitative structure-activity relationship (QSAR) models trained on subsets of a chemical library to predict docking outcomes, efficiently removing unfavorable molecules iteratively. When combined with the FRED docking program, DD calculates docking scores for 1.36 billion molecules, achieving remarkable data reduction and enrichment of high-scoring molecules without sacrificing favorable results. DD's flexibility allows integration with any docking program, and its resources are publicly available.
Model Identifier Slug: DeepDock
Model Characteristics Input: SMILES Task: Molecular docking Tag: Drug discovery Output: Docked SMILES
References Source Code Publication
License: MIT License
Model
DeepAffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks
Model description This model aims to address the challenge of predicting compound-protein interactions (CPI) with high applicability, accuracy, and interpretability using sequence data alone. Drug discovery relies on understanding how molecules interact with proteins, and computational methods can accelerate this process. However, predicting compound-protein affinity from sequences has been limited in scope and interpretability. DeepAffinity proposes a way of leveraging both labelled and unlabelled data for encoding molecular representations and predicting affinities. The authors use specialized representations of protein sequences, such as secondary structure predictions, and train a deep learning model that unifies recurrent and convolutional neural networks (CNNs). The model achieves impressive accuracy, with a relative error within 5-fold for test cases and 20-fold for new protein classes. It incorporates separate and joint attention mechanisms for interpretability, allowing it to predict and explain selective drug-target interactions.
Model Identifier Slug: DeepAffinity
Model Characteristics Input: SMILES Tag: Compound-Protein interaction, drug discovery Output: Predicted compound-protein affinity
References Source Code Publication
License: GPL-3.0 License
@HellenNamulinda Thanks for the correction! I created a new STOUT environment called STOUT2, activated it, ran the command pip install STOUT-pypi==2.0.1, downloaded the dataset and ran my Python script. The number of SMILES translated to the correct IUPAC name improved, but there were still some incorrect predictions. To confirm which of those predictions were right and which were wrong, I went to PubChem and searched for each drug using its SMILES, which showed the correct IUPAC name as seen here. The PubChem result corresponded more closely with that of the original model (version 2.0.5). Version 2.0.1 of the original model was also not 100% correct, although most predictions were correct.
| drugs | smiles | iupac_name |
| --- | --- | --- |
| abacavir | Nc1nc(NC2CC2)c3ncn([C@@h]4CC@HC=C4)c3n1 | [(1S,4R)-4-[6-(cyclopropylamino)-2-(methylamino)-6H-purin-9-yl]cyclopent-2-en-1-yl]methanol |
| abiraterone | C[C@]12CCC@HCC1=CC[C@@h]3[C@@h]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | (3S,6aS,6bR,10aR,10bS)-6a,10a-dimethyl-7-pyridin-3-yl-1,2,3,4,6,7,10,10b-octahydrobenzo[a]azulen-3-ol |
| acetazolamide | CC(=O)Nc1sc(nn1)S(=O)=O | N-[5-[amino(dioxo)-λ6-sulfanyl]-1,3,4-thiadiazol-2-yl]acetamide |
| acetic acid | CC(O)=O | aceticacid |
| acetylcysteine | CC(=O)NC@@HC(O)=O | (2R)-2-acetamido-3-sulfanylpropanoicacid |
| acetylsalicylic acid | CC(=O)Oc1ccccc1C(O)=O | 2-acetyloxybenzoicacid |
| aciclovir | NC1=NC(=O)c2ncn(COCCO)c2N1 | 9-(2-hydroxyethoxymethyl)-2-(methylamino)-3H-purin-6-one |
| aclidinium | OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | 2-[[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]oxy]-1,1-dithiophen-2-ylethanol |
| amlodipine | CCOC(=O)C1=C(COCCN)NC(=C(C1c2ccccc2Cl)C(=O)OC)C | 3-O-ethyl5-O-methyl2-(2-aminoethoxymethyl)-4-(6-chlorophenyl)-6-methyl-1,4-dihydropyridine-3,5-dicarboxylate |
| amodiaquine | CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O | 4-[(7-chloroquinolin-4-yl)amino]-2-(diethylaminomethyl)cyclohexa-1,3,5-trien-1-ol |
The output files can_smiles.csv smiles.csv
Hi @Richiio many thanks for your efforts, especially for validating these compounds against PubChem. Great effort! This is very helpful for us. As a bonus task (please note that it will not have any effect on your application if you cannot complete it), could you run 2.0.1 (the version that Ersilia uses), 2.0.5, and 2.0.6 on the EML file, and report the results in a csv? You can keep the columns as (smiles, ver_201, ver_205, ver_206), and if time permits, please add an extra column for what PubChem has to say about these molecules. It will be very useful for us. Again, only if time permits. It's a bonus task, and not required for your application.
@Richiio Additionally, thank you for the model suggestions, you can mark week 3 tasks as completed. :)
Thanks so much @DhanshreeA for getting back. For the extra task, I created a third environment called STOUT3 and installed STOUT-pypi using pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git. To confirm I was using the right version, I ran the pip show command and got the following output:
(STOUT3) root@Richio:~# pip show stout-pypi
Name: STOUT-pypi
Version: 2.0.6
Summary: STOUT V2.0 - Smiles TO iUpac Translator Version 2.0
Home-page: https://github.com/Kohulan/Smiles-TO-iUpac-Translator
Author: Kohulan Rajan
Author-email: kohulan.rajan@uni-jena.de
License: MIT
Location: /root/miniconda3/envs/STOUT3/lib/python3.8/site-packages
Requires: jpype1, pystow, tensorflow, unicodedata2
Required-by:
I created the columns for the various versions and retrieved the csv file, then opened it in Excel to add a column for the PubChem reference: I copy-pasted each SMILES into PubChem and recorded the IUPAC name it returned. The final csv compilation can be found here:
Final Result - smiles (1).csv.csv
Version 2.0.1 was the one whose results disagreed with PubChem; versions 2.0.5 and 2.0.6 (original code) corresponded with the results from PubChem.
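The agreement between each version and PubChem can also be counted programmatically rather than by eye. Below is a minimal sketch, assuming a csv with the columns suggested earlier in this thread (smiles, ver_201, ver_205, ver_206) plus a pubchem column. The column name `pubchem`, the helper names, and the sample rows are all made-up placeholders for illustration, not real results. Names are compared after lowercasing and removing whitespace, on the assumption that some outputs drop spaces (as in "aceticacid").

```python
import csv
import io

VERSION_COLUMNS = ["ver_201", "ver_205", "ver_206"]

def normalize(name):
    """Lowercase and drop all whitespace so 'name a' matches 'namea'."""
    return "".join(name.split()).lower()

def agreement_with_reference(rows, reference_column="pubchem"):
    """Count, per version column, how many rows match the reference name."""
    counts = {column: 0 for column in VERSION_COLUMNS}
    for row in rows:
        reference = normalize(row[reference_column])
        for column in VERSION_COLUMNS:
            if normalize(row[column]) == reference:
                counts[column] += 1
    return counts

# Placeholder data only: these are NOT real SMILES or model outputs
sample_csv = """smiles,ver_201,ver_205,ver_206,pubchem
SMILES_1,namea,name a,name a,Name A
SMILES_2,name x,name b,name b,name b
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts = agreement_with_reference(rows)
for column in VERSION_COLUMNS:
    print(f"{column}: {counts[column]} of {len(rows)} match the reference")
```

To run it on the real compilation, the `io.StringIO` sample would be replaced by opening the csv file, with the column names adjusted to whatever the file actually uses.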
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application