ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Adedeji Adewole #613

Closed AdedejiAdewole closed 1 year ago

AdedejiAdewole commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

AdedejiAdewole commented 1 year ago

I have followed the steps to install Ersilia but I encountered an error trying to run the help command. The error is pasted below:

(base) adewoleadedeji (master #) ~ $ conda activate ersilia
(ersilia) adewoleadedeji (master #) ~ $ ersilia --help
Segmentation fault: 11
(ersilia) adewoleadedeji (master #) ~ $ cd ersilia
(ersilia) adewoleadedeji (master) ersilia $ ersilia --help
Segmentation fault: 11

AdedejiAdewole commented 1 year ago

Successfully solved this error using "conda install -c conda-forge protobuf".

GemmaTuron commented 1 year ago

Hi @AdedejiAdewole

Are you running on macOS? This is probably related to issue #591

AdedejiAdewole commented 1 year ago

Hello @GemmaTuron Thank you for your response. I was able to fix that error following issue #591, but I encountered an error while trying to fetch the model. The error is described below.

🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

expected str, bytes or os.PathLike object, not NoneType If this error message is not helpful, open an issue at:

If you haven't, try to run your command in verbose mode (-v in the CLI)

GemmaTuron commented 1 year ago

Hi @AdedejiAdewole

Please use this: "If you haven't, try to run your command in verbose mode (-v in the CLI)". It will provide a better log file

AdedejiAdewole commented 1 year ago

Hello @GemmaTuron I ran the command in verbose mode. The command ran was "ersilia -v fetch eos3b5e".

AdedejiAdewole commented 1 year ago

@GemmaTuron This is the full error message when I tried to fetch eos3b5e in verbose mode.

⬇️ Fetching model eos3b5e: molecular-weight
0%| | 0/8 [00:00<?, ?it/s]
21:40:07 | INFO | GitHub CLI is not installed. Ersilia can work without it, but we highy recommend that you install this tool.
21:40:07 | DEBUG | Git LFS is installed
Updated Git hooks.
Git LFS initialized.
21:40:07 | DEBUG | Git LFS has been activated
21:40:08 | DEBUG | Connected to the internet
21:40:08 | DEBUG | Conda is installed
21:40:08 | DEBUG | EOS Home path exists
Checking setup: 1.476s
12%|█████▋ | 1/8 [00:01<00:10, 1.48s/it]
21:40:08 | INFO | Starting delete of model eos3b5e
21:40:08 | INFO | Removing folder /Users/adewoleadedeji/eos/dest/eos3b5e
21:40:08 | INFO | Removing folder /Users/adewoleadedeji/eos/repository/eos3b5e
21:40:13 | INFO | Deleting conda environment eos3b5e

Remove all packages in environment /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e:

21:40:22 | DEBUG | Deleting /Users/adewoleadedeji/eos/isaura/lake/eos3b5e_local.h5
21:40:22 | DEBUG | Deleting /Users/adewoleadedeji/eos/isaura/lake/eos3b5e_public.h5
21:40:22 | INFO | Removing docker images and stopping containers related to eos3b5e
Deleted Containers: 7afecde45deb0a35f5ad5e630b538252d41ecea66acc5554f0f97c37fce5741b

Total reclaimed space: 0B
21:40:23 | DEBUG | Running docker images > /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-szjoakf3/docker-images.txt
21:40:23 | DEBUG | Model entry eos3b5e was not available in the fetched models registry
21:40:23 | SUCCESS | Model eos3b5e deleted successfully
Preparing model: 15.32066297531128s
25%|███████████▎ | 2/8 [00:16<00:57, 9.62s/it]
21:40:50 | DEBUG | Cloning from github to /Users/adewoleadedeji/eos/dest/eos3b5e
Cloning into 'eos3b5e'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 47 (delta 15), reused 17 (delta 4), pack-reused 0
Receiving objects: 100% (47/47), 25.89 KiB | 73.00 KiB/s, done.
Resolving deltas: 100% (15/15), done.
rm: /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-gypm85: is a directory
21:40:53 | INFO | 🚀 Model starting...
21:40:53 | DEBUG | {'version': '0.11.0', 'slim': False, 'python': 'py37'}
Getting model: 29.204281091690063s
38%|████████████████▉ | 3/8 [00:46<01:32, 18.56s/it]
21:40:53 | DEBUG | Check if model can be run with vanilla (system) code (i.e. dockerfile has no installs)
21:40:53 | DEBUG | Check bentoml and python version
21:40:53 | INFO | BentoML version {'version': '0.11.0', 'slim': False, 'python': 'py37'}
21:40:53 | DEBUG | Custom Ersilia BentoML is used, no need for modifying protobuf version
21:40:53 | DEBUG | Model needs some installs
21:40:53 | DEBUG | Checking if only python/conda install will be sufficient
21:40:53 | DEBUG | Mode: conda
21:40:53 | DEBUG | Trying to remove path: /Users/adewoleadedeji/bentoml/repository/eos3b5e
21:40:53 | DEBUG | ...successfully
21:40:53 | DEBUG | ...but path did not exist!
21:40:53 | DEBUG | Initializing conda packer
21:40:53 | DEBUG | Packing model with Conda
21:40:53 | DEBUG | Writing install commands
21:40:53 | DEBUG | Run commands: ['pip install rdkit-pypi']
21:40:53 | DEBUG | Writing install commands in /Users/adewoleadedeji/eos/dest/eos3b5e/model_install_commands.sh
21:40:53 | DEBUG | Setting up
21:40:53 | DEBUG | Installs file /Users/adewoleadedeji/eos/dest/eos3b5e/model_install_commands.sh
21:40:53 | DEBUG | Conda environment eos3b5e
21:40:56 | DEBUG | Environment eos3b5e does not exist
21:40:58 | INFO | Cloning base conda environment and adding model dependencies
Source: /Users/adewoleadedeji/opt/anaconda3/envs/eosbase-bentoml-0.11.0-py37
Destination: /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e
Packages: 14
Files: 5758

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate eos3b5e
#
# To deactivate an active environment, use
#
#     $ conda deactivate
#

21:41:23 | DEBUG | Run commandlines on eos3b5e
21:41:23 | DEBUG | python -m pip --disable-pip-version-check install rdkit-pypi
python -m pip --disable-pip-version-check install git+https://github.com/ersilia-os/bentoml-ersilia.git

21:41:25 | DEBUG | Activating base environment
21:41:25 | DEBUG | Current working directory: /Users/adewoleadedeji/eos/dest/eos3b5e
21:41:25 | DEBUG | Running bash /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-8oayy7ix/script.sh > /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-lydnools/command_outputs.log 2>&1
conda activate eos3b5e
21:47:29 | DEBUG | # conda environments:
base                             /Users/adewoleadedeji/opt/anaconda3
eos3b5e                       *  /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e
eosbase-bentoml-0.11.0-py37      /Users/adewoleadedeji/opt/anaconda3/envs/eosbase-bentoml-0.11.0-py37
ersilia                          /Users/adewoleadedeji/opt/anaconda3/envs/ersilia
test                             /Users/adewoleadedeji/opt/anaconda3/envs/test
tf                               /Users/adewoleadedeji/opt/anaconda3/envs/tf

Collecting rdkit-pypi
  Using cached rdkit_pypi-2022.9.5-cp37-cp37m-macosx_10_9_x86_64.whl (24.7 MB)
Collecting Pillow
  Using cached Pillow-9.4.0-2-cp37-cp37m-macosx_10_10_x86_64.whl (3.3 MB)
Requirement already satisfied: numpy in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from rdkit-pypi) (1.21.6)
Installing collected packages: Pillow, rdkit-pypi
Successfully installed Pillow-9.4.0 rdkit-pypi-2022.9.5
Collecting git+https://github.com/ersilia-os/bentoml-ersilia.git
  Cloning https://github.com/ersilia-os/bentoml-ersilia.git to /private/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/pip-req-build-zoqvfah8
  Running command git clone --filter=blob:none --quiet https://github.com/ersilia-os/bentoml-ersilia.git /private/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/pip-req-build-zoqvfah8
  Resolved https://github.com/ersilia-os/bentoml-ersilia.git to commit a0f0040a1198e8f1704f0395e5d9ce328aaecf71
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: numpy in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.21.6)
Requirement already satisfied: werkzeug in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.2.3)
Requirement already satisfied: psutil in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (5.9.4)
Requirement already satisfied: alembic in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.10.1)
Requirement already satisfied: sqlalchemy-utils in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.40.0)
Requirement already satisfied: tabulate in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.9.0)
Requirement already satisfied: humanfriendly in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (10.0)
Requirement already satisfied: packaging in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (23.0)
Requirement already satisfied: multidict in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (6.0.4)
Requirement already satisfied: ruamel.yaml in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.17.21)
Requirement already satisfied: python-json-logger in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.0.7)
Requirement already satisfied: flask in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.2.3)
Requirement already satisfied: boto3 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.26.85)
Requirement already satisfied: docker in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (6.0.1)
Requirement already satisfied: sqlalchemy in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.0.5.post1)
Requirement already satisfied: requests in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.28.2)
Requirement already satisfied: cerberus in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.3.4)
Requirement already satisfied: protobuf<3.19,>=3.8.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (3.18.3)
Requirement already satisfied: prometheus-client in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.16.0)
Requirement already satisfied: chardet in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (5.1.0)
Requirement already satisfied: typing-extensions>=4 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (4.5.0)
Requirement already satisfied: Mako in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (1.2.4)
Requirement already satisfied: importlib-metadata in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (6.0.0)
Requirement already satisfied: importlib-resources in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (5.12.0)
Requirement already satisfied: greenlet!=0.4.17 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from sqlalchemy->bentoml==0.11.0) (2.0.2)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from boto3->bentoml==0.11.0) (1.0.1)
Requirement already satisfied: botocore<1.30.0,>=1.29.85 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from boto3->bentoml==0.11.0) (1.29.85)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from boto3->bentoml==0.11.0) (0.6.0)
Requirement already satisfied: setuptools in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from cerberus->bentoml==0.11.0) (65.6.3)
Requirement already satisfied: urllib3>=1.26.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from docker->bentoml==0.11.0) (1.26.14)
Requirement already satisfied: websocket-client>=0.32.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from docker->bentoml==0.11.0) (1.5.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from requests->bentoml==0.11.0) (3.1.0)
Requirement already satisfied: certifi>=2017.4.17 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from requests->bentoml==0.11.0) (2022.12.7)
Requirement already satisfied: idna<4,>=2.5 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from requests->bentoml==0.11.0) (3.4)
Requirement already satisfied: itsdangerous>=2.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from flask->bentoml==0.11.0) (2.1.2)
Requirement already satisfied: click>=8.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from flask->bentoml==0.11.0) (8.1.3)
Requirement already satisfied: Jinja2>=3.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from flask->bentoml==0.11.0) (3.1.2)
Requirement already satisfied: MarkupSafe>=2.1.1 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from werkzeug->bentoml==0.11.0) (2.1.2)
Requirement already satisfied: ruamel.yaml.clib>=0.2.6 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from ruamel.yaml->bentoml==0.11.0) (0.2.7)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.85->boto3->bentoml==0.11.0) (2.8.2)
Requirement already satisfied: zipp>=0.5 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from importlib-metadata->alembic->bentoml==0.11.0) (3.15.0)
Requirement already satisfied: six>=1.5 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.30.0,>=1.29.85->boto3->bentoml==0.11.0) (1.16.0)

21:47:29 | DEBUG | Activation done
21:47:29 | DEBUG | Creating environment YAML file
21:47:39 | DEBUG | Storing Conda environment in the local environment database
21:47:39 | DEBUG | Done with the Conda setup
21:47:41 | DEBUG | Using environment eos3b5e
21:47:41 | DEBUG | Running command: python pack.py
21:47:41 | DEBUG | Run commandlines on eos3b5e
21:47:41 | DEBUG | python pack.py

21:47:43 | DEBUG | Activating base environment
21:47:43 | DEBUG | Current working directory: /Users/adewoleadedeji/eos/dest/eos3b5e
21:47:43 | DEBUG | Running bash /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-ws8w7z57/script.sh > /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-nty1dbmd/command_outputs.log 2>&1
21:48:19 | DEBUG | # conda environments:
base                             /Users/adewoleadedeji/opt/anaconda3
eos3b5e                       *  /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e
eosbase-bentoml-0.11.0-py37      /Users/adewoleadedeji/opt/anaconda3/envs/eosbase-bentoml-0.11.0-py37
ersilia                          /Users/adewoleadedeji/opt/anaconda3/envs/ersilia
test                             /Users/adewoleadedeji/opt/anaconda3/envs/test
tf                               /Users/adewoleadedeji/opt/anaconda3/envs/tf

/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-ws8w7z57/script.sh: line 9: 28934 Segmentation fault: 11 python pack.py

21:48:19 | DEBUG | Activation done
21:48:19 | DEBUG | Previous command successfully run inside eos3b5e conda environment
21:48:19 | DEBUG | Now trying to establish symlinks
21:48:19 | DEBUG | BentoML location is None
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

expected str, bytes or os.PathLike object, not NoneType If this error message is not helpful, open an issue at:

If you haven't, try to run your command in verbose mode (-v in the CLI)

AdedejiAdewole commented 1 year ago

@pauline-banye @DhanshreeA Hello, sorry to be a bother, but could you please assist me in solving this?

ZakiaYahya commented 1 year ago

> @pauline-banye @DhanshreeA Hello, sorry to be a bother, but could you please assist me in solving this?

Were you able to fetch it successfully or not? I'm getting the same error; I've tried to fetch the model hundreds of times with no luck. It seems like many of us are getting this error while fetching.

GemmaTuron commented 1 year ago

Hi @ZakiaYahya and @AdedejiAdewole

Let's get this solved:

  1. Please specify your system settings; I have yet to see these @AdedejiAdewole
  2. Collect the log files and upload them to the GitHub issue instead of pasting the whole error. Instructions on how to do that are in the guidelines, but in short, the command is ersilia -v fetch modelname > my.log 2>&1
  3. Once you get the log, look at it yourself and try to pinpoint the source of the error, and explain it. This is essential practice. @ZakiaYahya it might be that you cannot fetch the model due to another issue, so please read through the log files and try to point out the source of the error.
  4. Fetching a model a hundred times without making any change won't help. If you encounter issues, please report them and try to debug before trying the exact same command again.
  5. Please make sure you give all appropriate explanations before tagging mentors and supporters. Again, we have 4 whole weeks to contribute, so do not stress; we will solve all the issues.

AdedejiAdewole commented 1 year ago

Thank you @GemmaTuron and @ZakiaYahya. I'm using macOS Monterey with the Intel chip. From the log file generated while trying to fetch this model, it seems that the BentoML location is None, and I think this is what terminates the model fetch. The log file is attached below; if you scroll to the end, just before the process terminates, you can see the DEBUG step that tries to locate BentoML, reports it as None, and then prints the error message "expected str, bytes or os.PathLike object, not NoneType".

my.log

GemmaTuron commented 1 year ago

Hi @AdedejiAdewole

Thanks for the explanation and the log file! Actually, I think the source of the error is in line 187: /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-20iuz3d0/script.sh: line 9: 68314 Segmentation fault: 11 python pack.py

Segmentation faults are a macOS issue; we have been encountering them since the latest macOS update. See issue #610 and try to understand if this is what is happening --> try to create a new conda environment for ersilia with a higher Python version

AdedejiAdewole commented 1 year ago

Hi @GemmaTuron, from issue #610, he said the solution he found is:

  1. At model fetching time, we check if we are on an ARM64 platform.
  2. If true, then we check if Python version in the model’s Dockerfile is below 3.10 (to be inclusive with M1 & M2 chips). If true, we overwrite the string to py310.
  3. Then, Ersilia will install this model in a conda environment based on Python 3.10 instead of 3.7, which makes it M1/M2 compatible.

Can you shed more light on how to perform these processes above?

From my understanding, the models were developed using Python 3.7, and installing Ersilia with Python 3.8 would not fix the problem, because each individual model has its own conda environment. Even if Ersilia is installed in an environment with Python 3.8 in my case, the model would still run in a separate environment corresponding to the version specified for that model.

He also mentioned that Python 3.7 isn't compatible with M1 and advised using Python 3.8 when installing Ersilia. I did that, and also ran the command you suggested in issue #591 to install protobuf, but I'm still getting the same error, shown in the log file below.

my.log

GemmaTuron commented 1 year ago

Hi @AdedejiAdewole

Please look through the log files before dumping them here; I see the latest one is probably just a misspelling: Could not identify model identifier or slug: modelname; make sure the model identifier is correctly written.

The Python 3.7 bump to higher versions is because Mac M1 chips no longer support Py37. Precisely because each model has its own conda env, using Py3.8 for Ersilia won't affect the models. Those that have a hard requirement for Py3.7 will indeed not work on M1 chips but should work on the rest; we are slowly backtesting and updating them all, and also containerizing them in Docker. Once you have an updated log file from a correctly spelled command, explain which line indicates the error you get and we will try to sort it out

AdedejiAdewole commented 1 year ago

Hello @GemmaTuron

Still getting the same error trying to fetch the model, /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-jn6q_wlf/script.sh: line 9: 11012 Segmentation fault: 11 python pack.py.

The log file is provided below.

my.log

AdedejiAdewole commented 1 year ago

I've successfully tested the Ersilia model on Google Colab, because I'm having the segmentation fault error when I try to fetch the model on my local machine. I was advised to use Google Colab to get familiar with the commands while @GemmaTuron troubleshoots the issue.

The commands using Google Colab seem more complex but I was able to understand what those commands are doing. The output file of my model test is attached below.

eos3b5e_output.csv

I look forward to testing this model on my local machine and carrying out other tasks.

AdedejiAdewole commented 1 year ago

Motivation to work at Ersilia

Since graduating with a B.Sc. in Computer Science in 2020, I haven't had enough opportunities to put into practice what I learnt during and after university. However, I have spent a reasonable amount of time acquiring more knowledge and certifications in machine learning. I was granted an Udacity/AWS Nanodegree scholarship to study Artificial Intelligence. During my 6-month period of study I learned AI and ML techniques such as neural networks for building image classifiers, and used these skills to build a flower image classifier.

I went on to acquire a certification in the Machine Learning Specialisation on Coursera, where I learned supervised learning (neural networks, linear regression, logistic regression and decision trees), unsupervised learning (clustering, anomaly detection), reinforcement learning and recommender systems. I am also familiar with Git and GitHub, which are necessary for open-source projects, and I am very good at documentation using the necessary tools. I have worked on a model that predicts NPK (Nitrogen, Phosphorus and Potassium) levels in soils and suggests the quantity of NPK fertiliser to be added to low-NPK soils; this project involved feature engineering to generate new data features. I have also worked on Kaggle projects like house price prediction, wine quality prediction, stroke prediction, gemstone quality prediction and many others to improve my ML skills.

It is impressive what Ersilia has achieved in just 3 years, and I am most eager to be a part of it, as it has been encouraging so far. The responses to questions, the amount of work put into meeting contributors' needs and requests, and the guidance from the mentors show how important Ersilia is to Edoardo, Miquel and Gemma, and I'm privileged to be a part of this learning process. I look forward to applying the knowledge gained so far and would be honoured to apply this newly amassed knowledge in the infectious and neglected disease research field. The possibility and importance of using ML/AI to improve the world's health cannot be overemphasised, and I look forward to being a part of it.

Thank you for this opportunity

AdedejiAdewole commented 1 year ago

I studied two of the available models and decided to select STOUT for a few reasons:

  1. The use of neural networks to generate IUPAC names for chemical compounds from their structures and substructures, and vice versa, makes the process less cumbersome, more accurate and more efficient. It would be interesting to see how this was developed.
  2. Prior to this, due to its algorithmic complexity and large set of rules, IUPAC name generation was missing from many cheminformatics toolkits, so this provides another automatic tool for IUPAC name generation. This would make things a lot easier in the area of drug development.
  3. I was really into chemistry in college and this brings a lot back.
  4. I wasn't able to install and run the other models due to some incompatibilities.

The use of SMILES (Simplified Molecular Input Line Entry System) strings, which are concise line representations primarily designed to be understood by machines, has been incorporated into many major open-source cheminformatics toolkits.

This research involves the use of Neural Machine Translation (NMT) for the conversion of machine-readable chemical line notations such as SMILES into IUPAC names and vice versa. From all this, the idea emerged to build a SMILES-to-IUPAC translator called STOUT. The two chemical representations were treated as two different languages: each SMILES string and its corresponding IUPAC name were treated as two different sentences that have the same meaning in reality.

The effect of abundant, high-quality data on training machine learning models cannot be over-emphasised. To achieve maximum and effective accuracy using NMT, it is important to have a large amount of high-quality data, so datasets were generated for SMILES-to-IUPAC-name translation and for IUPAC-name-to-SMILES translation.

All the molecules were obtained from PubChem, an open molecule database, and downloaded in SDF format. Hydrogens were removed from the molecules, which were then converted to canonical SMILES strings using the CDK. 111 million molecules were obtained and filtered through a set of standard rules to produce a final 81 million molecules. These SMILES were then converted to IUPAC names using Chemaxon's molconvert software.

I understood that the SMILES were converted to SELFIES, a less complex representation of the structure of chemical compounds, to be used in the neural networks.
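As a tiny illustration of that SMILES-to-SELFIES step, here is a sketch using the selfies Python package; this is my own example of the conversion, not necessarily the exact pipeline used in the paper:

```python
import selfies as sf

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # caffeine, used elsewhere in this thread

selfies_str = sf.encoder(smiles)     # SMILES -> SELFIES
roundtrip = sf.decoder(selfies_str)  # SELFIES -> SMILES

print(selfies_str)
print(roundtrip)
```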

Two separate datasets were created: a 30-million and a 60-million-molecule dataset with corresponding IUPAC names and SELFIES respectively. Each IUPAC name and SELFIES string was separated into tokens with a space as a delimiter.

The network uses an encoder-decoder architecture. Input strings are fed to the encoder, the outputs of the encoder are fed into the decoder as its input, and I understood that:

  1. The encoder and decoder networks use RNNs and GRUs.
  2. The encoder network generates the encoder output and hidden state.
  3. The attention weight is then calculated by the attention mechanism implemented in the network.
  4. The encoder output and the attention weights then create the context vector, while the decoder outputs are passed through an embedding layer.
  5. The output generated by this embedding layer and the context vector are concatenated and passed on to the GRUs of the decoder.

Basically what this means is that the same network architecture is used for both translations by swapping the input/outputs datasets.
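To make the architecture described above concrete, here is a minimal sketch of a GRU encoder-decoder with additive (Bahdanau-style) attention in TensorFlow/Keras. This is my own illustration of the described design, not STOUT's actual code; vocabulary and layer sizes are placeholders.

```python
import tensorflow as tf

VOCAB_IN, VOCAB_OUT, EMB, UNITS = 5000, 5000, 256, 512  # illustrative sizes

class Encoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(VOCAB_IN, EMB)
        # return_sequences -> per-token outputs for attention;
        # return_state -> final hidden state handed to the decoder
        self.gru = tf.keras.layers.GRU(UNITS, return_sequences=True, return_state=True)

    def call(self, tokens):
        return self.gru(self.embedding(tokens))  # (enc_output, enc_state)

class Decoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(VOCAB_OUT, EMB)
        self.attention = tf.keras.layers.AdditiveAttention()  # Bahdanau-style
        self.gru = tf.keras.layers.GRU(UNITS, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(VOCAB_OUT)  # logits over the output vocabulary

    def call(self, prev_token, enc_output, state):
        emb = self.embedding(prev_token)                                  # (batch, 1, EMB)
        context = self.attention([tf.expand_dims(state, 1), enc_output])  # (batch, 1, UNITS)
        x = tf.concat([context, emb], axis=-1)  # concatenate context vector and embedding
        out, state = self.gru(x, initial_state=state)
        return self.fc(out), state
```

At inference time the decoder is run one token at a time, feeding its own prediction back in; swapping the input/output datasets reuses the same architecture for the reverse translation.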

Hyperparameters

  1. The Adam optimiser with a learning rate of 0.0005 means it will take longer for the loss function to converge, but will probably provide better accuracy.
  2. Sparse categorical cross-entropy is used as the loss for gradient descent, since it is a multi-class, multi-label task.
  3. Batch sizes of 256 strings and 1024 strings are used on the GPU and TPU respectively (a sketch of these settings follows this list).
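A small illustration of these training settings, again assuming TensorFlow/Keras rather than STOUT's actual training script:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)  # learning rate from the paper
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

BATCH_SIZE = 256     # reported for GPU training
# BATCH_SIZE = 1024  # reported for TPU training
```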

This model shows the importance of using strong processing units: training a neural network on a CPU with that amount of data would take many months, or might never complete due to interruptions and other factors. Here, the average training epoch takes 27 hours on a strong GPU and is reduced to about 2 hours on a very strong TPU. This proves the importance of strong processing units in machine learning.

Model testing: 2.2 million molecules were used for testing, and BLEU scores were used to assess the accuracy of the predictions, as well as Tanimoto similarities. Of course, the predicted IUPAC names had to be converted back to SMILES using OPSIN in order to use Tanimoto similarity calculations for the accuracy of those predictions. This was very interesting to me.
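Since the round-trip is scored with Tanimoto similarity, here is a minimal sketch of how such a score can be computed with RDKit; the fingerprint settings (Morgan, radius 2, 2048 bits) are my assumption, not necessarily the paper's:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between two molecules given as SMILES."""
    fps = []
    for smi in (smiles_a, smiles_b):
        mol = Chem.MolFromSmiles(smi)
        fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

# a predicted IUPAC name that OPSIN converts back to the same structure scores 1.0
print(tanimoto("CC(=O)O", "CC(=O)O"))
```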

I also understood that the difference in training time between SELFIES-to-IUPAC-name and IUPAC-name-to-SELFIES translation was a result of the complexity of IUPAC names: IUPAC names contain more, and more complex, strings, so translating them takes more training time, since unpacking and reprocessing IUPAC names takes longer.

It would be interesting to see the skeleton of this neural network architecture: the number of layers, the units of each layer, the activation functions, and the methods involved in reducing bias and variance, and to know whether regularisation was implemented when bias or variance was encountered.

This is very interesting work and I would like to learn more and be involved in it, so I have studied the model and installed it on my local machine.

AdedejiAdewole commented 1 year ago

After successfully installing the STOUT model on my system, I was able to run predictions on my local machine. The steps to install and run the model are listed below:

  1. Installed all the dependencies listed in the Python_requirements.txt file.
  2. Installed STOUT with pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git on my CLI.
  3. Activated STOUT after completing installation.
  4. Ran python by entering python3.
  5. Ran the codes in the Simple usage in the repo.

The usage of the model on my CLI is shown below:

>>> from STOUT import translate_forward, translate_reverse
2023-03-17 10:08:20.131610: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
>>> IUPAC_name = translate_forward(SMILES)
>>> print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
IUPAC name of CN1C=NC2=C1C(=O)N(C)C(=O)N2C is: 1,3,7-trimethylpurine-2,6-dione
>>> IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
>>> SMILES = translate_reverse(IUPAC_name)
>>> print("SMILES of "+IUPAC_name+" is: "+SMILES)
SMILES of 1,3,7-trimethylpurine-2,6-dione is: CN1C=NC2=C1C(=O)N(C)C(=O)N2C

from STOUT import translate_forward, translate_reverse imports the two functions that perform both translations as explained in the publication. The translate_forward function translates SMILES to IUPAC names: SMILES strings are fed into the neural network and the corresponding IUPAC name is produced as output. Conversely, translate_reverse translates IUPAC names to SMILES: the more complex IUPAC string is fed into the neural network, which produces the corresponding SMILES as output. I'm guessing the transformation of SMILES to SELFIES, and of SELFIES back to SMILES, takes place inside these functions; the SMILES are transformed to SELFIES because of their less complex form, which makes them easier to unpack when fed into the neural network.

This must be a multi-label, multi-class classification. I would like to see how the multi-labels were generated and put together, because I was working on a similar model before the start of this internship, so this will provide more insight.

I will now run the model on the Ersilia Model Hub to compare the results. I will be using Google Colab, as I've been doing, because I wasn't able to fetch the initial model on my local machine due to segmentation faults.

AdedejiAdewole commented 1 year ago

I have fetched the model corresponding to the STOUT model on the Ersilia Model Hub using Google Colab; the first five predictions are shown below:

0. key: MCGSCOLBFJQGHM-SCZZXKLOSA-N
   input: Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1
   iupacs_names: [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol

1. key: GZOSMCIZMLWJML-VJLLXTKPSA-N
   input: C[C@]12CC[C@H]3C@@H[C@@H]1CC=C2c1cccnc1
   iupacs_names: (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol

2. key: BZKPWHYZMXOIDC-UHFFFAOYSA-N
   input: CC(=O)Nc1nnc(S(N)(=O)=O)s1
   iupacs_names: N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide

3. key: QTBSBXVTEAMEQO-UHFFFAOYSA-N
   input: CC(=O)O
   iupacs_names: aceticacid

4. key: PWKSKIMOESPYIA-BYPYZUCNSA-N
   input: CC(=O)NC@@HC(=O)O
   iupacs_names: (2R)-2-acetamido-3-sulfanylpropanoicacid

The results of the Ersilia model are similar to the STOUT results in that it converts molecules represented as SMILES to IUPAC names, but it doesn't convert IUPAC names back to SMILES. The model took approximately 22.342 minutes to make 442 SMILES predictions, even when run on a GPU. This shows the importance of a good processing unit when training and even when making predictions.

I would like to see IUPAC-To-SMILES translation also incorporated into this model and be part of it.

AdedejiAdewole commented 1 year ago

I have tried predicting the first five SMILES contained in Ersilia's SMILES file with the STOUT model. I did this to properly compare the results of both models; the code and outputs are provided below:

>>> from STOUT import translate_forward, translate_reverse
2023-03-17 13:14:44.147335: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> SMILE1 = 'Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1'
>>> SMILE2 = 'C[C@]12CC[C@H]3C@@H[C@@H]1CC=C2c1cccnc1'
>>> SMILE3 = 'CC(=O)Nc1nnc(S(N)(=O)=O)s1'
>>> SMILE4 = 'CC(=O)O'
>>> SMILE5 = 'CC(=O)NC@@HC(=O)O'
>>> IUPAC_name1 = translate_forward(SMILE1)
>>> IUPAC_name2 = translate_forward(SMILE2)
>>> IUPAC_name3 = translate_forward(SMILE3)
>>> IUPAC_name4 = translate_forward(SMILE4)
>>> IUPAC_name5 = translate_forward(SMILE5)
>>> print("----------SMILES---------" + "\n" + SMILE1 + "\n" + SMILE2 + "\n" + SMILE3 + "\n" + SMILE4 + "\n" + SMILE5)
----------SMILES---------
Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1
C[C@]12CC[C@H]3C@@H[C@@H]1CC=C2c1cccnc1
CC(=O)Nc1nnc(S(N)(=O)=O)s1
CC(=O)O
CC(=O)NC@@HC(=O)O
>>> print("----------IUPAC NAMES---------" + "\n" + IUPAC_name1 + "\n" + IUPAC_name2 + "\n" + IUPAC_name3 + "\n" + IUPAC_name4 + "\n" + IUPAC_name5)
----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid

As shown above, the outputs for the first five SMILES strings from the STOUT model are not completely the same as the corresponding outputs from the Ersilia model.

These inputs, SMILES (Simplified Molecular Input Line Entry System), are more concise forms of line representations of molecular structures of chemical compounds that are primarily designed to be understood by machines.

The corresponding outputs are the IUPAC names of the molecular structures of the chemical compounds. IUPAC names follow an established set of rules for the chemical nomenclature of the molecular structures of chemical compounds.

The first, second and third predictions are different, while the rest of the predictions are the same, although the first predictions from both models are almost alike. I wonder if they are simply different valid names for the same chemical compounds, or if one of the models didn't predict them accurately.

carcablop commented 1 year ago

Hello @AdedejiAdewole The results of the two models must be the same; it is the same code. You must pass SMILES as input to the original model (STOUT). From the eml_canonical.csv file it must be the column named "smiles", not "can_smiles" (canonical SMILES); the STOUT model already processes the input. If you pass the first molecule, "Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1", the result of both models should be the same.
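For reference, a minimal sketch of what @carcablop describes, reading the "smiles" column and passing it to the original STOUT package, could look like this (assuming eml_canonical.csv is in the working directory):

```python
import pandas as pd
from STOUT import translate_forward  # same function used earlier in this thread

df = pd.read_csv("eml_canonical.csv")
for smi in df["smiles"].head(5):  # the "smiles" column, not "can_smiles"
    print(smi, "->", translate_forward(smi))
```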

AdedejiAdewole commented 1 year ago

Hello @carcablop Thank you for your insights. I have checked the "smiles" column and switched to it; this is the first SMILES in it: "Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1". The STOUT model still gives outputs different from the Ersilia implementation on Google Colab.

The picture below shows the data columns in the eml.canonical.csv file:

Screenshot 2023-03-17 at 20 04 19

carcablop commented 1 year ago

Hi @AdedejiAdewole It is strange that you get a different output. I have tested the original model (STOUT), passing it the same inputs that you have shared, and I get the same output that you shared from Google Colab; that is to say, it gives me the same result. I share a line of my output for the molecule "Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1": IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 is: [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol. Even if I pass it the molecule from the can_smiles column, it also gives the same result.

This is the complete output from the STOUT model (original model) passing the entire eml_canonical file to it. out_predictions(2).csv

Can you give a detailed explanation of the steps you took to obtain predictions from the STOUT model? For example, did you create a script to read the input file? Can you also provide details of the environment you created to run the model?

AdedejiAdewole commented 1 year ago

Okay @carcablop. I installed the STOUT model using pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git, then started Python with python3 and ran the code in this order:

>>> from STOUT import translate_forward, translate_reverse

This produced the message:

2023-03-17 22:08:50.939921: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Then it prompted me to enter the next lines of code:

>>> SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
>>> IUPAC_name = translate_forward(SMILES)
>>> print("IUPAC name of "+SMILES+" is: "+IUPAC_name)

This was the output: 'IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 is: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol'

carcablop commented 1 year ago

I installed it with the command pip install STOUT-pypi; at that time the version was 2.0.1 (in a conda environment with Python 3.7). If we look at the version history here: https://pypi.org/project/STOUT-pypi/#history you can see that they are now on version 2.0.5. I think this is what would be making the difference.

AdedejiAdewole commented 1 year ago

Okay, mine is also version 2.0.5 actually.

carcablop commented 1 year ago

Yes, and the model uses version 2.0.1.

AdedejiAdewole commented 1 year ago

Okay, so are you suggesting that the two different versions will produce different outputs?

GemmaTuron commented 1 year ago

Hi @carcablop and @AdedejiAdewole

Thanks for these tests! It might be that they have updated the translator from the previous version to the newest one. Regarding the translation from IUPAC to SMILES: the issue is that Ersilia at this moment does not accept text as input, only SMILES. This feature will be implemented soon! @AdedejiAdewole aside from tackling the week 3 tasks, might I ask you to try installing the other version (the one Ersilia runs) and see if the output now coincides? We might want to bump Ersilia's model version to the latest one

AdedejiAdewole commented 1 year ago

Hello @GemmaTuron Thank you for your response. I have installed and run Ersilia's version of the STOUT model using Colab. It still gives outputs that differ from the original model run on my local machine.

The original model outputs are:

----------SMILES---------
Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1
C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5
CC(=O)Nc1sc(nn1)S(=O)=O
CC(O)=O
CC(=O)NC@@HC(=O)O

----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid

Ersilia's outputs

  1. Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 | [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol
  2. C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol
  3. CC(=O)Nc1sc(nn1)S(=O)=O | N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide
  4. CC(O)=O | aceticacid
  5. CC(=O)NC@@HC(O)=O | (2R)-2-acetamido-3-sulfanylpropanoicacid
HellenNamulinda commented 1 year ago

> [quoting @AdedejiAdewole's earlier comment comparing the first five SMILES translations from the STOUT model and the Ersilia model]

Hello @AdedejiAdewole, As you can see from the output of both models (SMILES to IUPAC names), the original model and the one available on the Ersilia Model Hub give the same translations. For the first 5 SMILES in the eml dataset, your output of the original model from the authors' repository is

print("----------IUPAC NAMES---------" + "\n" + IUPAC_name1 + "\n" + IUPAC_name2 + "\n" + IUPAC_name3 + "\n" + IUPAC_name4 +"\n" + IUPAC_name5 )

----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid

For the Ersilia model Hub model, you reported output as

----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid

As you can see, the two models give the same translations.

The translations are not that accurate when compared with the correct IUPAC names of these drugs, i.e.:

abacavir
abiraterone
acetazolamide
acetic acid
acetylcysteine

This just means the model is not performing well on the test data. So these results (human evaluation) can be used to study the translations and fine-tune the model on more diverse data to improve its performance.

AdedejiAdewole commented 1 year ago

Hello @HellenNamulinda If you check my earlier comments closely, you'll see that those are the outputs of the original model, and that I provided different outputs from the Ersilia model. Thank you for your feedback; it's really important that you checked and thought to comment.

HellenNamulinda commented 1 year ago

> [quoting @AdedejiAdewole's earlier comment above with the original STOUT outputs and the Ersilia model outputs for the first five SMILES]

Hi @AdedejiAdewole, You're right. I have looked at the very first sample and the translations are slightly different. It seems to happen especially for some of the longer SMILES. My thinking is that this is happening because the versions are different.

AdedejiAdewole commented 1 year ago

Yes @HellenNamulinda I suspect that too.

GemmaTuron commented 1 year ago

Hi both,

IUPAC is a set of rules for naming organic compounds. In some cases the rules might be interpreted slightly differently, giving rise to small differences, but I think the model implementation is fine! Thanks for your tests; let's move on to the week 3 tasks now

AdedejiAdewole commented 1 year ago

Model Name

DTI Prediction Model

Model Description

A Python-based DL framework that takes SMILES strings and protein amino acid sequences as input pairs. These are fed into molecular encoders that convert the compounds and proteins into their corresponding vector representations. The embedded representations are fed into a decoder to generate predictions: continuous binding scores or binary outputs indicating whether a protein binds to a compound. The framework detects whether a task is regression or classification and uses the appropriate loss function and evaluation method:

  1. MSE for regression
  2. Binary Cross Entropy for classification

The interesting part of this model is the range of encoders: you can switch to the required encoder and connect it to the decoder for predictions. It provides 8 compound encoders and 7 protein encoders.
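As a sketch of how this looks in practice, based on the DeepPurpose README (function names may differ across versions, and the toy data is only there to make the snippet self-contained):

```python
from DeepPurpose import utils, DTI as models

# toy inputs: repeated SMILES / protein sequences just to make the sketch runnable
X_drug = ["CC1=CC=CC=C1"] * 10
X_target = ["MSLLTEVETPTRNEWECRCSDSSD"] * 10
y = [1.0] * 10  # continuous scores -> regression (MSE); 0/1 labels -> classification (BCE)

drug_enc, target_enc = "Morgan", "CNN"  # pick one of the compound / protein encoders
train, val, test = utils.data_process(X_drug, X_target, y,
                                      drug_enc, target_enc,
                                      split_method="random")
config = utils.generate_config(drug_encoding=drug_enc,
                               target_encoding=target_enc,
                               train_epoch=3)
model = models.model_initialize(**config)
model.train(train, val, test)
```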

Slug

drug-target-prediction

Tag

Target identification, Embedding

Publication

https://doi.org/10.1093/bioinformatics/btaa1005

Supplementary Information

More information on the model and the different encoders is provided in the PDF below

DeepPurpose_BIOINFO_SUPP (1).pdf

Source Code

https://github.com/kexinhuang12345/DeepPurpose

License

BSD 3-Clause License

GemmaTuron commented 1 year ago

Hi @AdedejiAdewole !

This is a good example; we already have it on our to-do list, actually :) This is a large study with a lot of applications, so we would start with the pretrained models and see if we can load and run them. Let's first find another model suggestion, and if there is time we might try out the implementation!

AdedejiAdewole commented 1 year ago

Alright, thank you @GemmaTuron. Are we allowed to suggest models that have their implementation and usage on GitLab rather than GitHub?

AdedejiAdewole commented 1 year ago

Model Name

PlasmidHunter: Accurate and fast prediction of plasmid sequences using gene content profile and machine learning

Model Description

Unlike viruses, plasmids are extrachromosomal pieces of naked, double-stranded DNA that can spread within a host cell. Notwithstanding their advantages and importance as tools for gene therapy, drug development, and genetic engineering, they can be harmful to humans. For example, plasmids play a key role in causing antimicrobial resistance (AMR) among related bacterial species, e.g. enabling resistance to many commonly used antibiotics such as tetracycline and penicillin. Plasmids can also transmit virulence, toxicity and pathogenicity to a wider group of bacteria.

PlasmidHunter was created to serve as an identification tool that uses gene content profile alone as the feature to predict plasmid sequences with no need for the raw sequence data, sequence topology and coverage or assembly graph.

Input: any assembled sequence file produced by any modern high-throughput sequencer and assembled by any algorithm
Output: chromosomal or plasmid origin of the contigs
Programming language: Python

Slug

plasmid-hunter

Tag

Target identification

Publication

https://www.biorxiv.org/content/10.1101/2023.02.01.526640v1.full

Supplementary Information

More information on the model is provided in the PDF below

media-1.pdf

Source Code

https://github.com/tianrenmaogithub/PlasmidHunter

License

GPL-3.0 license

GemmaTuron commented 1 year ago

Hi @AdedejiAdewole !

That's a nice model, but currently out of scope of Ersilia, since we focus on the drug discovery process and we are not dealing at this moment with genomic data! Let's try to find a third model that uses chemistry data instead of proteomics or genomic data

AdedejiAdewole commented 1 year ago

Model Name

Terpenes: The chemical space of Terpenes

Model Description

Terpenes are a wide-ranging family of naturally occurring substances with different types of chemical and biological properties. Many of these molecules have already found use in pharmaceuticals. Characterisation of this wide range of molecules with classical approaches has proved to be a daunting task. This model provides more insight into identifying types of terpenes by using a natural product database, COCONUT, to extract information about 60,000 terpenes. For the clustering approach to this dataset, PCA, FastICA, Kernel PCA, t-SNE and UMAP were used as benchmarks. For the classification approach, light gradient boosting machine, k-nearest neighbours, random forests, Gaussian naïve Bayes and multilayer perceptron were used. The best-performing algorithms yielded accuracy, F1 score, precision and other metrics all over 0.9.

Input: terpene features
Output: chemical subclass
Programming language: Python

Slug

terpenes

Tag

Target identification

Publication

https://arxiv.org/abs/2110.15047

PDF format of paper

More information on the model is provided in the PDF below

2110.15047.pdf

Source Code

https://github.com/smortezah/napr

License

MIT

GemmaTuron commented 1 year ago

We are very interested in natural products @AdedejiAdewole! Can you add this model to our model suggestion list? And while you start preparing your final application, would you like to try installing this latest model to see if it is easy to implement? Thanks!

AdedejiAdewole commented 1 year ago

Good morning @GemmaTuron Sorry for the late response, something came up. I will do that now.

AdedejiAdewole commented 1 year ago

I have added the model to the model suggestion list. I am trying to install and implement the model now.

AdedejiAdewole commented 1 year ago

Hello @GemmaTuron. I have been working on the Terpene model and encountered some issues running "pytest napr" to test napr after installation. These issues are related to my version of Python, since there is no Python 3.10 installer for Mac Intel. Some of these issues were:

  1. Python versions lower than 3.10 cannot run match-case statements, so I had to change the match-case statements to if-else statements in all the files that had them.
  2. The | operand in type annotations is not supported in Python versions lower than 3.10, so I had to import annotations in all the Python files that use it, with "from __future__ import annotations" (see the sketch below).
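A hypothetical illustration of the two rewrites (not napr's actual code):

```python
from __future__ import annotations  # lets "str | None" annotations parse on Python < 3.10

# Python >= 3.10 only:
# match kind:
#     case "terpene":
#         label = 1
#     case _:
#         label = 0

# Python 3.9-compatible equivalent:
def encode(kind: str | None) -> int:
    if kind == "terpene":
        return 1
    else:
        return 0
```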

After doing the steps above, I was able to test napr with "pytest napr" and all the tests passed.

(base) adewoleadedeji (master #) ~ $ pytest napr
============================= test session starts ==============================
platform darwin -- Python 3.9.16, pytest-7.2.2, pluggy-0.12.0
rootdir: /Users/adewoleadedeji/napr, configfile: pyproject.toml
plugins: anyio-3.5.0
collecting ... 2023-03-25 12:56:16.536824: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
collected 55 items

napr/napr/apps/coconut/terpene/tests/test_base_terpene.py ...   [  5%]
napr/napr/apps/coconut/terpene/tests/test_explore.py .....      [ 14%]
napr/napr/apps/coconut/terpene/tests/test_preprocessing.py ...  [ 20%]
napr/napr/data/tests/test_base_data.py .                        [ 21%]
napr/napr/data/tests/test_load.py ..                            [ 25%]
napr/napr/evaluation/tests/test_classification.py ..........    [ 43%]
napr/napr/hyperopt/tests/test_base_hyperopt.py ....             [ 50%]
napr/napr/plotting/tests/test_base_plotting.py ..               [ 54%]
napr/napr/utils/tests/test_decorators.py .                      [ 56%]
napr/napr/utils/tests/test_helpers.py .............             [ 80%]
napr/napr/utils/tests/test_random.py ......                     [ 90%]
napr/napr/utils/tests/test_stat.py .....                        [100%]

============================= 55 passed in 14.15s ==============================

Now I am trying to see if this model can be implemented and used so I can properly understand the inputs and outputs.

AdedejiAdewole commented 1 year ago

I have tested the Terpene model in Jupyter using the notebook provided in the napr repo, although I duplicated the repo after cloning so I wouldn't mess up the structure of the original. The model works well and gives high accuracies. I had some import issues, so I had to put the necessary Python files in the right locations, but I got it to work and run.

Data Collection

The terpene dataset was obtained from the COCONUT dataset (a natural product database). Only entries that belonged to the SuperClass "Lipids and lipid-like molecules" were selected, and these were further filtered to molecules belonging to one of the following SubClasses: "Diterpenoids", "Sesquiterpenoids", "Monoterpenoids", "Polyterpenoids", "Sesquaterpenoids", "Sesterterpenoids", "Terpene glycosides", "Terpene lactones" and "Triterpenoids".

Data Preprocessing

Categorical features ("textTaxa", "bcutDescriptor", "chemicalClass", "chemicalSubClass", "chemicalSuperClass" and "directParentClassification") were transformed first. For example, the "textTaxa" feature was encoded by creating four new columns ("plants", "marine", "bacteria" and "fungi"), with 1 assigned to a molecule's column if the corresponding taxonomy was present and 0 if absent. The "bcutDescriptor" entries contained arrays of six float numbers, so they were split and expanded into six separate columns. All terpenes belong to the same chemical class and chemical superclass, so "chemicalClass" and "chemicalSuperClass" were not needed. "chemicalSubClass" was the target, so it wasn't encoded. The "directParentClassification" feature contained 111 values, which were encoded by the integers 0 to 110. The dataset was split with 75% assigned to the training set and 25% to the test set. Missing data were filled using median imputation, and standardisation was also carried out.
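A minimal sketch of what this preprocessing could look like in pandas; the column names come from the description above, while the file name and parsing details are my assumptions:

```python
import pandas as pd

df = pd.read_csv("coconut_terpenes.csv")  # hypothetical export of the filtered entries

# encode the free-text taxonomy into four binary columns
for taxon in ["plants", "marine", "bacteria", "fungi"]:
    df[taxon] = df["textTaxa"].fillna("").str.contains(taxon).astype(int)

# expand the six-float bcutDescriptor arrays into six numeric columns
bcut = df["bcutDescriptor"].str.strip("[]").str.split(",", expand=True).astype(float)
df[[f"bcut_{i}" for i in range(6)]] = bcut

# constant for terpenes, so dropped; chemicalSubClass stays as the target
df = df.drop(columns=["textTaxa", "bcutDescriptor", "chemicalClass", "chemicalSuperClass"])
```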

Model Training

This model was trained with different types of methods (light gradient boosting machine, k-nearest neighbours, random forests, Gaussian naïve Bayes and multilayer perceptron), with the best-performing algorithms yielding accuracy, F1 score, precision and other metrics all over 0.9. The xgboost algorithm performed best and was selected to train the model. Hyperparameter optimisation (tuning) was also carried out on this model and yielded even better accuracy with the best hyperparameters selected. The hyperparameter optimisation went through five trials, and the best tuner yielded an accuracy of almost 100%. All the tuning results and the corresponding pickle file of the model were saved to a folder as tuning progressed.
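A sketch of the training setup described above as a scikit-learn pipeline with XGBoost; it assumes X and y are the feature matrix and the integer-encoded chemicalSubClass target from the preprocessing step:

```python
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# 75/25 split, median imputation and standardisation as described in the text
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = make_pipeline(
    SimpleImputer(strategy="median"),  # fill missing values with the median
    StandardScaler(),                  # standardise features
    XGBClassifier(),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```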

Predictions

I saved the predictions of the test data into a csv file provided below.

predictions.csv

Notebook

My Jupyter notebook where I was able to implement and understand the processes of achieving this model is provided below.

Terpene-classification-classic.ipynb.zip

P.S

  1. I added code to reverse the label encoding done on the target labels, so you are able to see the original formats of the predicted outputs (terpene subclasses); see the sketch after this list.
  2. The model predicts the Terpene subclasses using their physico-chemical descriptors.
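For the label-decoding step in point 1, a short sketch assuming scikit-learn's LabelEncoder was used for the target (an assumption on my part; y_train_labels and preds are hypothetical variables):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(y_train_labels)       # fitted on the original subclass strings
subclass_names = le.inverse_transform(preds)  # integer predictions -> e.g. "Diterpenoids"
```
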
GemmaTuron commented 1 year ago

Hi @AdedejiAdewole

That's great, thanks. I was not aware that Python 3.10 was not available for Mac Intel; I thought it was only lower Python versions that had lost support. Have you checked the stable releases?

GemmaTuron commented 1 year ago

If you have time @AdedejiAdewole, it would also be great if you could have a look at this issue and let us know if the problems are persisting! https://github.com/ersilia-os/ersilia/issues/384

AdedejiAdewole commented 1 year ago

Good morning @GemmaTuron. Hope you had a great weekend. Yes, I did check the stable releases; the latest Python version released with a Mac Intel installer was Python 3.9.13, according to the link you sent. The later versions ship as macOS 64-bit universal2 installers, and I'm not sure these are supported on Mac Intel at this moment.