Closed: AdedejiAdewole closed this issue 1 year ago.
I have followed the steps to install Ersilia, but I encountered an error when trying to run the help command. The error is pasted below:
(base) adewoleadedeji (master #) ~ $ conda activate ersilia
(ersilia) adewoleadedeji (master #) ~ $ ersilia --help
Segmentation fault: 11
(ersilia) adewoleadedeji (master #) ~ $ cd ersilia
(ersilia) adewoleadedeji (master) ersilia $ ersilia --help
Segmentation fault: 11
I successfully solved this error using `conda install -c conda-forge protobuf`.
Hi @AdedejiAdewole
Are you running on MacOS? This is probably related to issue #591.
Hello @GemmaTuron, thank you for your response. I was able to fix that error following issue #591, but I encountered an error while trying to fetch the model. The error is described below.
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨
Error message:
expected str, bytes or os.PathLike object, not NoneType
If this error message is not helpful, open an issue at:
If you haven't, try to run your command in verbose mode (-v in the CLI)
Hi @AdedejiAdewole
Please do as the message says: run your command in verbose mode (-v in the CLI).
It will provide a better log file
Hello @GemmaTuron, I ran the command in verbose mode. The command was `ersilia -v fetch eos3b5e`.
@GemmaTuron This is the full error message when I tried to fetch eos3b5e in verbose mode.
⬇️ Fetching model eos3b5e: molecular-weight
  0%|          | 0/8 [00:00<?, ?it/s]
21:40:07 | INFO | GitHub CLI is not installed. Ersilia can work without it, but we highly recommend that you install this tool.
21:40:07 | DEBUG | Git LFS is installed
Updated Git hooks.
Git LFS initialized.
21:40:07 | DEBUG | Git LFS has been activated
21:40:08 | DEBUG | Connected to the internet
21:40:08 | DEBUG | Conda is installed
21:40:08 | DEBUG | EOS Home path exists
Checking setup: 1.476s
 12%|█████▋    | 1/8 [00:01<00:10, 1.48s/it]
21:40:08 | INFO | Starting delete of model eos3b5e
21:40:08 | INFO | Removing folder /Users/adewoleadedeji/eos/dest/eos3b5e
21:40:08 | INFO | Removing folder /Users/adewoleadedeji/eos/repository/eos3b5e
21:40:13 | INFO | Deleting conda environment eos3b5e
Remove all packages in environment /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e:
21:40:22 | DEBUG | Deleting /Users/adewoleadedeji/eos/isaura/lake/eos3b5e_local.h5
21:40:22 | DEBUG | Deleting /Users/adewoleadedeji/eos/isaura/lake/eos3b5e_public.h5
21:40:22 | INFO | Removing docker images and stopping containers related to eos3b5e
Deleted Containers: 7afecde45deb0a35f5ad5e630b538252d41ecea66acc5554f0f97c37fce5741b
Total reclaimed space: 0B
21:40:23 | DEBUG | Running docker images > /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-szjoakf3/docker-images.txt
21:40:23 | DEBUG | Model entry eos3b5e was not available in the fetched models registry
21:40:23 | SUCCESS | Model eos3b5e deleted successfully
Preparing model: 15.32066297531128s
 25%|███████████▎ | 2/8 [00:16<00:57, 9.62s/it]
21:40:50 | DEBUG | Cloning from github to /Users/adewoleadedeji/eos/dest/eos3b5e
Cloning into 'eos3b5e'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 47 (delta 15), reused 17 (delta 4), pack-reused 0
Receiving objects: 100% (47/47), 25.89 KiB | 73.00 KiB/s, done.
Resolving deltas: 100% (15/15), done.
rm: /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-gypm85: is a directory
21:40:53 | INFO | 🚀 Model starting...
21:40:53 | DEBUG | {'version': '0.11.0', 'slim': False, 'python': 'py37'}
Getting model: 29.204281091690063s
 38%|████████████████▉ | 3/8 [00:46<01:32, 18.56s/it]
21:40:53 | DEBUG | Check if model can be run with vanilla (system) code (i.e. dockerfile has no installs)
21:40:53 | DEBUG | Check bentoml and python version
21:40:53 | INFO | BentoML version {'version': '0.11.0', 'slim': False, 'python': 'py37'}
21:40:53 | DEBUG | Custom Ersilia BentoML is used, no need for modifying protobuf version
21:40:53 | DEBUG | Model needs some installs
21:40:53 | DEBUG | Checking if only python/conda install will be sufficient
21:40:53 | DEBUG | Mode: conda
21:40:53 | DEBUG | Trying to remove path: /Users/adewoleadedeji/bentoml/repository/eos3b5e
21:40:53 | DEBUG | ...successfully
21:40:53 | DEBUG | ...but path did not exist!
21:40:53 | DEBUG | Initializing conda packer
21:40:53 | DEBUG | Packing model with Conda
21:40:53 | DEBUG | Writing install commands
21:40:53 | DEBUG | Run commands: ['pip install rdkit-pypi']
21:40:53 | DEBUG | Writing install commands in /Users/adewoleadedeji/eos/dest/eos3b5e/model_install_commands.sh
21:40:53 | DEBUG | Setting up
21:40:53 | DEBUG | Installs file /Users/adewoleadedeji/eos/dest/eos3b5e/model_install_commands.sh
21:40:53 | DEBUG | Conda environment eos3b5e
21:40:56 | DEBUG | Environment eos3b5e does not exist
21:40:58 | INFO | Cloning base conda environment and adding model dependencies
Source: /Users/adewoleadedeji/opt/anaconda3/envs/eosbase-bentoml-0.11.0-py37
Destination: /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e
Packages: 14
Files: 5758
Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
21:41:23 | DEBUG | Run commandlines on eos3b5e
21:41:23 | DEBUG | python -m pip --disable-pip-version-check install rdkit-pypi
python -m pip --disable-pip-version-check install git+https://github.com/ersilia-os/bentoml-ersilia.git
21:41:25 | DEBUG | Activating base environment
21:41:25 | DEBUG | Current working directory: /Users/adewoleadedeji/eos/dest/eos3b5e
21:41:25 | DEBUG | Running bash /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-8oayy7ix/script.sh > /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-lydnools/command_outputs.log 2>&1
conda activate eos3b5e
21:47:29 | DEBUG | # conda environments:
#
base                             /Users/adewoleadedeji/opt/anaconda3
eos3b5e                       *  /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e
eosbase-bentoml-0.11.0-py37      /Users/adewoleadedeji/opt/anaconda3/envs/eosbase-bentoml-0.11.0-py37
ersilia                          /Users/adewoleadedeji/opt/anaconda3/envs/ersilia
test                             /Users/adewoleadedeji/opt/anaconda3/envs/test
tf                               /Users/adewoleadedeji/opt/anaconda3/envs/tf
Collecting rdkit-pypi
  Using cached rdkit_pypi-2022.9.5-cp37-cp37m-macosx_10_9_x86_64.whl (24.7 MB)
Collecting Pillow
  Using cached Pillow-9.4.0-2-cp37-cp37m-macosx_10_10_x86_64.whl (3.3 MB)
Requirement already satisfied: numpy in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from rdkit-pypi) (1.21.6)
Installing collected packages: Pillow, rdkit-pypi
Successfully installed Pillow-9.4.0 rdkit-pypi-2022.9.5
Collecting git+https://github.com/ersilia-os/bentoml-ersilia.git
  Cloning https://github.com/ersilia-os/bentoml-ersilia.git to /private/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/pip-req-build-zoqvfah8
  Running command git clone --filter=blob:none --quiet https://github.com/ersilia-os/bentoml-ersilia.git /private/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/pip-req-build-zoqvfah8
  Resolved https://github.com/ersilia-os/bentoml-ersilia.git to commit a0f0040a1198e8f1704f0395e5d9ce328aaecf71
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: numpy in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.21.6)
Requirement already satisfied: werkzeug in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.2.3)
Requirement already satisfied: psutil in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (5.9.4)
Requirement already satisfied: alembic in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.10.1)
Requirement already satisfied: sqlalchemy-utils in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.40.0)
Requirement already satisfied: tabulate in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.9.0)
Requirement already satisfied: humanfriendly in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (10.0)
Requirement already satisfied: packaging in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (23.0)
Requirement already satisfied: multidict in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (6.0.4)
Requirement already satisfied: ruamel.yaml in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.17.21)
Requirement already satisfied: python-json-logger in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.0.7)
Requirement already satisfied: flask in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.2.3)
Requirement already satisfied: boto3 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.26.85)
Requirement already satisfied: docker in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (6.0.1)
Requirement already satisfied: sqlalchemy in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.0.5.post1)
Requirement already satisfied: requests in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (2.28.2)
Requirement already satisfied: cerberus in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (1.3.4)
Requirement already satisfied: protobuf<3.19,>=3.8.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (3.18.3)
Requirement already satisfied: prometheus-client in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (0.16.0)
Requirement already satisfied: chardet in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from bentoml==0.11.0) (5.1.0)
Requirement already satisfied: typing-extensions>=4 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (4.5.0)
Requirement already satisfied: Mako in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (1.2.4)
Requirement already satisfied: importlib-metadata in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (6.0.0)
Requirement already satisfied: importlib-resources in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from alembic->bentoml==0.11.0) (5.12.0)
Requirement already satisfied: greenlet!=0.4.17 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from sqlalchemy->bentoml==0.11.0) (2.0.2)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from boto3->bentoml==0.11.0) (1.0.1)
Requirement already satisfied: botocore<1.30.0,>=1.29.85 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from boto3->bentoml==0.11.0) (1.29.85)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from boto3->bentoml==0.11.0) (0.6.0)
Requirement already satisfied: setuptools in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from cerberus->bentoml==0.11.0) (65.6.3)
Requirement already satisfied: urllib3>=1.26.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from docker->bentoml==0.11.0) (1.26.14)
Requirement already satisfied: websocket-client>=0.32.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from docker->bentoml==0.11.0) (1.5.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from requests->bentoml==0.11.0) (3.1.0)
Requirement already satisfied: certifi>=2017.4.17 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from requests->bentoml==0.11.0) (2022.12.7)
Requirement already satisfied: idna<4,>=2.5 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from requests->bentoml==0.11.0) (3.4)
Requirement already satisfied: itsdangerous>=2.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from flask->bentoml==0.11.0) (2.1.2)
Requirement already satisfied: click>=8.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from flask->bentoml==0.11.0) (8.1.3)
Requirement already satisfied: Jinja2>=3.0 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from flask->bentoml==0.11.0) (3.1.2)
Requirement already satisfied: MarkupSafe>=2.1.1 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from werkzeug->bentoml==0.11.0) (2.1.2)
Requirement already satisfied: ruamel.yaml.clib>=0.2.6 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from ruamel.yaml->bentoml==0.11.0) (0.2.7)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.85->boto3->bentoml==0.11.0) (2.8.2)
Requirement already satisfied: zipp>=0.5 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from importlib-metadata->alembic->bentoml==0.11.0) (3.15.0)
Requirement already satisfied: six>=1.5 in /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.30.0,>=1.29.85->boto3->bentoml==0.11.0) (1.16.0)
21:47:29 | DEBUG | Activation done
21:47:29 | DEBUG | Creating environment YAML file
21:47:39 | DEBUG | Storing Conda environment in the local environment database
21:47:39 | DEBUG | Done with the Conda setup
21:47:41 | DEBUG | Using environment eos3b5e
21:47:41 | DEBUG | Running command: python pack.py
21:47:41 | DEBUG | Run commandlines on eos3b5e
21:47:41 | DEBUG | python pack.py
21:47:43 | DEBUG | Activating base environment
21:47:43 | DEBUG | Current working directory: /Users/adewoleadedeji/eos/dest/eos3b5e
21:47:43 | DEBUG | Running bash /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-ws8w7z57/script.sh > /var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-nty1dbmd/command_outputs.log 2>&1
21:48:19 | DEBUG | # conda environments:
#
base                             /Users/adewoleadedeji/opt/anaconda3
eos3b5e                       *  /Users/adewoleadedeji/opt/anaconda3/envs/eos3b5e
eosbase-bentoml-0.11.0-py37      /Users/adewoleadedeji/opt/anaconda3/envs/eosbase-bentoml-0.11.0-py37
ersilia                          /Users/adewoleadedeji/opt/anaconda3/envs/ersilia
test                             /Users/adewoleadedeji/opt/anaconda3/envs/test
tf                               /Users/adewoleadedeji/opt/anaconda3/envs/tf
/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-ws8w7z57/script.sh: line 9: 28934 Segmentation fault: 11 python pack.py
21:48:19 | DEBUG | Activation done
21:48:19 | DEBUG | Previous command successfully run inside eos3b5e conda environment
21:48:19 | DEBUG | Now trying to establish symlinks
21:48:19 | DEBUG | BentoML location is None
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨
Error message:
expected str, bytes or os.PathLike object, not NoneType
If this error message is not helpful, open an issue at:
If you haven't, try to run your command in verbose mode (-v in the CLI)
@pauline-banye @DhanshreeA Hello, sorry to be a bother, but can you please assist in solving this for me?
Were you able to fetch it successfully? I'm getting the same error. I have tried to fetch the model hundreds of times with no luck. It seems like many of us are getting this error while fetching.
Hi @ZakiaYahya and @AdedejiAdewole
Let's get this solved:
ersilia -v fetch modelname > my.log 2>&1
Thank you @GemmaTuron and @ZakiaYahya. I'm using macOS Monterey with an Intel chip. From the log file generated while trying to fetch this model, it seems that the BentoML location is None, and I think this is what terminates the fetch. The log file is attached below; if you scroll to the end, just before the process terminates, you can see the DEBUG step that tries to locate BentoML, finds it is None, and then prints the error message "expected str, bytes or os.PathLike object, not NoneType".
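As a side note, that exact TypeError is what Python raises when None is passed where a filesystem path is expected, which is consistent with the BentoML location coming back as None upstream; a minimal illustration:

```python
import os

# Passing None where a path is expected reproduces the exact message
# seen in the Ersilia log: some path lookup returned None earlier on.
os.fspath(None)
# TypeError: expected str, bytes or os.PathLike object, not NoneType
```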
Hi @AdedejiAdewole
Thanks for the explanation and the log file! Actually, I think the source of the error is in line 187
/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-20iuz3d0/script.sh: line 9: 68314 Segmentation fault: 11 python pack.py
Segmentation faults are a MacOS issue; we have been encountering them since the latest MacOS update. See issue #610 and try to work out whether this is what is happening: try to create a new conda environment for Ersilia with a higher Python version.
Hi @GemmaTuron, from issue #610, he said the solution he found is:
Can you shed more light on how to perform these processes above?
From my understanding, the models were developed using Python 3.7, and installing Ersilia with Python 3.8 should not cause problems, because each individual model has its own conda environment. This means that even if Ersilia is installed in an environment with Python 3.8, as in my case, each model is run in a separate environment with the version specified for that model.
He also mentioned that Python 3.7 isn't compatible with M1 chips, and to use Python 3.8 when installing Ersilia. I did that and also ran the command you suggested in issue #591 to install protobuf, but I'm still getting the same error, shown in the log file below.
Hi @AdedejiAdewole
Please look through the log files before posting them here; I see the latest one is probably just a misspelling:
Could not identify model identifier or slug: modelname:
make sure the model identifier is correctly written.
The Python 3.7 bump to higher versions is because Mac M1 chips no longer support Py37. Precisely because each model has its own conda env, using Py3.8 for Ersilia won't affect the models. Models with a hard requirement for Py3.7 will indeed not work on M1 chips but should work on the rest; we are slowly backtesting and updating them all, and also containerizing them in Docker. Once you have an updated log file from a successfully issued command, explain which line indicates the error you are seeing and we will try to sort it out.
Hello @GemmaTuron
Still getting the same error when trying to fetch the model: `/var/folders/1h/4jhzw_dd6cbfwwhqmgq5m6k80000gn/T/ersilia-jn6q_wlf/script.sh: line 9: 11012 Segmentation fault: 11 python pack.py`.
The log file is provided below.
I've successfully tested the Ersilia model on Google Colab, because I get a segmentation fault error when I try to fetch the model on my local machine. I was advised to use Google Colab to get familiar with the commands while @GemmaTuron troubleshoots the issue.
The commands on Google Colab seem more complex, but I was able to understand what they are doing. The output file of my model test is attached below.
I look forward to testing this model on my local machine and carrying out other tasks.
Motivation to work at Ersilia
Since graduating with a B.Sc. in Computer Science in 2020, I haven't had many opportunities to put into practice what I learnt during and after university. However, I have spent a reasonable amount of time acquiring more knowledge and certifications in machine learning. I was granted an Udacity/AWS Nanodegree scholarship to study Artificial Intelligence. During my six months of study I learned AI and ML techniques such as neural networks for image classification, and used these skills to build a flower image classifier.
I went on to complete a Machine Learning Specialisation certification on Coursera, where I learned supervised learning (neural networks, linear regression, logistic regression and decision trees), unsupervised learning (clustering, anomaly detection), reinforcement learning and recommender systems. I am also familiar with Git and GitHub, which are necessary for open source projects, and I am very good at documentation using the appropriate tools. I have worked on a model that predicts NPK (nitrogen, phosphorus and potassium) levels in soils and suggests the quantity of NPK fertiliser to be added to low-NPK soils; this project involved feature engineering to generate new data features and improve the dataset. I have also worked on Kaggle projects such as house price prediction, wine quality prediction, stroke prediction, gemstone quality prediction and many others to improve my ML skills.
It is impressive what Ersilia has been able to achieve in just three years, and I am eager to be a part of it, as the experience has been encouraging so far. The responses to questions, the amount of work put into meeting contributors' needs and requests, and the guidance from the mentors show how important Ersilia is to Edoardo, Miquel and Gemma, and I'm privileged to be a part of this learning process. I look forward to applying the knowledge gained so far and would be honoured to use it in the infectious and neglected disease research field. The potential of ML/AI to improve the world's health cannot be overemphasised, and I look forward to being a part of it.
Thank you for this opportunity
I studied two of the models available and decided to select STOUT for several reasons:
SMILES (Simplified Molecular Input Line Entry System) strings are concise line representations of molecules, primarily designed to be understood by machines, and have been incorporated into many major open-source cheminformatics toolkits.
This research uses Neural Machine Translation (NMT) to convert machine-readable chemical line notations such as SMILES into IUPAC names and vice versa. From this, the idea emerged to build a SMILES-to-IUPAC name translator called STOUT. The two chemical representations were treated as two different languages: each SMILES string and its corresponding IUPAC name were treated as two different sentences that have the same meaning.
The effect of abundant, high-quality data when training machine learning models cannot be over-emphasised. To achieve maximal accuracy with NMT, a large amount of high-quality data is needed, so datasets were generated for SMILES-to-IUPAC name and IUPAC-name-to-SMILES translation.
All the molecules were obtained from PubChem, an open molecule database, and downloaded in SDF format. Hydrogens were removed and the molecules were converted to canonical SMILES strings using the CDK. 111 million molecules were obtained and filtered through a set of standard rules, producing a final 81 million molecules. These SMILES were then converted to IUPAC names using ChemAxon's molconvert software.
I understood that the SMILES were converted to SELFIES, a representation of chemical structures that is simpler for neural networks to consume.
Two separate datasets were created: a 30 million and a 60 million molecule dataset with corresponding IUPAC names and SELFIES, respectively. Each IUPAC name and SELFIES string was separated into tokens using a space as the delimiter, as sketched below.
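A hypothetical illustration of that space-delimited tokenization; the function name and the trick of splitting SELFIES on its bracketed symbols are mine, not the authors':

```python
# Hypothetical sketch: SELFIES symbols are bracketed, so inserting a space
# between consecutive symbols lets us split the string into tokens.
def tokenize_selfies(selfies_string: str) -> list:
    return selfies_string.replace("][", "] [").split()

print(tokenize_selfies("[C][N][C][=N][C]"))
# ['[C]', '[N]', '[C]', '[=N]', '[C]']
```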
The network uses an encoder-decoder architecture: input strings are fed to the encoder, and the encoder's outputs are fed into the decoder as its input. I understood that:
Basically, this means that the same network architecture is used for both translation directions simply by swapping the input and output datasets.
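A minimal sketch of such an encoder-decoder in Keras, with made-up vocabulary sizes and dimensions; this is not the authors' architecture (STOUT's actual network is more sophisticated), just the general shape of the idea:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, dim = 500, 800, 256  # made-up sizes

# Encoder: embeds input tokens (e.g. SELFIES) and returns its final state.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, enc_state = layers.GRU(dim, return_state=True)(enc_emb)

# Decoder: consumes target tokens (e.g. IUPAC name tokens), initialised
# with the encoder state, and predicts the next token at each step.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out = layers.GRU(dim, return_sequences=True)(dec_emb, initial_state=enc_state)
logits = layers.Dense(tgt_vocab)(dec_out)

# Training the reverse direction just means swapping the two datasets.
model = Model([enc_in, dec_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```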
Hyperparameters
This model shows the importance of strong processing units: training a neural network on a CPU with that amount of data would take many months, or might never finish due to interruptions and other factors. Here, an average training epoch takes 27 hours on a strong GPU and is reduced to about 2 hours on a very strong TPU. This demonstrates the importance of strong processing units in machine learning.
Model testing
2.2 million molecules were used for testing, and BLEU scores were used to measure prediction accuracy, alongside Tanimoto similarities. Of course, the predicted IUPAC names had to be converted back to SMILES using OPSIN in order to compute Tanimoto similarities for those predictions. This was very interesting to me.
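For reference, a minimal sketch of the Tanimoto part of that evaluation using RDKit; this is my own illustration, and it assumes the OPSIN back-conversion has already produced a predicted SMILES string:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b):
    """Tanimoto similarity between two molecules given as SMILES."""
    fp_a, fp_b = (
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in (smiles_a, smiles_b)
    )
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

# Identical molecules score 1.0; a perfect round-trip prediction would too.
print(tanimoto("CC(=O)O", "CC(=O)O"))
```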
I also understood that the difference in training time between SELFIES-to-IUPAC-name and IUPAC-name-to-SELFIES translation is a result of the complexity of IUPAC names: IUPAC names contain longer and more complex strings, so SELFIES-to-IUPAC-name translation takes more training time, since unpacking and reproducing IUPAC names takes longer.
It would be interesting to see the skeleton of this neural network architecture: the number of layers, the units in each layer, the activation functions, and the methods used to reduce bias and variance, and to know whether regularisation was applied when bias or variance was encountered.
This is very interesting work, and I would like to learn more and be involved in it, so I have studied the model and installed it on my local machine.
After successfully installing the STOUT model to my system, I was able to run predictions on my local machine. The steps to install and run the model are listed below:
pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git
on my CLI. The usage of the model on my CLI is shown below:
>>> from STOUT import translate_forward, translate_reverse
2023-03-17 10:08:20.131610: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
>>> IUPAC_name = translate_forward(SMILES)
>>> print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
IUPAC name of CN1C=NC2=C1C(=O)N(C(=O)N2C)C is: 1,3,7-trimethylpurine-2,6-dione
>>> IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
>>> SMILES = translate_reverse(IUPAC_name)
>>> print("SMILES of "+IUPAC_name+" is: "+SMILES)
SMILES of 1,3,7-trimethylpurine-2,6-dione is: CN1C=NC2=C1C(=O)N(C)C(=O)N2C
`from STOUT import translate_forward, translate_reverse` imports the two functions that perform both translations, as explained in the publication. The `translate_forward` function translates SMILES to IUPAC: SMILES strings are fed into the neural network, and the corresponding IUPAC name is produced as output. Conversely, `translate_reverse` translates IUPAC to SMILES: the more complex IUPAC string is fed into the neural network, which produces the corresponding SMILES as output. I'm guessing the transformation of SMILES to SELFIES, and of SELFIES back to SMILES, takes place inside these functions; the SMILES are transformed to SELFIES because of their simpler form, which makes them easier to unpack when fed into the neural network.
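That SMILES/SELFIES round trip can be reproduced directly with the selfies package the publication mentions; a minimal sketch of my own, not STOUT internals:

```python
import selfies as sf

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # caffeine, from the example above

selfies_str = sf.encoder(smiles)    # SMILES -> SELFIES
restored = sf.decoder(selfies_str)  # SELFIES -> SMILES

print(selfies_str)
print(restored)
```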
This must be a multi-label, multi-class classification; I would like to see how the multiple labels were generated and put together, because I was working on a similar model before the start of this internship, so this will provide more insight.
I will now run the model on the Ersilia Model Hub to compare the results. I will be using Google Colab, as I have been doing, because I wasn't able to fetch the initial model on my local machine due to segmentation faults.
I have fetched the model corresponding to the STOUT model on the Ersilia Model Hub using Google Colab, the first five predictions are shown below:
| | key | input | iupacs_names |
|---|-----|-------|--------------|
| 0 | MCGSCOLBFJQGHM-SCZZXKLOSA-N | Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1 | [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol |
| 1 | GZOSMCIZMLWJML-VJLLXTKPSA-N | C[C@]12CC[C@H]3C@@H[C@@H]1CC=C2c1cccnc1 | (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol |
| 2 | BZKPWHYZMXOIDC-UHFFFAOYSA-N | CC(=O)Nc1nnc(S(N)(=O)=O)s1 | N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide |
| 3 | QTBSBXVTEAMEQO-UHFFFAOYSA-N | CC(=O)O | aceticacid |
| 4 | PWKSKIMOESPYIA-BYPYZUCNSA-N | CC(=O)NC@@HC(=O)O | (2R)-2-acetamido-3-sulfanylpropanoicacid |
The results of the Ersilia model are similar to the STOUT results in that it converts molecules represented as SMILES to IUPAC names, but it doesn't convert IUPAC names back to SMILES. The model took approximately 22.342 minutes to make 442 SMILES predictions (roughly 3 seconds per molecule), even when run on a GPU. This shows the importance of a good processing unit when training, and even when making predictions.
I would also like to see IUPAC-to-SMILES translation incorporated into this model, and to be part of that.
I have tried predicting the first five SMILES contained in Ersilia's SMILES file with the STOUT model. I did this to properly compare the results of both models; the code and outputs are provided below:
>>> from STOUT import translate_forward, translate_reverse
2023-03-17 13:14:44.147335: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> SMILE1 = 'Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1'
>>> SMILE2 = 'C[C@]12CC[C@H]3C@@H[C@@H]1CC=C2c1cccnc1'
>>> SMILE3 = 'CC(=O)Nc1nnc(S(N)(=O)=O)s1'
>>> SMILE4 = 'CC(=O)O'
>>> SMILE5 = 'CC(=O)NC@@HC(=O)O'
>>> IUPAC_name1 = translate_forward(SMILE1)
>>> IUPAC_name2 = translate_forward(SMILE2)
>>> IUPAC_name3 = translate_forward(SMILE3)
>>> IUPAC_name4 = translate_forward(SMILE4)
>>> IUPAC_name5 = translate_forward(SMILE5)
>>> print("----------SMILES---------" + "\n" + SMILE1 + "\n" + SMILE2 + "\n" + SMILE3 + "\n" + SMILE4 + "\n" + SMILE5)
----------SMILES---------
Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1
C[C@]12CC[C@H]3C@@H[C@@H]1CC=C2c1cccnc1
CC(=O)Nc1nnc(S(N)(=O)=O)s1
CC(=O)O
CC(=O)NC@@HC(=O)O
>>> print("----------IUPAC NAMES---------" + "\n" + IUPAC_name1 + "\n" + IUPAC_name2 + "\n" + IUPAC_name3 + "\n" + IUPAC_name4 + "\n" + IUPAC_name5)
----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid
The outputs for the first five SMILES strings from the STOUT model are not completely the same as the outputs for the same five SMILES strings from the Ersilia model, as shown above.
These inputs, SMILES (Simplified Molecular Input Line Entry System), are more concise forms of line representations of molecular structures of chemical compounds that are primarily designed to be understood by machines.
The corresponding outputs are the IUPAC names of the molecular structures of the chemical compounds. IUPAC names follow an established set of rules for the chemical nomenclature of the molecular structures of chemical compounds.
The first, second and third predictions differ, while the rest of the predictions are the same, although the first predictions from both models are almost alike. I wonder whether they are just different forms of the names of the chemical compounds, or whether the model didn't predict them accurately.
Hello @AdedejiAdewole. The results of the models must be the same; it is the same code. You must pass SMILES as input to the original model (STOUT). From the eml_canonical.csv file, it must be the column named "smiles", not "can_smiles" (canonical SMILES); the STOUT model already processes the input. If you pass the first molecule, "Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1", the result from both models should be the same.
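For instance, a minimal sketch of picking the right column with pandas (the column names and file name are the ones discussed in this thread):

```python
import pandas as pd

# Load the EML dataset and take the raw "smiles" column, not "can_smiles";
# STOUT does its own processing of the input.
df = pd.read_csv("eml_canonical.csv")
smiles_list = df["smiles"].tolist()
print(smiles_list[0])
```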
Hello @carcablop, thank you for your insights. I have checked the "smiles" column and switched to it; the first SMILES in it is "Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1". The STOUT model still gives different outputs from the Ersilia implementation on Google Colab.
The picture below shows the data columns in the eml_canonical.csv file:
Hi @AdedejiAdewole. It is strange that you get a different output. I have tested the original model (STOUT), passing it the same inputs you shared, and I get the same output you shared from Google Colab; that is to say, it gives me the same result. I share a log of my output passing the molecule "Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1": `IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 is: [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol`. Even if I pass it the molecule from the can_smiles column, it also gives the same result.
This is the complete output from the STOUT model (original model) passing the entire eml_canonical file to it. out_predictions(2).csv
Can you give a detailed explanation of the steps you followed to obtain predictions from the STOUT model? For example, if you created a script, how did you read the input file? Can you also provide details of the environment you created to run the model?
Okay @carcablop, I installed the STOUT model using `pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git`. I then started Python with `python3` and ran the following code in this order:
>>> from STOUT import translate_forward, translate_reverse

This produced this message:

2023-03-17 22:08:50.939921: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Then it prompted me to enter the next lines of code:

>>> SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
>>> IUPAC_name = translate_forward(SMILES)
>>> print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
This was the output 'IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1 is: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol'
I installed it with the command `pip install STOUT-pypi`; at that time the version was 2.0.1 (in a conda environment with Python 3.7). If we look at the version history here: https://pypi.org/project/STOUT-pypi/#history
you can see that they are now on version 2.0.5. I think this is what would be making the difference.
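If it helps, a quick way to check which version is actually installed in a given environment (standard library only, Python 3.8+):

```python
from importlib.metadata import version

# Prints the installed STOUT-pypi version, e.g. "2.0.1" or "2.0.5"
print(version("STOUT-pypi"))
```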
Okay, mine is also version 2.0.5 actually.
Yes, and the model uses version 2.0.1.
Okay, so are you suggesting that the two different versions will produce different outputs?
Hi @carcablop and @AdedejiAdewole
Thanks for these tests! It might be that they have updated the translator from the previous to the newest version. Regarding the translation from IUPAC to SMILES: the issue is that Ersilia at this moment does not accept text as input, only SMILES. This feature will be implemented soon! @AdedejiAdewole, aside from tackling the week 3 tasks, might I ask you to try installing the other version (the one Ersilia runs) and see if the output now coincides? We might want to bump Ersilia's model version to the latest one.
Hello @GemmaTuron, thank you for your response. I have installed and run Ersilia's version of the STOUT model using Colab. It still gives the same differing outputs when compared to the original model run on my local machine.
The original model outputs are:

----------SMILES---------
Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1
C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5
CC(=O)Nc1sc(nn1)S(=O)=O
CC(O)=O
CC(=O)NC@@HC(=O)O
----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid
Ersilia's outputs
I have tried predicting the first five SMILES contained in Ersilia's SMILES file with the STOUT model, to properly compare the results of both models; the code and outputs are provided below:

>>> from STOUT import translate_forward, translate_reverse
2023-03-17 13:14:44.147335: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> SMILE1 = 'Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1'
>>> SMILE2 = 'C[C@]12CC[C@H]3C@@H[C@@h]1CC=C2c1cccnc1'
>>> SMILE3 = 'CC(=O)Nc1nnc(S(N)(=O)=O)s1'
>>> SMILE4 = 'CC(=O)O'
>>> SMILE5 = 'CC(=O)NC@@HC(=O)O'
>>> IUPAC_name1 = translate_forward(SMILE1)
>>> IUPAC_name2 = translate_forward(SMILE2)
>>> IUPAC_name3 = translate_forward(SMILE3)
>>> IUPAC_name4 = translate_forward(SMILE4)
>>> IUPAC_name5 = translate_forward(SMILE5)
>>> print("----------SMILES---------" + "\n" + SMILE1 + "\n" + SMILE2 + "\n" + SMILE3 + "\n" + SMILE4 + "\n" + SMILE5)
----------SMILES---------
Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1
C[C@]12CC[C@H]3C@@H[C@@h]1CC=C2c1cccnc1
CC(=O)Nc1nnc(S(N)(=O)=O)s1
CC(=O)O
CC(=O)NC@@HC(=O)O
>>> print("----------IUPAC NAMES---------" + "\n" + IUPAC_name1 + "\n" + IUPAC_name2 + "\n" + IUPAC_name3 + "\n" + IUPAC_name4 + "\n" + IUPAC_name5)
----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid
Hello @AdedejiAdewole. As you can see from the outputs of both models (SMILES to IUPAC names), the original model and the one available on the Ersilia Model Hub give the same translations. For the first 5 SMILES in the eml dataset, your output from the original model (the authors' repository) is:
print("----------IUPAC NAMES---------" + "\n" + IUPAC_name1 + "\n" + IUPAC_name2 + "\n" + IUPAC_name3 + "\n" + IUPAC_name4 +"\n" + IUPAC_name5 )
----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid
For the Ersilia Model Hub model, you reported the output as:
----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid
As you can see, the two models give the same translations.
The translations are not that accurate when compared with the correct IUPAC names, i.e
abacavir
abiraterone
acetazolamide
acetic acid
acetylcysteine
This just means the model is not performing well on the test data. These results (from human evaluation) can be used to study the translations and to fine-tune the model on more diverse data to improve its performance.
Hello @HellenNamulinda, if you check my earlier comments closely, you'll see that those are the outputs of the original model, and that I reported different outputs from the Ersilia model. Thank you for your feedback; I appreciate that you checked and took the time to comment.
Hello @GemmaTuron, thank you for your response. I have installed and run Ersilia's version of the STOUT model using Colab. It still gives the same differing outputs when compared to the original model run on my local machine.
The original model outputs are:

----------SMILES---------
Nc1nc(NC2CC2)c3ncn([C@@h]4CC@HC=C4)c3n1
C[C@]12CCC@HCC1=CC[C@@h]3[C@@h]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5
CC(=O)Nc1sc(nn1)S(=O)=O
CC(O)=O
CC(=O)NC@@HC(=O)O
----------IUPAC NAMES---------
[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide
aceticacid
(2R)-2-acetamido-3-sulfanylpropanoicacid
Ersilia's outputs
- Nc1nc(NC2CC2)c3ncn([C@@h]4CC@HC=C4)c3n1 | [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol
- C[C@]12CCC@HCC1=CC[C@@h]3[C@@h]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol
- CC(=O)Nc1sc(nn1)S(=O)=O | N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide
- CC(O)=O | aceticacid
- CC(=O)NC@@HC(O)=O | (2R)-2-acetamido-3-sulfanylpropanoicacid
Hi @AdedejiAdewole, You're right. I have seen the very first sample and the translations are slightly different. It seems to happen especially for some SMILES that are longer. My thinking is that this is happening because the versions are different.
Yes @HellenNamulinda I suspect that too.
Hi both,
IUPAC nomenclature is a set of naming rules for organic compounds. In some cases the rules can be interpreted slightly differently, giving rise to small differences, but I think the model implementation is fine! Thanks for your tests; let's move on to the week 3 tasks now.
DTI Prediction Model
A Python DL framework that takes SMILES strings and protein amino acid sequences as input pairs. These are fed into molecular encoders that convert the compounds and proteins to their corresponding vector representations. The embedded representations are fed into a decoder to generate the prediction outputs: continuous binding scores, or binary outputs indicating whether a protein binds to a compound. The framework detects whether a task is regression or classification and uses the appropriate loss function and evaluation method.
The interesting part of this model is the range of encoders: you can switch to the required encoder model and connect it to the decoder for predictions. It provides a choice of 8 compound encoders and 7 protein encoders, as sketched below.
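Based on the DeepPurpose README, usage looks roughly like the sketch below; the toy drug/target/affinity values are made up, and the exact signatures should be checked against the repository:

```python
from DeepPurpose import utils
from DeepPurpose import DTI as models

# Toy data: one SMILES string, one amino acid sequence, one binding score.
X_drug = ["CC1=CC=C(C=C1)C2=CC(=NN2C3=CC=C(C=C3)S(=O)(=O)N)C(F)(F)F"]
X_target = ["MKKFFDSRREQGGSGLGSGSSGGGGSSSGLGSGYIGR"]
y = [7.3]

# Pick one of the interchangeable encoders for each input type.
drug_encoding, target_encoding = "Morgan", "CNN"

train, val, test = utils.data_process(
    X_drug, X_target, y,
    drug_encoding, target_encoding,
    split_method="random",
)

config = utils.generate_config(
    drug_encoding=drug_encoding,
    target_encoding=target_encoding,
    train_epoch=3,
)
model = models.model_initialize(**config)
model.train(train, val, test)
```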
drug-target-prediction
Target identification, Embedding
https://doi.org/10.1093/bioinformatics/btaa1005
More information on the model and the different encoders is provided in the PDF below:
DeepPurpose_BIOINFO_SUPP (1).pdf
https://github.com/kexinhuang12345/DeepPurpose
BSD 3-Clause License
Hi @AdedejiAdewole !
This is a good example; we actually already have it in our to-do list :) This is a large study with a lot of applications, so we would start with the pretrained models and see if we can load and run them. Let's first find another model suggestion, and if there is time we might try out the implementation!
Alright, thank you @GemmaTuron. Are we allowed to suggest models that have their implementation and usage on GitLab rather than GitHub?
PlasmidHunter: Accurate and fast prediction of plasmid sequences using gene content profile and machine learning
Contrary to viruses, plasmids are extrachromosomal pieces of naked, double-stranded DNA that can spread within host cells. Notwithstanding their advantages and importance as tools for gene therapy, medication development and genetic engineering, they may be harmful to humans. For example, plasmids play a key role in causing antimicrobial resistance (AMR) among related bacterial species, e.g. enabling resistance to many commonly used antibiotics such as tetracycline and penicillin. Plasmids can also transmit virulence, toxicity and pathogenicity to a wider group of bacteria.
PlasmidHunter was created as an identification tool that uses the gene content profile alone to predict plasmid sequences, with no need for raw sequence data, sequence topology, coverage or assembly graphs.
Input: any assembled sequence file produced by any modern high-throughput sequencer and assembled by any algorithm
Output: chromosomal or plasmid origin of the contigs
Programming language: Python
plasmid-hunter
Target identification
https://www.biorxiv.org/content/10.1101/2023.02.01.526640v1.full
More information on the model is provided in the PDF below:
https://github.com/tianrenmaogithub/PlasmidHunter
GPL-3.0 license
Hi @AdedejiAdewole !
That's a nice model, but it is currently out of the scope of Ersilia, since we focus on the drug discovery process and are not dealing with genomic data at this moment! Let's try to find a third model that uses chemistry data instead of proteomics or genomics data.
Terpenes: The chemical space of Terpenes
Terpenes are a large family of naturally occurring substances with diverse chemical and biological properties, and many of these molecules have already found use in pharmaceuticals. Characterising such a wide range of molecules with classical approaches has proved to be a daunting task. This model provides more insight into identifying types of terpenes by using a natural product database, COCONUT, to extract information about 60,000 terpenes. For the clustering approach to this dataset, PCA, FastICA, Kernel PCA, t-SNE and UMAP were used as benchmarks. For the classification approach, light gradient boosting machine, k-nearest neighbours, random forests, Gaussian naive Bayes and multilayer perceptron were used. The best-performing algorithms yielded accuracy, F1 score, precision and other metrics all over 0.9.
Input: terpene features
Output: chemical subclass
Programming language: Python
terpenes
Target identification
https://arxiv.org/abs/2110.15047
More information on the model is provided in the PDF below:
https://github.com/smortezah/napr
MIT
We are very interested in natural products @AdedejiAdewole! Can you add this model to our model suggestion list? And while you start preparing your final application, would you like to try installing this latest model to see if it is easy to implement? Thanks!
Good morning @GemmaTuron Sorry for the late response, something came up. I will do that now.
I have added the model to the model suggestion list. I am now trying to install and implement the model.
Hello @GemmaTuron. I have been working on the Terpene model and encountered some issues running `pytest napr` to test napr after installation. These issues are related to my Python version, and there is no Python 3.10 build for Mac Intel; some of these issues are:
After doing the steps above, I was able to test napr with "pytest napr" and the test was successful.
(base) adewoleadedeji (master #) ~ $ pytest napr
============================= test session starts ==============================
platform darwin -- Python 3.9.16, pytest-7.2.2, pluggy-0.12.0
rootdir: /Users/adewoleadedeji/napr, configfile: pyproject.toml
plugins: anyio-3.5.0
collecting ... 2023-03-25 12:56:16.536824: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
collected 55 items

napr/napr/apps/coconut/terpene/tests/test_base_terpene.py ...                [  5%]
napr/napr/apps/coconut/terpene/tests/test_explore.py .....                   [ 14%]
napr/napr/apps/coconut/terpene/tests/test_preprocessing.py ...               [ 20%]
napr/napr/data/tests/test_base_data.py .                                     [ 21%]
napr/napr/data/tests/test_load.py ..                                         [ 25%]
napr/napr/evaluation/tests/test_classification.py ..........                 [ 43%]
napr/napr/hyperopt/tests/test_base_hyperopt.py ....                          [ 50%]
napr/napr/plotting/tests/test_base_plotting.py ..                            [ 54%]
napr/napr/utils/tests/test_decorators.py .                                   [ 56%]
napr/napr/utils/tests/test_helpers.py .............                          [ 80%]
napr/napr/utils/tests/test_random.py ......                                  [ 90%]
napr/napr/utils/tests/test_stat.py .....                                     [100%]
============================= 55 passed in 14.15s ==============================
Now I am trying to see if this model can be implemented and used so I can properly understand the inputs and outputs.
I have tested the Terpene model in Jupyter using the notebook provided in the napr repo, although I duplicated the repo after cloning so I wouldn't mess up the structure of the original. The model works well and gives high accuracies. I had some import issues, so I had to put the necessary Python files in the right locations, but I got it to work and run.
The terpene dataset was derived from the COCONUT dataset (a natural products database). Only entries belonging to the SuperClass "Lipids and lipid-like molecules" were selected, and these were further filtered to molecules belonging to one of the following SubClasses: "Diterpenoids", "Sesquiterpenoids", "Monoterpenoids", "Polyterpenoids", "Sesquaterpenoids", "Sesterterpenoids", "Terpene glycosides", "Terpene lactones" and "Triterpenoids".
Categorical features ("textTaxa", "bcutDescriptor", "chemicalClass", "chemicalSubClass", "chemicalSuperClass" and "directParentClassification") were transformed first. For example, the "textTaxa" feature was encoded by creating four new columns ("plants", "marine", "bacteria" and "fungi"), with 1 assigned in a molecule's column if the corresponding taxonomy was present and 0 if absent. The "bcutDescriptor" feature contained arrays of six floats, so it was split and expanded into six separate columns. All terpenes share the same chemical class and chemical super class, so "chemicalClass" and "chemicalSuperClass" were not needed. "chemicalSubClass" was the target, so it wasn't encoded. The "directParentClassification" feature contained 111 values, which were encoded as the integers 0 to 110. The dataset was split with 75% assigned to the training set and 25% to the test set. Missing data were filled using a median imputer, and standardisation was also carried out. A sketch of this preprocessing follows.
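Here is a small pandas illustration of those encodings on a made-up two-row frame; the column names are the COCONUT fields mentioned above, everything else is illustrative:

```python
import pandas as pd

# Two made-up rows mimicking the COCONUT fields discussed above.
df = pd.DataFrame({
    "textTaxa": ["plants fungi", "marine"],
    "bcutDescriptor": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                       [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]],
    "directParentClassification": ["Diterpenoids", "Triterpenoids"],
})

# One binary column per taxonomy keyword.
for taxon in ["plants", "marine", "bacteria", "fungi"]:
    df[taxon] = df["textTaxa"].str.contains(taxon).astype(int)

# Expand the six-float bcutDescriptor array into six columns.
bcut = pd.DataFrame(df["bcutDescriptor"].tolist(),
                    columns=[f"bcut_{i}" for i in range(6)])
df = pd.concat([df.drop(columns=["textTaxa", "bcutDescriptor"]), bcut], axis=1)

# Integer-encode the direct parent classification (0..110 in the real data).
df["directParentClassification"] = (
    df["directParentClassification"].astype("category").cat.codes
)
print(df)
```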
The model was trained with several methods (light gradient boosting machine, k-nearest neighbours, random forests, Gaussian naive Bayes and multilayer perceptron), with the best-performing algorithms yielding accuracy, F1 score, precision and other metrics all over 0.9. The xgboost algorithm performed best and was selected to train the final model. Hyperparameter optimisation (tuning) was also carried out, yielding even better accuracy with the best hyperparameters selected. The optimisation ran for five trials, and the best tuner yielded an accuracy of almost 100%. All the tuning results and the corresponding pickle file of the model were saved to a folder as tuning progressed. The sketch below shows the general shape of such a pipeline.
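A minimal, generic illustration of that train/evaluate flow (my own sketch, not napr's code): a 75/25 split, median imputation, standardisation, then an XGBoost classifier on a synthetic stand-in for the feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Synthetic stand-in for the preprocessed terpene feature matrix.
X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # the 75/25 split described above

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),  # fill missing values with the median
    StandardScaler(),                  # standardise the features
    XGBClassifier(n_estimators=200),   # gradient-boosted tree classifier
)
pipeline.fit(X_train, y_train)
print("test accuracy:", pipeline.score(X_test, y_test))
```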
I saved the predictions of the test data into a csv file provided below.
My Jupyter notebook where I was able to implement and understand the processes of achieving this model is provided below.
Terpene-classification-classic.ipynb.zip
Hi @AdedejiAdewole
That's great, thanks. I was not aware that Python 3.10 was not available for Mac Intel; I thought it was only lower Python versions that lacked support. Have you checked the stable releases?
If you have time @AdedejiAdewole it would be great as well if you can have a look at this issue and let us know if the problems are persisting! https://github.com/ersilia-os/ersilia/issues/384
Good morning @GemmaTuron, I hope you had a great weekend. Yes, I did check the stable releases; the latest Python version released with a Mac Intel-specific installer was Python 3.9.13, according to the link you sent. The later versions ship a macOS 64-bit universal2 installer, and I'm not sure whether that is supported on Mac Intel at the moment.
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application