ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
GNU General Public License v3.0

✍️ Contribution period: Samuel Maina #631

Closed samuelmaina closed 1 year ago

samuelmaina commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

samuelmaina commented 1 year ago

My motivation to work at Ersilia: I hope you are well. I graduated in December 2022 from Moi University in Kenya with a BSc in Computer Science. I have been fascinated by computers and by how fast and accurate they are (if given correct instructions). If humanity can harness that power, we can make the world a better and more abundant place. I have experienced hard living conditions and severe poverty during my lifetime. I have seen people in my community die for lack of medicines and vaccines that are readily available in other parts of the world. The health conditions in Africa and other third world countries need to be addressed with total seriousness. I have developed a few machine learning models, such as an image classifier and a predictor of buying behaviour based on income and demographic information. I have practiced techniques in Python such as clustering and classification, with algorithms such as k-means and random forests. I have used common packages such as numpy, pandas and sklearn, and many others for networking, drawing graphs and plots, etc. I hope that contributing to the Ersilia community will open my eyes to what can be done. Ersilia is a large community with advanced technologies and algorithms, and I hope to learn advanced AI/ML to increase my knowledge and skills. I am eager to learn, collaborate and contribute to the Ersilia community during my internship. Thank you for your consideration.

GemmaTuron commented 1 year ago

Hi @samuelmaina, welcome to Ersilia, great to have you here! Please let us know which system you are using and whether you had any issues installing Ersilia. When you are done, check this issue and see whether the bug Ahmed is encountering is specific to his system or also happens to you! Please work together to make sure this model is working :)

samuelmaina commented 1 year ago

I am using WSL2 (Windows 10) with Ubuntu 20.04 LTS. I had one error during the installation, which I have raised as a bug at issue. I was able to run the sample model. I will look into issue and get back to you.

samuelmaina commented 1 year ago

I used the 2 files (run and list_run) provided by @pauline-banye. The model ran successfully and produced the following output: run_output.csv and list_run_output.csv

GemmaTuron commented 1 year ago

Thanks @samuelmaina for the tests! Sorry, closed the issue inadvertently

samuelmaina commented 1 year ago

Hello @GemmaTuron, I am running the STOUT model. The Ersilia Model Hub version and the GitHub version give different predictions for the smiles and can-smiles. I extracted the smiles and can-smiles columns from the eml_canonical.csv file provided in the contribution guide, then made predictions for both sets of data using the STOUT module. I used Python to carry out the steps. I chose this model because it uses deep learning with neural networks to make predictions. The model was trained with billions of smiles labelled with their IUPAC names, and it learned to correlate the structure of a smiles string with the IUPAC name. IUPAC name generation has a lot of algorithmic complexity and a large set of rules, which makes it very hard to code all the rules into program generators. The model exposes only two functions: translate_forward (which gives the IUPAC name for a smiles string) and translate_reverse (which gives the smiles string for an IUPAC name). It has a BLEU score of about 90% and a Tanimoto similarity index of more than 0.9, according to those who trained it. It did not give any confidence level as an output when I ran it. smiles.csv can_smiles.csv The Python code:

from STOUT import translate_forward, translate_reverse
import csv
import json

def read_data_from_file(path):
    # Read the first column of every non-empty row (one SMILES string per row).
    result = []
    with open(path, 'r') as file:
        csvreader = csv.reader(file)
        for row in csvreader:
            if row:
                result.append(row[0])
    return result

def write_data_to_json_file(output_file, data):
    with open(output_file, 'w') as f:
        json.dump(data, f)

def separate_smiles_and_can_smiles_into_separate_files():
    # Assumption: eml_canonical.csv has a header row followed by the columns
    # drug name, smiles, can_smiles (in that order). Adjust the indices and the
    # header handling if the layout differs.
    with open("smiles.csv", 'w', newline='') as file_1:
        with open("can_smiles.csv", 'w', newline='') as file_2:
            with open("eml_canonical.csv", 'r') as file_3:
                csvreader = csv.reader(file_3)
                writer_2 = csv.writer(file_2)
                writer_1 = csv.writer(file_1)
                next(csvreader, None)  # skip the (assumed) header row
                for row in csvreader:
                    writer_1.writerow([row[1]])
                    writer_2.writerow([row[2]])

def get_uipac_name_from_smiles(smiles: list, can_smiles: list):
    smiles_uipac_names = []
    can_smiles_uipac_names = []
    # Translate only the first 10 molecules (fewer if the files are shorter).
    n = min(10, len(smiles), len(can_smiles))
    for i in range(n):
        smiles_uipac = translate_forward(smiles[i])
        can_smiles_uipac = translate_forward(can_smiles[i])
        # Keep each input next to the name predicted for it.
        smiles_uipac_names.append({
            "smile": smiles[i],
            "UIPAC_name": smiles_uipac
        })
        can_smiles_uipac_names.append({
            "can_smile": can_smiles[i],
            "UIPAC_name": can_smiles_uipac
        })
        print(i+1, "done out of ", n, " currently at ", (i+1)/n * 100, "% done")
    return smiles_uipac_names, can_smiles_uipac_names

# separate the smiles and can-smiles into different csv files from the eml_canonical.csv
separate_smiles_and_can_smiles_into_separate_files()
smiles = read_data_from_file("smiles.csv")
can_smiles = read_data_from_file("can_smiles.csv")
smiles_output, can_smiles_output = get_uipac_name_from_smiles(
    smiles, can_smiles)
write_data_to_json_file("smiles_output.json", smiles_output)
write_data_to_json_file("can_smiles_output.json", can_smiles_output)

Here are the inputs and results from the STOUT repository code (the one I ran locally):

GemmaTuron commented 1 year ago

Hi @samuelmaina !

Thanks for the good work! Before moving on to week 3 tasks, can you please explain the model a bit more? What does "a BLEU score of about 90% and a Tanimoto similarity index of more than 0.9 according to those who trained it" mean? And can you tell us if the output of the model you ran on your system and the model from the Hub are the same, rather than just pasting the whole outcomes? You can choose one or two smiles as an example, perhaps.

samuelmaina commented 1 year ago

Thanks. I mentioned that the Ersilia hub model was producing different results from the local model, but maybe it slipped past you because of my phrasing. Sorry for the text output, I thought you needed some comparison. The STOUT model is simply used to translate smiles into their IUPAC names and vice versa. In fact, the model has only two functions for doing exactly this. The model uses a machine translation engine, a special form of natural language processing that translates between languages using neural networks. The BLEU (BiLingual Evaluation Understudy) score is a measure of how close a machine translation is to a reference translation. The reference is not machine generated; a professional translates a piece of text, and the model is scored on how much of its output overlaps with that reference. The model's BLEU score of about 90% means that its predicted IUPAC names overlap almost entirely with the reference names for the given smiles; it is not exactly the percentage of names translated correctly. A Tanimoto similarity index of 0.9 means that the structures corresponding to the translated names were about 90% similar to the true compounds.
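
To make the two metrics concrete, here is a minimal Python sketch. It is not the evaluation code used by the STOUT authors: the character-level tokenization, the helper names (`bleu`, `tanimoto`) and the toy feature sets are assumptions chosen for illustration. BLEU is computed here as the geometric mean of the modified n-gram precisions (n = 1 to 4) multiplied by a brevity penalty, and Tanimoto similarity as the size of the intersection of two feature sets divided by the size of their union.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU for one candidate and one reference (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity between two sets of features."""
    union = features_a | features_b
    if not union:
        return 0.0
    return len(features_a & features_b) / len(union)


# Two names that differ in a single character still score high, but below 1.
print("BLEU:", round(bleu(list("propan-1-ol"), list("propan-2-ol")), 3))
# Toy fingerprints: the integers stand in for structural features.
print("Tanimoto:", tanimoto({1, 2, 3}, {2, 3, 4}))
```

In practice the fingerprint sets would come from a cheminformatics toolkit rather than hand-written integers; the point is only that both scores measure overlap, not the share of exactly correct answers.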

samuelmaina commented 1 year ago

The model does not produce any error rate or confidence level for the input data. The BLEU score was computed by the model creators, and they repeat the BLEU test periodically to check accuracy on new compounds.

samuelmaina commented 1 year ago

Regarding the internal implementation: the model first converts the compounds into tokens using a natural language tokenizer. The tokens are classified into compounds and bonds. The tokens are then fed into a neural network, which makes the prediction.

GemmaTuron commented 1 year ago

Hi @samuelmaina

Thanks! Whenever you refer to scores etc. it is best to explain them, especially if you are using acronyms, as you don't know whether the reader will know about them or not! Can you do this last check before moving on to week 3 tasks: can you tell us if the output of the model you ran on your system and the model from the Hub are the same, rather than just pasting the whole outcomes? You can choose one or two smiles as an example, perhaps. Thanks!

samuelmaina commented 1 year ago

The Ersilia hub model produced different results from the model that I ran locally. I ran 10 test smiles through the two models, and the IUPAC names were different for all 10 samples. I will show the results for 2 smiles.

  1. Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 smiles

    • Local model output: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
    • Ersilia hub model output: [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol
    • I looked up the correct IUPAC name for this smiles on the Protein Data Bank in Europe site (which is curated and run by professionals), and their name corresponds to the result from the local model, which means that the Ersilia model is wrong. You can check the smiles section of that entry to confirm that the structure corresponds to the current smiles.
    • There is a slight difference in the first chiral center and in the position of the cyclopropylamino group on the purine ring. The local model specifies that the first chiral center is "S" (left-handed), while the Ersilia model specifies that it is "R" (right-handed). The local model places the cyclopropylamino group at the 6th position, while the Ersilia model places it at the 4th position.
  2. C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 smiles

    • Local model output: (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
    • Ersilia hub model output: (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[,6.010,14]pentadeca-7,16-dien-14-ol
    • The Ersilia model is wrong. The correct name can be found at the Protein Data Bank in Europe.
    • Both names list six chiral centers, four in the "S" configuration and two in the "R" configuration, but they name the ring system differently and so disagree on the locants. The local model gives 3S, 8R, 9S, 10R, 13S and 14S, while the Ersilia model gives 1S, 2S, 5S, 10R, 11R and 14S.

For these two smiles, the Ersilia model has two minor misplacements in the naming, which produce erroneous IUPAC names. As of now, I can't tell where the minor errors in the Ersilia model come from.
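
A comparison like the one above can be partly automated. The following is a minimal sketch, assuming the names start with a parenthesised block of R/S descriptors as in the two examples; the helper names `stereodescriptors` and `compare_names` are made up for illustration and are not part of STOUT or Ersilia.

```python
import re


def stereodescriptors(name):
    """Return the first parenthesised block of R/S descriptors, e.g. ['1S', '4R']."""
    match = re.search(r"\((\d+[RS](?:,\d+[RS])*)\)", name)
    return match.group(1).split(",") if match else []


def compare_names(local_name, hub_name):
    """Summarise whether two predicted names agree, and where the descriptors differ."""
    local = stereodescriptors(local_name)
    hub = stereodescriptors(hub_name)
    return {
        "identical": local_name == hub_name,
        "local_descriptors": local,
        "hub_descriptors": hub,
        "same_descriptors": local == hub,
    }


# The two names reported for the first smiles in this thread.
local_name = "[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol"
hub_name = "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
print(compare_names(local_name, hub_name))
```

A string comparison only shows that the names differ; deciding which one is chemically correct still needs a trusted reference, as done above.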

GemmaTuron commented 1 year ago

Hi @samuelmaina !

This would be due to a small update in the STOUT version. Can you tell us which version of STOUT you are running now and which one is in the Hub? You can check by activating the relevant conda environments.

samuelmaina commented 1 year ago

The local model is version 2.0.5, but Ersilia does not show the model version. I have tried to follow the GitHub link in the Ersilia model description, but it points to the current git directory of the STOUT repo. I think it's because Ersilia hasn't fetched the latest code from GitHub.

(STOUT) samuelmayna@SAM:~/stout_project$ pip show stout-pypi
Name: STOUT-pypi
Version: 2.0.5
Summary: STOUT V2.0 - Smiles TO iUpac Translator Version 2.0
Author: Kohulan Rajan
License: MIT
Location: /home/samuelmayna/miniconda3/envs/STOUT/lib/python3.8/site-packages
Requires: jpype1, pystow, tensorflow, unicodedata2

ersilia model --version:

(ersilia) samuelmayna@SAM:~/stout_project$ ersilia card smiles2iupac
    "Identifier": "eos4se9",
    "Slug": "smiles2iupac",
    "Status": "Ready",
    "Title": "STOUT: SMILES to IUPAC name translator",
    "Description": "Small molecules are represented by a variety of machine-readable strings (SMILES, InChi, SMARTS, among others). On the contrary, IUPAC (International Union of Pure and Applied Chemistry) names are devised for human readers. The authors trained a language translator model treating the SMILES and IUPAC as two different languages. 81 million SMILES were downloaded from PubChem and converted to SELFIES for model training. The corresponding IUPAC names for the 81 million SMILES were obtained with ChemAxon molconvert software.\n",
    "Mode": "Pretrained",
    "Input": [
    "Input Shape": "Single",
    "Task": [
    "Output": [
    "Output Type": [
    "Output Shape": "Single",
    "Interpretation": "IUPAC name of a specific SMILES",
    "Tag": [
        "Chemical notation",
        "Chemical language model"
    "Publication": "",
    "Source Code": "",
    "License": "MIT",
    "Contributor": "carcablop"
samuelmaina commented 1 year ago

Maybe you could use automatic redeployment (DevOps) once the GitHub code has been integrated and tested, so that the two stay in sync.

GemmaTuron commented 1 year ago

Hi @samuelmaina

To check the version of a Python package, you need to activate the conda environment and then, for example, call conda list. This will tell you which version of STOUT Ersilia is running. You can also check the requirements of the Ersilia model; what you pasted above is simply the metadata file, which does not contain versioning history.
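
The same check can also be done from inside Python once the right environment is active. This is a small sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is an assumption for illustration, and `STOUT-pypi` is the distribution name shown by pip earlier in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this inside each activated environment and compare the results.
for name in ("STOUT-pypi", "tensorflow"):
    print(name, installed_version(name) or "not installed in this environment")
```

The answer depends on which environment's interpreter runs the script, so it has to be run once per environment.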

samuelmaina commented 1 year ago

Thanks for the correction. Ersilia is running version 2.0.1. I activated the model's environment using its code and ran conda list.

GemmaTuron commented 1 year ago

Thanks @samuelmaina

We will update this model in the near future when we incorporate the reverse translation from IUPAC to smiles. Let's move to week 3 tasks!

samuelmaina commented 1 year ago

That would be really good as the model would be utilized fully.

samuelmaina commented 1 year ago

Open is a toolkit for molecular simulation. It can be used either as a stand-alone application for running simulations, or as a library you call from your own code. It provides a combination of extreme flexibility, openness, and high performance (especially on recent GPUs) that make it truly unique among simulation codes.


Python, C, C++, and Fortran


Conda env Python 3.x

SUMMARY AND RELEVANCE TO ERSILIA'S MISSION Molecular simulations can accelerate drug discovery by reducing the time and cost associated with traditional experimental approaches. By using simulations to screen large libraries of compounds, researchers can quickly identify promising drug candidates for further testing and development, thereby streamlining the drug discovery process. In drug discovery, molecular simulations can help researchers to design and optimize drug candidates that target specific proteins or other biomolecules involved in disease pathways. For infectious and neglected diseases this can be particularly useful, because many of these diseases are caused by pathogens that have evolved to evade traditional drug therapies, which makes them hard for researchers to combat. Simulations can be run quickly and many models can be built, which will accelerate the path to finding cures for these diseases. Simulation can also be used to reduce drug side effects: it can help to identify and optimize small molecule drug candidates that bind their targets with high affinity and specificity, thereby reducing the likelihood of side effects and increasing efficacy. The model supports a variety of programming languages, so it will have a large audience in Ersilia.

TASK Simulation


TAGS Simulation Drug research

samuelmaina commented 1 year ago

@GemmaTuron Should I add more details?

GemmaTuron commented 1 year ago

Hi @samuelmaina !

Thanks, a few comments:

Looking forward to the next articles!

samuelmaina commented 1 year ago

Thanks very much for the feedback. I will be more careful with the model descriptions next time.

samuelmaina commented 1 year ago

@GemmaTuron, I found this model. It does not come with a trained model, but its code seems to produce a very promising model for Drug Interaction Prediction. I was thinking that Ersilia could use the code to train the model and then serve it. Should I include it?

GemmaTuron commented 1 year ago

Yes, if the data is available we can retrain it.

samuelmaina commented 1 year ago




SumGNN is a trainer for Drug-Drug Interaction (DDI) prediction. It incorporates a knowledge summarization graph neural network, an improvement on the traditional KG (Knowledge Graph) networks used in present baseline trainers. Models produced by SumGNN achieved a pharmacological effect prediction score 5.57% better than other trainers. DDI prediction is critical in determining the side effects of drugs in people with pre-existing conditions. The pharmacological effect prediction score is used to measure the adversity of the interaction between two or more drugs. The DDI model can be used to rule out some medicines for a patient, or to suggest changes in a drug's chemical composition to suit a patient. This will accelerate drug discovery by offering faster and more accurate reaction feedback.


The datasets above are used to train a prediction model in the trainer.




Chemistry Simulation

samuelmaina commented 1 year ago


RXNFP - chemical reaction fingerprints


rxnfp is a model that generates reaction fingerprints (reaction classes that group reactions with similar reagents and mechanisms). Fingerprints help scientists communicate complicated chemical reactions. The rxnfp model generates reaction fingerprints quickly, whereas traditional fingerprinting is tedious and slow. The model also allows visual clustering of data and fast similarity searches among chemical reactions.

The model allows scientists to share knowledge faster and to study reactions faster thanks to the similarities it reveals. This will accelerate drug discovery.

GemmaTuron commented 1 year ago

Thanks @samuelmaina !

Can you add those to the model suggestion list? And as next steps, while you prepare your final application, would you be up for trying to run the rxn-fingerprint? Let's see if it would be easy to incorporate in the Hub!

samuelmaina commented 1 year ago

I have added the model. I have one pending model, but I will run the rxn-fingerprint as soon as I am done with the third. Thank you.

samuelmaina commented 1 year ago




molscore is a model for automatically scoring de novo compounds. The scoring is used to evaluate de novo molecular designs when the models are shared among scientists. De novo designs are useful for exploring a broader chemical space, and they have improved therapy by yielding better drug candidates in a cost-effective and timely manner.




Scoring De Novo Designs

GemmaTuron commented 1 year ago

Hey @samuelmaina !

MolScore seems very useful for generative models in reinforcement learning (for example, REINVENT). Won't be able to add it to the Hub directly because it needs to be coupled to a generative model but I definitely will see if we can implement it for some of our projects, thanks!

If you have time, test the rxn-fps as next steps :)

samuelmaina commented 1 year ago

Thanks very much, madam @GemmaTuron. I am running the model and will give you the results when I am done. Thank you very much.

samuelmaina commented 1 year ago

@GemmaTuron I ran the rxn-fps model successfully, and it produced results as expected. I ran into git SSH issues when downloading the code from GitHub, but those can be solved with a Google (or Bing) search.

(rxnfp) samuelmayna@SAM:~/rxnfp_project$ python
[-2.0174951553344727, 1.760203242301941, -1.3323537111282349, -1.1095025539398193, 1.2254549264907837]

samuelmaina commented 1 year ago

Hello @GemmaTuron , Any task I can do?

GemmaTuron commented 1 year ago

Hi @samuelmaina !

It would be fantastic if you could share a bit more about the steps you took to run rxn so that we can reproduce it (like which packages you had to install, etc.).

GemmaTuron commented 1 year ago

Also @samuelmaina it would be great if you can have a look at this issue and let us know if the problems are persisting!

samuelmaina commented 1 year ago

For rxn, I used conda. I ran the steps outlined in the GitHub repo, which are:

conda create -n rxnfp python=3.6 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
git clone
cd rxnfp
pip install -e .

The rdkit module is used to handle compounds and their reactions. tmap is used for the visual presentation of compounds.

All the steps ran smoothly except for the git clone, which used SSH (a means of secure communication that ensures the source and the destination are legitimate). I didn't have SSH keys set up, but after researching blogs and articles the issue was resolved. One can instead use the normal (non-SSH) git clone, which does not have this problem. To run the starter code, run python with a file containing the example code from the README. To see other functionalities one should visit

samuelmaina commented 1 year ago

@GemmaTuron The final application asks for answers to specific questions. Do you have any Ersilia-specific questions? Also, could you please provide me with the project timeline for the internship?

GemmaTuron commented 1 year ago

Hi @samuelmaina ! I've added some info on the Slack channel for the application. Do you want to go ahead and try to incorporate rxn in the Ersilia Model Hub in parallel with preparing the application? Check the steps for it in our documentation and let me know!

samuelmaina commented 1 year ago

Yes @GemmaTuron, I will incorporate it. I will also run the model in the issue. Thank you very much.

samuelmaina commented 1 year ago

@GemmaTuron I am getting this error

Model API eos3ae7:predict did not produce an output
Traceback (most recent call last):
  File "/home/samuelmayna/eos/repository/eos3ae7/20230328090007_475C16/eos3ae7/artifacts/framework/code/", line 10, in <module>
    from chemvae.vae_utils import VAEUtils
  File "/home/samuelmayna/eos/repository/eos3ae7/20230328090007_475C16/eos3ae7/artifacts/framework/code/chemvae/", line 4, in <module>
    import yaml
ModuleNotFoundError: No module named 'yaml'

The model does not return an output, and there is a problem with the yaml dependency. I have tried multiple times to install yaml in ersilia, but it does not solve the issue. eos3ae7_fetch.log
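
When an install does not seem to take effect, it may help to check which interpreter is running and whether it can see the module. A minimal sketch follows; the helper name `missing_modules` is made up for illustration. Note that the `yaml` module is provided by the PyYAML distribution, so the package to install is `pyyaml`, and it has to be installed into the environment whose interpreter runs the model code, which may not be the `ersilia` environment.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the top-level module names that this interpreter cannot import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# The interpreter path shows which environment is actually being used.
print("Interpreter:", sys.executable)
print("Missing:", missing_modules(["yaml"]))
```

Running this with the same interpreter that executes the model shows whether the package landed in the right place.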

samuelmaina commented 1 year ago

I have also opened a model incorporation issue for the rxn-fingerprint model. Could you please take a look?

GemmaTuron commented 1 year ago

Hi @samuelmaina !

I've approved the rxn-fingerprint model, you can go ahead and start working on the repo. For eos3ae7, thanks for testing; could you add this information to the issue for that model so that we can pick it up?

samuelmaina commented 1 year ago

Sure. I will start right away.

samuelmaina commented 1 year ago

@GemmaTuron I am running the model in ersilia cli and I am getting this error:

Detailed error:
Model API eos6aun:run did not produce an output
Traceback (most recent call last):
  File "/home/samuelmayna/eos/repository/eos6aun/20230329133143_63657D/eos6aun/artifacts/framework/code/", line 10, in <module>
    from rxnfp.transformer_fingerprints import (
  File "/home/samuelmayna/eos/repository/eos6aun/20230329133143_63657D/eos6aun/artifacts/framework/code/rxnfp/", line 20, in <module>
    from .tokenization import (
  File "/home/samuelmayna/eos/repository/eos6aun/20230329133143_63657D/eos6aun/artifacts/framework/code/rxnfp/", line 12, in <module>
    from rdkit import Chem
ModuleNotFoundError: No module named 'rdkit'

The installation process has not thrown any errors, and I have included the necessary dependencies in the Dockerfile.

FROM bentoml/model-server:0.11.0-py37

RUN conda install -c rdkit rdkit=2020.03.3
RUN conda install -c tmap tmap
RUN pip install rxnfp

COPY . /repo

I have tried to put only one rdkit and one tmap, but it's not working. What could be the problem?

GemmaTuron commented 1 year ago

2 comments / suggestions see if that helps:

samuelmaina commented 1 year ago

@GemmaTuron Really sorry for the late reply. After long debugging, the problem turned out to be the -c option in conda install, which I had used thinking it would cache packages and reduce download time for new downloads. I was able to serve the model successfully, but I am having trouble predicting with the eml_canonical.csv data: it is returning None somewhere, and I am working on it. Any advice? From the fetch logs, the single input produced the required output, but for eml_canonical.csv I get this error.

    return_value = func(*args, **kwargs)
  File "/home/samuelmayna/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/", line 99, in wrapper
    return func(*args, **kwargs)
  File "/home/samuelmayna/ersilia/ersilia/cli/commands/", line 38, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/samuelmayna/ersilia/ersilia/core/", line 343, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/samuelmayna/ersilia/ersilia/core/", line 357, in api_task
    for r in result:
  File "/home/samuelmayna/ersilia/ersilia/core/", line 184, in _api_runner_iter
    for result in, output=output, batch_size=batch_size):
  File "/home/samuelmayna/ersilia/ersilia/serve/", line 330, in post
    results, output, model_id=self.model_id, api_name=self.api_name
  File "/home/samuelmayna/ersilia/ersilia/io/", line 283, in adapt
    df = self._to_dataframe(result)
  File "/home/samuelmayna/ersilia/ersilia/io/", line 229, in _to_dataframe
    output_keys_expanded = self.__expand_output_keys(vals, output_keys)
  File "/home/samuelmayna/ersilia/ersilia/io/", line 197, in __expand_output_keys
    t = self._guess_pure_dtype_if_absent(v)
  File "/home/samuelmayna/ersilia/ersilia/io/", line 181, in _guess_pure_dtype_if_absent
    return dtype["type"]
TypeError: 'NoneType' object is not subscriptable

samuelmaina commented 1 year ago

Solved the error above
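
The thread does not record how the error was solved. As a general illustration only, the sketch below reproduces the same kind of TypeError with plain Python objects and shows one way to set aside inputs whose output is None before building a table. The helper name `split_valid_outputs` and the sample values are hypothetical and are not part of Ersilia's code.

```python
def split_valid_outputs(inputs, outputs):
    """Pair each input with its output and separate the ones that returned None."""
    valid, failed = [], []
    for item, result in zip(inputs, outputs):
        if result is None:
            failed.append(item)
        else:
            valid.append((item, result))
    return valid, failed


# Indexing None raises the same error as in the traceback above.
dtype = None
try:
    dtype["type"]
except TypeError as error:
    print("TypeError:", error)

# Hypothetical batch: the second input produced no output.
valid, failed = split_valid_outputs(["CCO", "not-a-smiles"], [[0.1, 0.2], None])
print("valid:", valid)
print("failed:", failed)
```

Listing the failed inputs makes it easier to see which rows of a batch file are responsible when a single input works but the whole file does not.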