Isaakkamau commented 1 year ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

Isaakkamau commented 1 year ago

Issue: While installing the Ersilia Model Hub and testing the simplest model, I got this error: OSError: symbolic link privilege not held

Solution: I solved the error by running conda as administrator.

Isaakkamau commented 1 year ago

Issue 2: ersilia fetch retrosynthetic-accessibility

(ersilia) C:\Windows\System32\ersilia>ersilia fetch retrosynthetic-accessibility
⬇️  Fetching model eos2r5a: retrosynthetic-accessibility
Checking setup: 3.731s
 12%|████████████████████▏                                                                                                                                            | 1/8 [00:03<00:26,  3.73s/it]
🚨🚨🚨  Something went wrong with Ersilia 🚨🚨🚨

Error message:

Ersilia exception class:
ModelDeleteError

Detailed error:
Error occured while deleting model eos2r5a

Hints:
Check that the model is actually installed in your local device:
$ ersilia serve eos2r5a

If this error message is not helpful, open an issue at:
 - https://github.com/ersilia-os/ersilia
Or feel free to reach out to us at:
 - hello[at]ersilia.io

If you haven't, try to run your command in verbose mode (-v in the CLI)
 - You will find the console log file in: C:\Users\Isaac\eos/current.log
 12%|████████████████████▏                                                                                                                                            | 1/8 [00:18<02:06, 18.06s/it]

Solution: I have not solved this issue yet. Any suggestions are welcome!

I am using Windows 10 and installing with conda (run as administrator).

Isaakkamau commented 1 year ago

Hello, everything works now that I have switched my OS from Windows to Ubuntu, but I would still love to know how to solve the error above.

Isaakkamau commented 1 year ago

I have successfully run the prediction from the command line!

Now I would like to run predictions in my web browser using the Predict API, 127-0-0-1-50487. @GemmaTuron, how can I convert my input from .csv to a .json format that model eos2r5a can understand?

GemmaTuron commented 1 year ago

Hi @Isaakkamau, welcome to Ersilia :) We do not support development on Windows, only Linux and macOS, which is why you were unable to run it. If you have a Windows machine, please use the Windows Subsystem for Linux. As for the online API, we seldom use it, but it is good that you want to test it. The input should probably be something like: {"smiles": "CCCNOCCC"}
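A minimal sketch of the CSV-to-JSON conversion asked about above. It assumes the CSV has a header column named `smiles` (hypothetical; adjust it to the actual file) and that the API accepts objects shaped like the {"smiles": ...} example in this reply. The function name `csv_to_json_records` is illustrative and the payload shape is unconfirmed, so treat this as a starting point rather than Ersilia's documented format.

```python
import csv
import io
import json


def csv_to_json_records(csv_text, column="smiles"):
    """Turn CSV text with a SMILES column into a JSON array of {"smiles": ...} objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the column is missing or empty.
    records = [{"smiles": row[column].strip()} for row in reader if row.get(column)]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    example = "smiles\nCCCNOCCC\nCCCOCCC\n"
    print(csv_to_json_records(example))
```

If the input file names its column differently, pass that name as `column`.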

Isaakkamau commented 1 year ago

Hello @GemmaTuron, thanks! The online Predict API has also worked for me. If anybody else wants to try it, you can refer them to me!

Does Ersilia also support Docker Model deployment?

GemmaTuron commented 1 year ago

Hi @Isaakkamau

We are actually setting up the infrastructure to move all models to Docker containers, still work in progress, see issue #546 where Miquel and I are working

Isaakkamau commented 1 year ago

Noted! @GemmaTuron, I am quite familiar with model deployment using Docker and FastAPI. If any help is needed, please let me know.

Isaakkamau commented 1 year ago

Hello Ersilia,

My name is Isaak Kamau from Nairobi, Kenya. I am a graduate of the University of Nairobi with a degree in Mathematics (Statistics), and I work with tools such as TensorFlow, PyTorch, Docker, and FastAPI. As someone who comes from an underrepresented background in the tech industry, I am impressed by Ersilia's efforts to create a diverse and inclusive workplace. I am excited by the prospect of working in an environment where my unique perspective and experiences will be valued and leveraged to drive innovation and progress.

Moreover, as an Outreachy participant, I chose to join Ersilia because I believe I have the skills required to contribute to its projects. Ersilia's mission of bridging the gap between developing and developed countries in medical research is a noble one, and I would love to take part and help create an even more diverse and inclusive tech industry.

Thank you, Ersilia. I hope to give my best.

Isaakkamau commented 1 year ago

Hello @GemmaTuron, I now want to move on to the week 2 contributions. Should I assign myself any model from the Ersilia Model Hub, or do you have a specific one that I should try?

Isaakkamau commented 1 year ago

Hello @GemmaTuron, I have started my week 2 contribution. I decided to start with maip-malaria-surrogate, since malaria is still one of the diseases that seriously affects us here in Africa. Here is the output I am getting:

(ersilia) isaakmwangi@DESKTOP-O9Q8PKD:~$ cd ersilia
(ersilia) isaakmwangi@DESKTOP-O9Q8PKD:~/ersilia$ ersilia fetch maip-malaria-surrogate
⬇️  Fetching model eos2gth: maip-malaria-surrogate
Checking setup: 1.010s
Preparing model: 6.107689142227173s
Getting model: 14.707326173782349s
Packing model: 323.38919615745544s
Checking if model needs to be integrated to a tool: 0.0036895275115966797s
Getting model card: 1.2480900287628174s
Checking that autoservice works: 8.294064044952393s
Sniffing model: 31.705535411834717s
100%|█████████████████████████████████████████████████████████████████████████████████████| 8/8 [06:30<00:00, 48.86s/it]
Fetching eos2gth done in time: 0:06:30.863719s
👍 Model eos2gth fetched successfully!
(ersilia) isaakmwangi@DESKTOP-O9Q8PKD:~/ersilia$ ersilia serve maip-malaria-surrogate
🚀 Serving model eos2gth: maip-malaria-surrogate

   URL: http://127.0.0.1:44521
   PID: 609
   SRV: conda

👉 Available APIs:
   - predict

💁 Information:
   - info
(ersilia) isaakmwangi@DESKTOP-O9Q8PKD:~/ersilia$ ersilia api -i 'CCCOCCC'
{
    "input": {
        "key": "POLCUAVZOMRGSN-UHFFFAOYSA-N",
        "input": "CCCOCCC",
        "text": "CCCOCCC"
    },
    "output": {
        "score": 4.5375906733581415
    }
}
(ersilia) isaakmwangi@DESKTOP-O9Q8PKD:~/ersilia$

Isaakkamau commented 1 year ago

Now I am testing Ersilia maip-malaria-surrogate with a CSV file that has two columns with headers. You can get the dataset here: https://chembl.gitbook.io/malaria-project/input-data-file

Isaakkamau commented 1 year ago

Now asking MAIP to write its output to a .csv file:

ersilia api predict -i MAIP_example.csv -o Ersilia_MAIP_Prediction.csv

Isaakkamau commented 1 year ago

Here is the .csv output file

Ersilia_MAIP_Prediction.csv

Isaakkamau commented 1 year ago

Now I want to repeat the prediction using the online API offered here: https://www.ebi.ac.uk/chembl/maip/

Isaakkamau commented 1 year ago

@GemmaTuron Thanks for the clarification. Let me select another model from the proposed model list.

DhanshreeA commented 1 year ago

For future reference, MAIP-Malaria-Surrogate is not part of the contribution period. Thanks for the update @Isaakkamau

Could you please update the issue with the model you have picked up now and the issues you are facing with it? Remember to post the complete stack trace, as well as your understanding of it! Thanks.

Isaakkamau commented 1 year ago

Hello @DhanshreeA Noted Thanks!

Isaakkamau commented 1 year ago

Hello @GemmaTuron and @DhanshreeA. After carefully going through the four proposed Ersilia models for the week 2 contributions, I am most interested in the STOUT and SARS-CoV2 activity models.

I would love to do both of them, starting with STOUT.

I chose STOUT because, as an aspiring Machine Learning Engineer, I have mostly worked with computer vision models such as CNNs. This project gives me a chance to explore other approaches, such as language translation and language understanding using Neural Machine Translation (NMT).

I would also love to work on the SARS-CoV2 activity model because I recently did a similar project at Udacity, where I built a machine learning application that lets users train their own models, set their preferred hyperparameters, and make predictions using command-line arguments. The SARS-CoV2 activity model also seems to require some knowledge of command-line applications. Here is the project I did: https://github.com/Isaakkamau/Udacity-Create-Your-Own-Image-Classifier

Isaakkamau commented 1 year ago

When installing STOUT I got this error:

FileExistsError: [Errno 17] JVM DLL not found: Define/path/or/set/JAVA_HOME/variable/properly

The error occurs because the Java Virtual Machine (JVM) dynamic-link library (DLL) cannot be found. This usually happens when the system is unable to locate the JVM, which is required for running Java programs.

To solve the error, first make sure that Java is installed on your system by running the java -version command in your terminal or command prompt.

If it is not installed, as in my case, the following commands solve the problem:

sudo apt update

sudo apt install default-jdk

After the installation is complete, you can verify that Java is installed correctly by running the following command:

java -version

The above steps were able to solve my error
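The same check can be done from Python before installing STOUT. This is a minimal sketch using only the standard library; the function name `java_status` is illustrative, and it only reports what the current process can see (a `java` executable on the PATH and the JAVA_HOME variable named in the error message). It does not install or configure anything.

```python
import os
import shutil
import subprocess


def java_status():
    """Report whether a `java` executable and JAVA_HOME are visible to this process."""
    java_path = shutil.which("java")
    status = {
        "java_on_path": java_path,
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "version": None,
    }
    if java_path:
        # `java -version` writes its banner to stderr rather than stdout.
        result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
        lines = (result.stderr or result.stdout).strip().splitlines()
        status["version"] = lines[0] if lines else None
    return status


if __name__ == "__main__":
    for key, value in java_status().items():
        print(f"{key}: {value}")
```

If `java_on_path` is None, Java is probably not installed or not on the PATH, which matches the situation described above.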

GemmaTuron commented 1 year ago

Hi @Isaakkamau

Thanks for sharing your previous work! If the STOUT model is now installed and working in your system, please complete week 2 tasks and move onto week 3! Let's make sure these are tackled before looking into the SARS-CoV models. You can also check adedeji's Git Issue for more info about the Stout model testing!

Isaakkamau commented 1 year ago

Hello, @GemmaTuron

I have successfully installed the STOUT model. For the "Run predictions for the EML" task, rather than predicting the entire EML dataset, I have written a simple Python script that lets users predict only the SMILES they are interested in.

Here is my script for making predictions:

from STOUT import translate_forward

# SMILES to IUPAC name translation.
# Add any number of SMILES strings to this list.
SMILES = [
    "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
    "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
    "CC(=O)Nc1sc(nn1)[S](N)(=O)=O",
]

# Translate each SMILES string and print the predicted IUPAC name.
for smiles in SMILES:
    iupac_name = translate_forward(smiles)
    print(f"IUPAC name of {smiles} is: {iupac_name}")

Any number of SMILES can be added to the SMILES list above for SMILES-to-IUPAC name translation.
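For a longer list, such as the whole EML, the SMILES could be read from a file instead of being typed into the script. The sketch below is one possible way to do that, not part of the original script: it assumes a CSV with a header column named `smiles` (hypothetical), and it takes the translation function as an argument so that it runs without STOUT installed. The helper name `translate_smiles_file` and the stand-in translator are illustrative only.

```python
import csv
import io


def translate_smiles_file(handle, translate, column="smiles"):
    """Return (smiles, name) pairs for every non-empty row in a CSV file handle."""
    results = []
    for row in csv.DictReader(handle):
        smiles = (row.get(column) or "").strip()
        if smiles:
            results.append((smiles, translate(smiles)))
    return results


if __name__ == "__main__":
    # Stand-in translator so the sketch runs without STOUT installed;
    # pass STOUT's translate_forward instead for real predictions.
    def fake_translate(smiles):
        return f"<name for {smiles}>"

    demo = io.StringIO("smiles\nCC(=O)Nc1sc(nn1)[S](N)(=O)=O\n")
    for smiles, name in translate_smiles_file(demo, fake_translate):
        print(f"IUPAC name of {smiles} is: {name}")
```

With STOUT installed, `translate_smiles_file(open("my_smiles.csv"), translate_forward)` would be the intended call, where `my_smiles.csv` is a placeholder file name.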

Here is my prediction for the 3 SMILES that I have obtained from the EML:

IUPAC name of Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 is: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
IUPAC name of C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 is: (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
IUPAC name of CC(=O)Nc1sc(nn1)[S](N)(=O)=O is: N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide

I now want to move on to the last part of week 2: Compare results with the Ersilia Model Hub implementation!

Please let me know if the results are satisfying.

Isaakkamau commented 1 year ago

Hello @GemmaTuron

Here are my results for the week 2 task "Compare results with the Ersilia Model Hub implementation!":

I have run the STOUT model from the Ersilia Model Hub on the same three examples I used with the original code. Here are the predictions:

1. Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1

(ersilia) hl@hl-laptop:~/ersilia$ ersilia api -i 'Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1'
{
    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
        "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
    },
    "output": {
        "outcome": [
            "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
        ]
    }
}
(ersilia) hl@hl-laptop:~/ersilia$

2. C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5

(ersilia) hl@hl-laptop:~/ersilia$ ersilia api -i 'C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5'
{
    "input": {
        "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
        "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
        "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
    },
    "output": {
        "outcome": [
            "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol"
        ]
    }
}
(ersilia) hl@hl-laptop:~/ersilia$

3. CC(=O)Nc1sc(nn1)[S](N)(=O)=O

(ersilia) hl@hl-laptop:~/ersilia$ ersilia api -i 'CC(=O)Nc1sc(nn1)[S](N)(=O)=O'
{
    "input": {
        "key": "BZKPWHYZMXOIDC-UHFFFAOYSA-N",
        "input": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O",
        "text": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O"
    },
    "output": {
        "outcome": [
            "N-[5-[amino(dioxo)-\u03bb6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide"
        ]
    }
}

Prediction Comparison Table:

| SMILES | Original Code Prediction | Ersilia Prediction |
| --- | --- | --- |
| SMILE 1 | `[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol` | `[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol` |
| SMILE 2 | `(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol` | `(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol` |
| SMILE 3 | `N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide` | `N-[5-[amino(dioxo)-\u03bb6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide` |

Results explanation, taking example 1:

There are some discrepancies between the two SMILES-to-IUPAC predictions for SMILE 1. From my research, the two IUPAC names appear to describe the same compound, but they differ in the stereochemistry of the cyclopentene ring.

I also ran predictions for the same SMILES with other tools, such as https://app.syntelly.com/smiles2iupac, and across all the predictions I ran, the original code prediction appears to give the most probable answer.

Note:

We cannot draw a final conclusion about the models' accuracy from only the 3 examples I have used. The above is just a demonstration of the kind of process we can use.
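For a larger set of examples, the comparison could be automated. The sketch below is only a crude text comparison using Python's standard `difflib`, applied to the two SMILE 1 names from the table above: it reports whether the names are identical and how similar the strings are. It says nothing about chemical equivalence, and the function name `compare_names` is illustrative.

```python
from difflib import SequenceMatcher

# Names copied from the comparison table above (SMILE 1).
ORIGINAL = "[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol"
ERSILIA = "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"


def compare_names(name_a, name_b):
    """Return (exact_match, similarity), where similarity is a 0..1 string ratio."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    return a == b, SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    exact, similarity = compare_names(ORIGINAL, ERSILIA)
    print(f"exact match: {exact}, text similarity: {similarity:.2f}")
```

A high text similarity with no exact match, as here, flags a pair worth inspecting by hand; a structure-based check would be needed to decide whether two names really describe the same molecule.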

@GemmaTuron That's it for week 2. Feel free to let me know if you have any questions or comments about my week 2 contribution.

Best regards Isaak

GemmaTuron commented 1 year ago

Hi @Isaakkamau !

Thanks for the work, very well documented! Let's tackle week 3 tasks then!

Isaakkamau commented 1 year ago

@GemmaTuron On it! and thank you for the kind comment

Isaakkamau commented 1 year ago

WEEK 3:

Suggest a new model and document it (1):

Model Name:

MoLFormer

Publication:

Large-Scale Chemical Language Representations Capture Molecular Structure and Properties

Source Code:

https://github.com/IBM/molformer

Description:

The paper above discusses the use of machine learning models to predict molecular properties accurately and quickly in drug discovery and material design. However, the vast chemical space and the limited availability of property labels make supervised learning challenging. To address this, the authors present MoLFormer.

MoLFormer is trained on small molecules represented as SMILES strings. The model architecture uses an efficient linear attention mechanism and relative positional embeddings, with the goal of learning a meaningful and compressed representation of chemical molecules.

License:

Apache

GemmaTuron commented 1 year ago

Good suggestion @Isaakkamau !

Can you add it to our model suggestion list? thanks!

Isaakkamau commented 1 year ago

You're welcome, @GemmaTuron. Should I add it to Ersilia's suggestion list using this Form, or should I open a new model request issue?

Isaakkamau commented 1 year ago

Hello @GemmaTuron I have added it here: https://github.com/ersilia-os/ersilia/issues/658, Please have a look if it's okay that way

Zainab-ik commented 1 year ago

> Welcome @GemmaTuron Should I add it to Ersilia's suggestion list using this Form or I open a new model request issue?

Hi, I think she meant the form, not opening a model request issue.

Isaakkamau commented 1 year ago

Suggest a new model and document it (2):

Model Name:

controlled-peptide-generation

Slug:

Peptide autoencoder

Publication:

Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics

Other Publications:

https://www.nature.com/articles/s41551-021-00689-x

Source Code:

https://github.com/IBM/controlled-peptide-generation

Description:

The model uses deep learning classifiers trained on an informative latent space of molecules, modeled using deep generative autoencoders, to provide an efficient computational method for generating antimicrobials with desired attributes.

Summary

The authors of the "Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics" paper combine deep generative autoencoder models with deep learning classifiers to create a computational method called Controlled Latent attribute Space Sampling (CLaSS). CLaSS is then used to design non-toxic antimicrobial peptides (AMPs). The method successfully generated 20 AMPs. The paper concludes by suggesting that the method can be used to accelerate the discovery of potent and selective broad-spectrum antimicrobials.

Data

The repo above ships short versions of the data files required by the data curation code, in the data_processing/data directory.

License:

Apache

Isaakkamau commented 1 year ago

> > Welcome @GemmaTuron Should I add it to Ersilia's suggestion list using this Form or I open a new model request issue?
>
> Hi, I think she meant the form, not opening a model request issue.

Thank you @Zainab-ik, I have submitted it.

GemmaTuron commented 1 year ago

Hi @Isaakkamau !

Indeed, I meant the list; I've closed the model request issue. Could you please provide a bit more information on model 2?

Thanks!

Isaakkamau commented 1 year ago

Hello @GemmaTuron

Sure, I have added a summary of Model 2. Please check it out.

Thanks

Isaakkamau commented 1 year ago

Suggest a new model and document it (3):

Model Name:

Graphormer

Publication:

Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets

Graphormer Usage Instructions:

https://graphormer.readthedocs.io/en/latest/

Graphormer Project Website:

https://www.microsoft.com/en-us/research/project/graphormer/

Model Description:

Graphormer is a deep learning Python package for training custom models for molecular modeling tasks. The Graphormer architecture has been adapted for 3D molecular dynamics simulation, which allows the model to perform well on both 2D and 3D molecular graph modeling datasets. Researchers and developers can use it to accelerate research and applications of AI in molecular science, such as drug discovery. Graphormer provides example scripts for training your own model on several datasets through a command-line interface. It also provides pre-trained models that researchers can easily evaluate and fine-tune.

Source Code:

https://github.com/microsoft/Graphormer

License:

MIT License

Isaakkamau commented 1 year ago

Hello @GemmaTuron, I also found these other models:

LiGAN: deep generative models for structure-based drug discovery (a Python package that also depends on C++/CUDA).
MegaMolBART: a deep learning model for small molecule drug discovery and cheminformatics based on SMILES.
LIMO: Latent Inceptionism for Targeted Molecule Generation (molecules with desired properties).

If they might be interesting to Ersilia, I can also add them to the suggestion list!

Thanks

GemmaTuron commented 1 year ago

Hi @Isaakkamau !

Thanks for these suggestions! Can you please add Graphormer to the Ersilia model suggestion list? I really like LiGAN and LIMO; we cannot add them natively to the Hub because they deal with protein structures, but we will keep them in mind. As for MegaMolBART, we already have other pretrained language models that we use in model training; we actually rely on MolBART, which is the basis of MegaMolBART.

Once you have added the models in the list, please focus on writing up your final application

Isaakkamau commented 1 year ago

Hi @GemmaTuron, noted! I have added Graphormer to the suggestion list.

I am now starting the final application.

ersilia-os / ersilia

✍️ Contribution period: <Isaak Mwangi Kamau> #620
