✍️ Contribution period: Zainab Ashimiyu-Abdusalam

Zainab-ik commented 1 year ago

Week 1 - Get to know the community

[x] Join the communication channels
[x] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

Zainab-ik commented 1 year ago

Week 1 Tasks

Ersilia Installation successful

Model fetch successful

Systems Specification

Host: Windows 11
Operating system: Ubuntu (via WSL)

GemmaTuron commented 1 year ago

Hi @Zainab-ik

Welcome to Ersilia, great to have you here. Please let us know which system you are using, and whenever possible, refrain from pasting screenshots as they are more difficult for mentors to review! Thanks

Zainab-ik commented 1 year ago

Thank you @GemmaTuron, it's great to be here too. All comments noted; I'll update my last comment.

Zainab-ik commented 1 year ago

Motivation Statement

Hi everyone, my name is Zainab, and I was a graduate intern at the Nigerian Institute of Medical Research. I also have a BSc in Pharmacology, Therapeutics, and Toxicology. I'm interested in Ersilia because it sits at the intersection of machine learning and drug discovery. I came across Ersilia while browsing the available Outreachy projects.

What drew me to Ersilia was the intersection of Artificial Intelligence (AI) and my background in Pharmacology. I developed a passion for Computational Pharmacology and wrote my thesis on predicting the antiviral properties of a natural plant using computational tools. I also belong to CaresAI, a cancer drug discovery research group with AI methodology as its focus.

I also resonate with Ersilia's mission of tailoring its models to researchers in low- and middle-income countries (LMICs), which improves the quality of their research and its outcomes. I am a lover of open science, and given my background and skill set, Ersilia is just the right open-science project for me.

Ersilia's projects span malaria and other infectious diseases, which are my core research interests. My skill set includes ML, Python, Linux, and bioinformatics tools. Contributing to Ersilia would be a way of giving back to the community, and that's what open source is about.

Learning while giving back

I intend to take on more AI-in-drug-discovery projects, and working with Ersilia will improve my skills and give me confidence. I plan to further my studies in Computational Pharmacology, and I believe Ersilia will be a great help in getting there.

Thank you for reading my motivation letter. I'm excited to contribute to Ersilia while also collaborating with my peers to make science accessible to all.

Zainab-ik commented 1 year ago

Hi, @GemmaTuron, Week 1 contribution completed. Kindly review. Thanks.

GemmaTuron commented 1 year ago

Thanks @Zainab-ik ,

Welcome to the contribution period!

Zainab-ik commented 1 year ago

Thanks, @GemmaTuron.

Zainab-ik commented 1 year ago

Week 2

Task 1 - Select a model

Model Selected - SARS-CoV2 activity (Image Mol)

The ImageMol model takes molecular images as its input dataset. Having worked in a molecular lab, I find it interesting to see results generated as images being used to train a model, and I therefore chose this model. I'm interested in its application to unlabeled images of drug-like molecules, since data annotation for compounds is tedious. The therapeutic effect of a drug can be evaluated in different ways, with molecular properties, drug synthesis, and low toxicity being important factors in the assessment. The ImageMol model can also enhance the molecular docking step of drug discovery, since the binding can be viewed. Lastly, I have previously worked with SARS-CoV-2 molecules, using molecular images of the target protein with the corresponding compound of interest, so this is a good model to cement that knowledge. I also get to learn about computer vision. You can read more about the model here. Thank you.

@GemmaTuron

Zainab-ik commented 1 year ago

Model Installation Process

These are the detailed steps I followed to install the model on my system.

The process was done on a Linux machine (WSL - Ubuntu).

Firstly, install the environment where the model is going to run, that is, the CUDA 10.1 GPU environment. I followed this guide to install the environment. However, I ran into an error while installing CUDA 10.1 using sudo apt install cuda-10-1 after setting up the package repository. Error message:

E: unable to locate package cuda-10-1

I then used the command sudo apt install nvidia-cuda-toolkit and successfully installed the required CUDA GPU environment.

Create a Conda environment for the model by running conda create -n imagemol python=3.7.3 and activate it using conda activate imagemol
Install some packages:

RDKit using conda install -c rdkit rdkit
PyTorch extension packages using pip install torch-cluster torch-scatter torch-sparse torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.4.0%2Bcu101.html. However, I got an error while running this:

Collecting torch-cluster
  Downloading torch_cluster-1.6.0.tar.gz (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.4/43.4 kB 218.2 kB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 36, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-6k22__7f/torch-cluster_4930bb5f17ce4e118a9da372c9ef78c0/setup.py", line 8, in <module>
          import torch
      ModuleNotFoundError: No module named 'torch'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Because of the ModuleNotFoundError: No module named 'torch', I installed PyTorch manually using conda install pytorch-cpu torchvision-cpu -c pytorch. However, I ran into another error:

Using cached torch_cluster-1.6.0.tar.gz (43 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 36, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-m6s5wd8n/torch-cluster_c5c4961158204cfc86d0996e50312235/setup.py", line 10, in <module>
          from torch.__config__ import parallel_info
      ImportError: cannot import name 'parallel_info' from 'torch.__config__' (/home/zainab_ik/miniconda3/envs/imagemol/lib/python3.7/site-packages/torch/__config__.py)
      [end of output]

I'm currently working on resolving the error.
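The first traceback shows that the setup.py of torch-cluster imports torch while it is being built, so torch has to be importable before the extension packages are installed. A minimal sketch of that ordering check follows; the helper names torch_is_importable and install_plan are hypothetical and only illustrate the idea. It does not address the second error, which points to a mismatch between the installed torch and the extension version.

```python
import importlib.util
from typing import List

# Extension packages from the pip command above; their setup.py imports torch.
EXTENSIONS = ["torch-cluster", "torch-scatter", "torch-sparse", "torch-spline-conv"]


def torch_is_importable() -> bool:
    """Return True if the active environment can locate the torch package."""
    return importlib.util.find_spec("torch") is not None


def install_plan() -> List[str]:
    """Return the install order: torch first when it is missing, then the extensions."""
    steps = [] if torch_is_importable() else ["torch"]
    return steps + EXTENSIONS


if __name__ == "__main__":
    for name in install_plan():
        print(name)
```

Running this inside the activated imagemol environment prints torch first only when it still needs to be installed.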

@GemmaTuron @DhanshreeA

GemmaTuron commented 1 year ago

Hi @Zainab-ik !

This model has two versions: one for TRAINING the model, and one for simply running predictions on pre-trained models (Finetuning section of the README). Training requires a CUDA GPU, but finetuning does not. Do not install the CUDA GPU environment because it won't work on most systems; follow the instructions to run predictions only. I hope this helps! I think Ahmed was working on this model as well!

Zainab-ik commented 1 year ago

Thank you @GemmaTuron, I'll make an update.

Zainab-ik commented 1 year ago

Model Installation Process (contd)

As stated above, I ran into a couple of errors while installing the Torch packages.

The error above was resolved by installing Torch separately.

When installing the other PyTorch packages (torch-cluster, torch-scatter, torch-sparse, torch-spline-conv), I encountered these errors:

ERROR: Failed building wheel for torch-sparse
ERROR: Failed building wheel for torch-cluster
ERROR: Failed building wheel for torch-scatter
ERROR: Failed building wheel for torch-spline-conv

This occurred due to package version clashes. From what I observed, the PyTorch packages are sensitive to the versions of their dependencies. To install a given package, you might need to downgrade or upgrade a package it depends on.

How to resolve the error

Since these are PyTorch Geometric (PyG) packages, I read the installation guide on the PyG website here. I installed the packages directly via conda using conda install pyg -c pyg, since pip install kept throwing this error:

(imagemol) zainab_ik@DESKTOP-E8NO3DG:~$ pip install torch-spline-conv
Collecting torch-spline-conv
  Using cached torch_spline_conv-1.2.1.tar.gz (13 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 36, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-vgddfgna/torch-spline-conv_6514989605ba443d8a0107c604ef6a92/setup.py", line 8, in <module>
          from torch.utils.cpp_extension import BuildExtension
      ModuleNotFoundError: No module named 'torch.utils.cpp_extension'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

With conda, I was able to install the packages together. I then ran the required installation command again, pip install torch-cluster torch-scatter torch-sparse torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.4.0%2Bcu101.html, and the output was:

Requirement already satisfied: torch-cluster in ./miniconda3/envs/imagemol/lib/python3.7/site-packages (1.6.0)
Requirement already satisfied: torch-scatter in ./miniconda3/envs/imagemol/lib/python3.7/site-packages (2.1.0)
Requirement already satisfied: torch-sparse in ./miniconda3/envs/imagemol/lib/python3.7/site-packages (0.6.16)
Requirement already satisfied: torch-spline-conv in ./miniconda3/envs/imagemol/lib/python3.7/site-packages (1.2.0)
Requirement already satisfied: scipy in ./miniconda3/envs/imagemol/lib/python3.7/site-packages (from torch-sparse) (1.7.3)
Requirement already satisfied: numpy<1.23.0,>=1.16.5 in ./miniconda3/envs/imagemol/lib/python3.7/site-packages (from scipy->torch-sparse) (1.21.5)

I confirmed that the packages were installed using grep torch. Finally, I was able to install all the PyTorch packages.
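The same check can be done from Python instead of grep. This is a minimal sketch, assuming Python 3.8 or later for importlib.metadata (the imagemol environment above uses 3.7.3, where the importlib_metadata backport would be needed); installed_versions is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict


def installed_versions(prefix: str = "torch") -> Dict[str, str]:
    """Map each installed distribution whose name starts with `prefix` to its version."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with no readable name.
        if name and name.lower().startswith(prefix):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    for name, version in sorted(installed_versions().items()):
        print(name, version)
```

Comparing the printed versions against the "Requirement already satisfied" lines above is a quick way to confirm that pip and conda agree on what is installed.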

To complete Installation

Activate the imagemol environment using conda activate imagemol
Clone the repository with git clone https://github.com/HongxinXiang/ImageMol.git
Install the model's remaining requirements, such as tqdm, by running pip install -r requirements.txt

And finally, the model is installed.

@GemmaTuron, I'd like to ask whether it matters if the installation is done via conda or pip, since the model's installation instructions specify pip.

Zainab-ik commented 1 year ago

I'm currently working on running predictions with the model, as suggested.

Zainab-ik commented 1 year ago

Finetuning Models

Running predictions of SARS-CoV-2 activity with ImageMol

The SARS-CoV-2 dataset comprises different activity datasets, namely:

ACE2 enzymatic activity
SARS CoV Pseudotyped particle entry (VeroE6 tox counterscreen)
Human fibroblast toxicity
Spike-ACE2 protein-protein interaction (TruHit Counterscreen)
SARS-CoV Pseudotyped particle entry
3CL enzymatic activity
MERS Pseudotyped particle entry (Huh7 tox counterscreen)
TMPRSS2 enzymatic activity
SARS-CoV-2 cytopathic effect (host tox counterscreen)
HEK293 cell line toxicity
MERS Pseudotyped particle entry
SARS-CoV-2 cytopathic effect (CPE)
Spike-ACE2 protein-protein interaction (AlphaLISA)

Once the model has been installed and the environment activated, you can finetune the model by running:

python finetune.py --gpu ${gpu_no} \
                   --save_finetune_ckpt ${save_finetune_ckpt} \
                   --log_dir ${log_dir} \
                   --dataroot ${dataroot} \
                   --dataset ${dataset} \
                   --task_type ${task_type} \
                   --resume ${resume} \
                   --image_aug \
                   --lr ${lr} \
                   --batch ${batch} \
                   --epochs ${epoch}

Edit the command to fit the specific dataset you want to use.

I'll be starting with the ACE2 enzymatic activity dataset. The finetuning should work with this command:

python finetune.py --gpu 0 \
                   --save_finetune_ckpt 1 \
                   --log_dir ./logs/toxcast \
                   --dataroot ./datasets/finetuning/SARS-CoV-2 \
                   --dataset ACE2_enzymatic_activity \
                   --task_type classification \
                   --resume ./ckpts/ImageMol.pth.tar \
                   --image_aug \
                   --lr 0.5 \
                   --batch 64 \
                   --epochs 20

However, I encountered several errors, including:

finetune.py : command not found, --image-aug: command not found, lr: command not found
python: can't open file 'finetune.py': No such file or directory
Assertion error: a particular path is not a directory

Trial: For every command not found error, I tried removing the offending argument, but the same error came back for another argument. For the assertion error, I copied the path from my local system and modified it, but it still threw the same error. I also tried the SARS-CoV-2 assay that came with the repository, 3CL_enzymatic_activity, and encountered the same assertion error. I'm redoing the installation and cloning, and will then try the finetuning again.
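A "command not found" error naming a flag such as lr usually means the shell treated a continuation line as a separate command, for example when a trailing backslash is followed by a space or the line breaks are altered on paste. One way to sidestep this, sketched here as an assumption rather than as the repository's recommended method, is to build the argument list in Python; build_finetune_command is a hypothetical helper, and the flag values are the ones from the command above.

```python
import shlex
from typing import List


def build_finetune_command(dataset: str, lr: float, batch: int, epochs: int) -> List[str]:
    """Return the finetune.py invocation as an argument list (no shell continuations)."""
    return [
        "python", "finetune.py",
        "--gpu", "0",
        "--save_finetune_ckpt", "1",
        "--log_dir", "./logs/toxcast",
        "--dataroot", "./datasets/finetuning/SARS-CoV-2",
        "--dataset", dataset,
        "--task_type", "classification",
        "--resume", "./ckpts/ImageMol.pth.tar",
        "--image_aug",
        "--lr", str(lr),
        "--batch", str(batch),
        "--epochs", str(epochs),
    ]


cmd = build_finetune_command("ACE2_enzymatic_activity", 0.5, 64, 20)
# Print the command on one line so it can be inspected or pasted safely.
print(shlex.join(cmd))
```

The list can then be passed to subprocess.run(cmd, check=True, cwd=...) with cwd set to the cloned repository, which also avoids the "can't open file 'finetune.py'" error that appears when the command is run from another directory.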

Update

After re-installation, I figured out that the error was due to the path. I followed the instructions again, placing the pre-trained model in the ckpts/ directory and my downloaded dataset in the finetuning/ directory rather than the toy/ directory used in the repository. I then finetuned the SARS-CoV-2 model on the ACE2_enzymatic_activity dataset again.
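Since the failures came down to paths, a small pre-flight check can report what is missing before finetune.py is launched. This is a sketch only: the directory layout is inferred from the --dataroot, --dataset, and --resume arguments used above, and missing_paths is a hypothetical helper, not part of the ImageMol repository.

```python
from pathlib import Path
from typing import List


def missing_paths(repo_root: str, dataset: str) -> List[str]:
    """Return the expected files and directories that do not exist under repo_root."""
    root = Path(repo_root)
    expected = [
        root / "finetune.py",                                       # script to run
        root / "ckpts" / "ImageMol.pth.tar",                        # pre-trained checkpoint
        root / "datasets" / "finetuning" / "SARS-CoV-2" / dataset,  # assumed dataset folder
    ]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    for path in missing_paths(".", "ACE2_enzymatic_activity"):
        print("missing:", path)
```

An empty result means all three locations exist relative to the directory the check was run from.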

Result

The result of finetuning on the ACE2_enzymatic_activity dataset is:

final results: highest_valid: 0.742, final_train: 0.461, final_test: 0.361

Model evaluation

This was done by running:

python evaluate.py --dataroot ./datasets/finetuning/SARS-CoV-2 \
                   --dataset ACE2_enzymatic_activity \
                   --task_type classification \
                   --resume ./ckpts/ImageMol.pth.tar \
                   --batch 128

Error encountered

While running this, I encountered a RuntimeError: Error(s) in loading state_dict for ResNet: Unexpected key(s) in state_dict, which I attributed to a version and dependency clash. I edited the evaluate.py file to tolerate the mismatch by changing the model-loading code to load the checkpoint's ['state_dict'] entry with strict=False. I then ran the evaluation again.
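In PyTorch, load_state_dict with strict=False skips keys that do not match instead of raising, so any weights stored under skipped keys are simply not loaded. Before relying on it, it can help to list which keys differ. The sketch below does this on plain lists of key names; compare_keys and the example key names are hypothetical, and in practice the two lists would come from model.state_dict().keys() and the checkpoint's 'state_dict' entry.

```python
from typing import Dict, Iterable, List


def compare_keys(model_keys: Iterable[str], checkpoint_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Report keys the model expects but the checkpoint lacks, and the reverse."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_set - ckpt_set),      # model layers left at their initial values
        "unexpected": sorted(ckpt_set - model_set),   # checkpoint weights that get ignored
    }


# Hypothetical key names for illustration only.
model_keys = ["conv1.weight", "fc.weight", "fc.bias"]
checkpoint_keys = ["conv1.weight", "fc.weight", "fc.bias", "extra_head.weight"]
report = compare_keys(model_keys, checkpoint_keys)
print(report)
```

If the "missing" list includes layers the evaluation depends on, the reported metric reflects partly uninitialised weights rather than the trained model.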

Result

[test] rocauc: 41.4%

This is a very low value, below the 50% expected from random guessing, which indicates poor model performance.

Kindly review, @DhanshreeA @GemmaTuron.

GemmaTuron commented 1 year ago

Hi @Zainab-ik

Great, I was going to point out that the path might be incorrect, good catch! Thanks for trying the finetuning of the dataset, fantastic work. Since ACE2 is not implemented in the Hub, can you simply try to run predictions for a few molecules with the model you finetuned and, with that, tick off the tasks from week 2 and focus on week 3?

Good job!

Zainab-ik commented 1 year ago

Thanks @GemmaTuron, I'll work on that.

Zainab-ik commented 1 year ago

Running Predictions for a few molecules using the SARS-CoV-2 finetuned Model

Running predictions works the same way as evaluating the model on a few molecules with ImageMol. The model takes as input a list of SMILES with index numbers and labels (0, 1), accompanied by the corresponding images, since it is an image model.
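As an illustration of that input, the following sketch writes a small table with an index, a SMILES string, and a binary label per row. The column names index, smiles, and label are assumptions and should be checked against the CSV files shipped with the repository; the two molecules (ethanol and aspirin) and their labels are placeholders, not assay data.

```python
import csv
import io
from typing import Dict, List

# Placeholder rows: the labels here are arbitrary and carry no experimental meaning.
rows = [
    {"index": 0, "smiles": "CCO", "label": 0},
    {"index": 1, "smiles": "CC(=O)Oc1ccccc1C(=O)O", "label": 1},
]


def to_csv(records: List[Dict[str, object]]) -> str:
    """Serialise the records as CSV text with an index,smiles,label header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["index", "smiles", "label"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


text = to_csv(rows)
print(text)
```

Note that the placeholder rows deliberately include both labels, since a file with a single label value cannot be scored with ROC AUC.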

While working with the Essential Medicines List (eml_canonical.csv), I had to download the corresponding images for a few SMILES to run predictions. However, I encountered this error:

ValueError: Only one class is present in y_true. ROC AUC score is not defined in that case.

I attributed this to the following reasons:

The default batch size for prediction is 128, and I was running just five molecules.

The model does not process the corresponding images.

I had to replicate the molecules into a folder just like the rest of the bioassays because it was throwing a directory error.
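The wording of the ValueError points at the labels themselves: ROC AUC compares the scores of positive examples against those of negative examples, so it needs at least one of each. A plausible reading, which is an assumption here, is that every row in the small test file carried the same label. The sketch below is a plain-Python pairwise version of the metric (not the library routine the evaluation script uses) that makes the requirement visible.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        # Without both classes there are no pairs to compare.
        raise ValueError("Only one class is present in y_true.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With this definition, a value near 0.5 corresponds to random ranking, and a value below 0.5 means positives tend to be scored lower than negatives.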

Debugging phase

I tried the following to bypass the errors and make predictions on a few molecules.

I used an already processed dataset from the SARS-CoV-2 bioassays, but reduced it to just 30 to 40 molecules.
I edited the default batch size in the evaluate.py script from 128 to a smaller value to enable prediction on a few molecules. In my runs, the default batch size of 128 prevented predictions on fewer than 128 molecules.

Outcome of running predictions on few molecules

Using the evaluation command above, predictions were run on 128 molecules while debugging, and on 50 or fewer molecules while tuning hyperparameters such as the batch size.

The result, with a rocauc of 75%, suggests that a higher percentage of the molecules have inhibitory activity against ACE2 in SARS-CoV-2. The only evaluation metric is the AUC value.

@GemmaTuron Kindly review. Also, while going through the Ersilia Model Hub, I noticed that the model eos4cxk has a similar implementation to the SARS-CoV-2 ImageMol model.

GemmaTuron commented 1 year ago

Hi @Zainab-ik !

Good job on working that model out! With that I think you can tick off all week 2 tasks as completed (since the model is not yet implemented in the EMH for comparison) and we can move onto week 3 tasks, looking forward to seeing your model suggestions!

Zainab-ik commented 1 year ago

Hi @GemmaTuron, Thank you very much. I'm looking forward to reviewing literature and suggesting models.

Zainab-ik commented 1 year ago

Week 3 Task --- Model Proposal [1]

Model Title

hERG blocker: Virtual screening of DrugBank database for hERG blockers using topological Laplacian-assisted AI models

Model Description --- Prediction of hERG Blockers

This model integrates molecular embeddings with deep neural network algorithms and gradient boosting trees to predict potential hERG blockers. Blockage of the hERG channel causes cardiotoxicity and leads to the withdrawal of drugs from clinical trials. Therefore, it's important for a therapeutic compound to be screened for cardiotoxicity in the drug development pipeline. This model was trained using 8641 compounds, mostly FDA-approved, experimental, or investigational drugs, from the DrugBank database. The model was further tested on large datasets and delivered an accuracy of 0.981. The molecular features were generated using transformer NLP techniques.

Model Identifier

Slug: hERG blocker

Model Characteristics

Input: compound - smiles format.
Task: classification
Tag: Toxicity, hERG, cardiotoxicity,
Output: Score(0,1), Probability

References

License

The model contained within this package is licensed under an MIT license.

Zainab-ik commented 1 year ago

Model Proposal [2]

Model Title

DeepDrugCoder (DDC): Heteroencoder for molecular encoding and de novo generation

Model Description

DeepDrugCoder is a generative model that uses a conditional recurrent neural network (cRNN) to generate active molecules with specified properties for a given condition. The model can generate huge datasets of novel molecules for further assessment, such as ADME and toxicity. The model aggregates selected molecular descriptors and a bioactivity label (0,1) and generates SMILES strings focused on the targeted properties. Featurization was performed using molvecgen.

Model Identifier

Slug: DeepDrugCoder (DDC)

Model Characteristics

Input: compound
Task: Generative
Tag: molecule generator, neural network
Output: compound

References

License

The model contained within this package is licensed under an MIT license.

Zainab-ik commented 1 year ago

Model Proposal [3]

Model Title

DRKG model - Drug Repurposing Knowledge Graph for Covid-19

Model Description

The DRKG is a comprehensive knowledge graph that connects different entities, such as genes, compounds, diseases, biological processes, side effects, and symptoms, as entity pairs. It uses the Knowledge Graph Embedding (KGE) machine learning methodology for evaluation and analysis. Of the interactions in this knowledge graph, the main focus is the compound-disease interaction for COVID-19 drug repurposing. The pretrained DRKG model for COVID-19 drug repurposing uses the KGE models to predict whether existing drugs successfully inhibit certain pathways related to COVID-19 host proteins.

Model Identifier

Slug: DRKG

Model Characteristics

Input: compound
Task: classification
Tag: knowledge graph, drug repurposing, drug-target interaction, DTI.
Output: score

References

License

The model contained within this package is licensed under an Apache License, Version 2.0

GemmaTuron commented 1 year ago

Hi @Zainab-ik !

Good model suggestion to start with! Can you add it to our list? Looking forward to the next models!

Zainab-ik commented 1 year ago

Thank you very much @GemmaTuron. I've updated the list with the model and added one more model suggestion.

Zainab-ik commented 1 year ago

Hi @GemmaTuron!

I've completed the model suggestions. Kindly review. Also, your mentorship is really appreciated. Thank you for the exposure to the AI world of drug discovery. It has been very insightful, and it's a career path I want to pursue as a research scientist.

GemmaTuron commented 1 year ago

Hi @Zainab-ik !

Thanks for the model suggestions! The generative model is interesting but would probably be difficult to incorporate into the Hub as is, so we'll leave it for the moment! Can you add the COVID-19 model to the list? Thanks!

Zainab-ik commented 1 year ago

Hi @GemmaTuron,

Thank you. I've added it to the list.

GemmaTuron commented 1 year ago

@Zainab-ik

Would you be able to tackle this issue to test the model just incorporated by @emmakodes? Thanks!

Zainab-ik commented 1 year ago

I'll do that. @GemmaTuron, this issue has been completed.

Zainab-ik commented 1 year ago

Hi @GemmaTuron!

I have submitted my final application. I look forward to making more meaningful contributions to this field. It's been a wonderful experience and your mentorship is really appreciated for the successful contribution. Many thanks.

I'll be closing this issue now.
