ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Boaz Leleina #830

Closed boazleleina closed 8 months ago

boazleleina commented 9 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

boazleleina commented 9 months ago

I joined the Slack channel. After being accepted to the Outreachy program, I selected the Ersilia project after reading up on all the projects, as I connected with their research, especially the application of ML to disease research. As a long-time supporter of ML for good, I feel this project is doing wonderful work, and I hope to be part of this amazing journey.

I also successfully created this issue🎉

boazleleina commented 9 months ago

As I am running Windows OS on my machine, I will be using WSL with Ubuntu 22.04.3 LTS

ubuntuversion.log

I also successfully ran ersilia and installed all the required prerequisites, following the instructions found here.

GITLFS.log

I installed and activated ersilia, and I also installed the Isaura data lake from the prerequisites.

Isaura.log

I installed the Ersilia Python Package by running:

git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
pip install -e .

developer.log

I installed Docker, ran it, and confirmed a container was running after I fetched the model.

boazleleina commented 9 months ago

Running a simple model on ersilia

I checked the model catalog using: ersilia catalog

catalog.log

I ran the following commands, and they completed without errors:

ersilia -v fetch eos3b5e
ersilia serve eos3b5e

serve.log

However, after running the command ersilia -v run -i "CCCC"

I ran into the error: TypeError: object of type 'NoneType' has no len()

runmodel.log
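For context, this is Python's generic error when `len()` is applied to `None`; a minimal sketch (not Ersilia's actual code) showing how an upstream function that returns `None` triggers exactly this message:

```python
# Minimal sketch (not Ersilia's code): the same TypeError appears whenever
# len() is called on a value that an upstream call left as None.
def fetch_result():
    return None  # stands in for a lookup that silently failed

result = fetch_result()
try:
    print(len(result))
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()
```

In other words, the error usually points at some earlier step returning nothing rather than at `len()` itself.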

leilayesufu commented 9 months ago

Hi, the base code should not be changed, so those changes shouldn't be made. Kindly take this into account.

carcablop commented 9 months ago

Hello @boazleleina, welcome to Ersilia! For the contribution period, take into account the following:

  • Do not paste images; you can attach a .log file, which makes it easier to review errors.
  • Please provide a description of your system and development environment: Ubuntu version, Python version, and Conda version.
  • Do not modify the base code. Create another environment and install ersilia there.

Thanks.

boazleleina commented 9 months ago

Hello @boazleleina Welcome to Ersilia!. For the contribution period, take into account the following:

  • Do not paste images, you can attach a .log file, it is easier to review errors.
  • Please provide a description of your system and development environment, Ubuntu version, Python version and Conda version.
  • Do not modify the base code. Create another environment and install ersilia.

Thanks.

This is well noted @carcablop, thank you. I am making the required changes and editing my issue

boazleleina commented 9 months ago

Hi, The base code should not be changed. So the changes shouldn't be made. Kindly take this into account

Thank you @leilayesufu, I will take this into account when making my changes. I appreciate it.

boazleleina commented 9 months ago

Hello @boazleleina Welcome to Ersilia!. For the contribution period, take into account the following:

  • Do not paste images, you can attach a .log file, it is easier to review errors.
  • Please provide a description of your system and development environment, Ubuntu version, Python version and Conda version.
  • Do not modify the base code. Create another environment and install ersilia.

Thanks.

I made the requested changes and included the log files. After reinstalling the environment and running the code again, I ran into the same error.

leilayesufu commented 9 months ago

Hi, did you specify your Python version? conda create -n ersilia python=3.7

boazleleina commented 9 months ago

Hi, did you specify your Python version? conda create -n ersilia python=3.7

Yes, I did. I also reinstalled the environment and started from scratch, but it still generated the same error.

boazleleina commented 9 months ago

I have been facing the issue of the error: TypeError: object of type 'NoneType' has no len()

Steps to recreate the error:

  1. conda create -n ersilia python=3.10
  2. conda activate ersilia
  3. python -m pip install isaura==0.1
  4. git clone https://github.com/ersilia-os/ersilia.git
  5. cd ersilia
  6. pip install -e .
  7. ersilia -v fetch eos3b5e
  8. ersilia serve eos3b5e

Up to this point, all the commands ran as expected without producing any errors; the main issue was with the command: ersilia -v run -i "CCCC"

Below is the log file of the error: runmodel.log

DhanshreeA commented 9 months ago

Hi @boazleleina, thanks for your efforts. It seems other users running Ersilia on Ubuntu within WSL are facing a similar issue. While Ersilia supports Python versions >=3.7, and it should work with a reasonably old version of conda, there may be some issues specific to WSL. Could you do the following and report your progress here?

  • Reinstall conda and reinstall Ersilia with Python 3.7
  • Try installing Ersilia with a version of Python greater than 3.7

boazleleina commented 9 months ago

Hi @boazleleina, thanks for your efforts. It seems other users running Ersilia on Ubuntu within WSL are facing a similar issue. While Ersilia supports Python versions >=3.7, and it should work with a reasonably old version of conda, there may be some issues specific to WSL. Could you do the following and report your progress here?

  • Reinstall conda and reinstall Ersilia with Python 3.7
  • Try installing Ersilia with a version of Python greater than 3.7

I retried the steps using Miniconda and Python 3.10, but the error persists. I first deleted everything and created a new environment with Miniconda before retrying, and I hit the same error.

boazleleina commented 9 months ago

The error seems to be persistent on WSL: TypeError: object of type 'NoneType' has no len()

I retried the steps mentioned above using Miniconda with both Python 3.10 and Python 3.7. Below is the log file from the run with Miniconda and Python 3.7: logfile.log

carcablop commented 9 months ago

Hi @boazleleina. Try uninstalling the isaura package and running the model again. It seems that isaura is causing that error.

boazleleina commented 9 months ago

After following instructions from @carcablop, the model ran successfully. Steps I took:

  1. Uninstalled Isaura: pip uninstall isaura==0.1

  2. Fetched the model: ersilia -v fetch eos3b5e

  3. Served the model: ersilia serve eos3b5e

  4. Ran the model: ersilia -v api run -i "CCCC"

The above steps worked on WSL with Python 3.7.

runmodelubuntu.log

boazleleina commented 9 months ago

Task 4: Write a Motivation Statement to work at Ersilia

MOTIVATION TO WORK AT ERSILIA

I am writing this to express my excitement about the opportunity to be a part of the Ersilia team. I believe there is a strong alignment between my journey in AI/ML and the objectives of Ersilia. I am a graduate software engineer and AI/ML practitioner with experience working with machine learning and deep learning models. Throughout my time in the field, I have been a staunch believer in AI for good, as I believe we can use this technology to bring positive change to society and move us toward a better future.

I have had the privilege of working on projects aligned with this goal. Working on a machine learning project to track the travel patterns of pastoralist communities and their animals in Northern Kenya was one of my most fulfilling experiences. This effort was essential in helping authorities plan and allocate resources, such as security and medication, along their routes. This first-hand encounter made clear to me the ability of technology to significantly improve people's lives.

Growing up in a marginalized community with limited resources, I have always believed in the responsibility we all share to support those in more vulnerable positions. This principle is at the core of Ersilia's mission to support research in Low-Income Countries (LICs). It deeply resonates with me, as I have seen the challenges faced by communities in such regions. I am genuinely excited about the prospect of contributing to research efforts that can improve healthcare and create a more equitable world. I have also witnessed the devastating impact of inadequately researched diseases like Rift Valley Fever on my own family, through the loss of my grandfather, and on the greater community. The lack of sufficient research and data on these diseases has left many vulnerable. I therefore wholeheartedly believe in the potential of the work being done by Ersilia to combat future outbreaks and save lives.

I am eagerly looking forward to the research that will be conducted, knowing that it can make a significant difference, not only now but for generations to come. If granted the opportunity to intern at Ersilia, I am fully committed to continuing the outstanding work of creating AI/ML models focused on disease research. I will bring to the table not only my technical skills but also a deep sense of purpose and dedication. I understand that the work done here has the power to transform lives and leave a lasting impact on society. I am determined to continue my career in artificial intelligence after the internship is over; my ultimate objective is to create a setting where AI brings people together to raise the standard of living. I am truly excited about the chance to work with like-minded people and make a real difference.

boazleleina commented 9 months ago

Task 5: Submit your first contribution to the Outreachy site

I made my first contribution to the Outreachy website and the contribution was recorded

boazleleina commented 9 months ago

Week 2 - Install and run an ML model

Task 1: Select a model from the suggested list

I selected Plasma Protein Binding (IDL-PPBopt)

Reason for choosing the model

The focus of this research is on understanding and manipulating how chemical compounds bind to human plasma proteins. This is a critical aspect of pharmacology and drug development because it affects the distribution and activity of drugs in the human body. In a world with so many drug options, each carrying its own risks and side effects, this research helps determine how each drug affects the body and guides the development of better drugs to combat diseases in the future.

The research employs deep learning techniques, a subset of machine learning, to accomplish the prediction and optimization goals, which is an interesting challenge for me. I look forward to working with the AttentiveFP algorithm, an attention-based model designed for the prediction of molecular properties, particularly in cheminformatics and drug discovery. It leverages a deep learning architecture to capture and analyze the structural information of chemical compounds in a way that is suitable for property prediction tasks. AttentiveFP is built on graph neural networks, which is quite interesting because chemical compounds can be represented as graphs, where atoms are nodes and chemical bonds are edges. This can be seen in some of the outputs in the notebook, which makes the model easier to understand and visualize.
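As a toy illustration of that graph view (hand-written for butane, "CCCC"; not RDKit or AttentiveFP output), atoms become nodes and bonds become edges:

```python
# Butane ("CCCC") as a tiny molecular graph: atoms are nodes, bonds are edges.
atoms = ["C", "C", "C", "C"]
bonds = [(0, 1), (1, 2), (2, 3)]

# Node degree = number of bonds per atom; the two terminal carbons have degree 1.
degree = [0] * len(atoms)
for a, b in bonds:
    degree[a] += 1
    degree[b] += 1

print(degree)  # [1, 2, 2, 1]
```

Graph neural networks like AttentiveFP pass information along exactly these edges to build up a representation of the whole molecule.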

boazleleina commented 9 months ago

Task 2: Install the model in your system

IDL-PPBOPT RUNNING STEPS🏃‍♂️💨

To clone this project onto my computer, I opened the location, which was my D:\ drive, in the Windows command prompt and ran: git clone https://github.com/Louchaofeng/IDL-PPBopt.git

The project has some package requirements that it needs to run, including: 📦

  • python 3.7
  • pytorch 1.5.0
  • openbabel 2.4.1
  • rdkit
  • scikit-learn
  • scipy
  • cairosvg
  • pandas
  • matplotlib

Because most of these requirements are older versions, I created a virtual environment using conda to run this project:

conda create --name IDLenv python=3.7
conda activate IDLenv

Installing Dependencies 🧩

  1. python 3.7 - Creating the conda environment running this Python version

  2. pytorch 1.5.0 - pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html

    • " I used the -f flag is used to specify the URL where the PyTorch wheel files for version 1.5.0 are hosted. I used this URL to download the appropriate wheel file for my Windows 10 system."
  3. openbabel 2.4.1 - pip install openbabel openbableerror.log

    • After still facing issues installing openbabel with pip, I ran the command below, which fixed the issue: conda install -c openbabel openbabel
    • While researching the openbabel issue, I found that running conda install -c conda-forge openbabel==2.4.1 is discouraged, as it forces updates of other modules in the environment, so it is safer to run the command above.
  4. rdkit - pip install rdkit

  5. scikit learn - pip install scikit-learn

  6. scipy - pip install scipy

  7. cairosvg - pip install cairosvg

  8. pandas - pip install pandas

  9. matplotlib - pip install matplotlib


Running our model 🔗

The model is located in the IDL-PPBopt.ipynb notebook, and this was the file I focused on. Running it produced the warning:

d:\ERSILIA MODEL\.conda\lib\site-packages\torch\random.py:42: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_numpy.cpp:77.) return default_generator.manual_seed(seed)

SOLVING CUDA ISSUE🔄

Steps I took to change this:🌓

  1. In the AttentiveFP directory, I made changes to the files:

    • AttentiveLayers_viz.py
    • AttentiveLayers.py
    • I removed all CUDA references in these files and changed them to CPU equivalents. For example, a call like x.cuda().sum() becomes x.sum(), so the tensors stay on the CPU.
  2. In the notebook, there are several references to the CUDA device; removing them enables the module to run on the CPU:

    • In the Related function cell, I made changes like: torch.cuda.LongTensor to torch.LongTensor
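The edits above were done by hand; a hypothetical sketch of the same mechanical replacement (the `to_cpu` helper and the sample line are my own illustration, not part of the repository):

```python
# Hypothetical helper mirroring the manual edits: retarget CUDA tensor
# constructors and drop .cuda() device moves so tensors stay on the CPU.
REPLACEMENTS = {
    "torch.cuda.LongTensor": "torch.LongTensor",
    ".cuda()": "",
}

def to_cpu(source: str) -> str:
    for old, new in REPLACEMENTS.items():
        source = source.replace(old, new)
    return source

line = "atom_index = torch.cuda.LongTensor(idx).cuda()"
print(to_cpu(line))  # atom_index = torch.LongTensor(idx)
```

A find-and-replace like this is brittle, of course; applying each substitution file by file and re-running the notebook, as described above, is what actually verified the change.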

Model Success🎉

The model successfully predicted the values:

predictedvalues.log

The graphs were generated by AttentiveFP, and I could see the structures of the different substructure molecules.

graphgenerated.log graphgeneratedimg.pdf

graph2generated.log graph2generatedimg.pdf

graph3generated.log graph3generatedimg.pdf

graph4generated.log graph4generatedimg.pdf

substructures.log

with open('Results.smi', 'w') as f:
    f.write('SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n')
    for i in range(len(r)):
        f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' +
                str(r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')

DhanshreeA commented 9 months ago

Hi @boazleleina thanks for the detailed updates! Could you comment on the difference between the model you ran locally and the one present within the hub? Specifically which substructures are different and/or missing between the two outputs?

boazleleina commented 9 months ago

Task 3: Run predictions for the EML


eml_canonical_output.log

eml_canonical_predictions.csv


Explanation of the outcome: The model predicts the likelihood of various drugs binding to proteins in human blood plasma using a deep learning model. The predicted values range between 0 and 1, with values closer to 1 indicating a higher probability of the drug binding to plasma proteins and values closer to 0 indicating a lower probability.

The model is a neural network used for a regression task, specifically AttentiveFP, which is built on a graph neural network with feedforward propagation and uses several activation functions, including linear, LeakyReLU, ReLU, and softmax.
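As a toy reading of such outputs (the drug names and values below are made up for illustration, not taken from the prediction file, and the 0.5 cutoff is arbitrary):

```python
# Made-up examples: PPB predictions near 1 mean strong plasma-protein
# binding, near 0 mean weak binding; 0.5 is an illustrative cutoff only.
predictions = {"drug_a": 0.92, "drug_b": 0.05}

for drug, p in predictions.items():
    label = "likely binds" if p >= 0.5 else "unlikely to bind"
    print(f"{drug}: {p:.2f} -> {label}")
```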

boazleleina commented 9 months ago

Install and run Docker!

pullppbmodel.log

After executing the docker container and opening the interactive environment, I ran predictions for the eml_canonical dataset successfully:

ersiliapredictionlog.log

key | input
--- | ---
FAQLAUHZSGTTLN-UHFFFAOYSA-N | [CaH2]
KRHYYFGTRYWZRS-UHFFFAOYSA-M | [F-]
ZCYVEMRRCGMTRW-UHFFFAOYSA-N | [I]
XLYOFNOQVPJJNP-UHFFFAOYSA-N | O
WCUXLLCKKVVCTQ-UHFFFAOYSA-M | [Cl-].[K+]
NLKNQRATVPKPDG-UHFFFAOYSA-M | [K+].[I-]
RWSOTUBLDIXVET-UHFFFAOYSA-N | S
FJKGRAZQBBWYLG-UHFFFAOYSA-M | N.N.[F-].[Ag+]
FAPWRFPIFSIZLT-UHFFFAOYSA-M | [Na+].[Cl-]

DhanshreeA commented 9 months ago

Great job @boazleleina thank you for the updates!

boazleleina commented 9 months ago

Week 3 - Propose new models


Task 1 - Suggest a new model and document it (1)

Model Name: Controlled peptide generation
Model Link 🔗: https://github.com/IBM/controlled-peptide-generation/tree/master
Model License 📑: Apache-2.0 license

This model uses the PyTorch framework and is derived from the research paper “Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics” by Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Sebastian Gehrmann, Flaviu Cipcigan, Vijil Chenthamarakshan, Hendrik Strobelt, Cicero dos Santos, Pin-Yu Chen, Yi Yan Yang, Jeremy Tan, James Hedrick, Jason Crain, and Aleksandra Mojsilovic, submitted on 22 May 2020 and revised on 26 Feb 2021 in the Machine Learning section of arxiv.org.

Paper Link🔗: https://arxiv.org/abs/2005.11248


Overview


Why the model is relevant to Ersilia

  1. With Ersilia’s goal of equipping laboratories in Low- and Middle-Income Countries with state-of-the-art AI/ML tools for infectious and neglected disease research, this model will enable laboratories in low-income areas to work on drug creation at low cost. Combining computational techniques with experimental validation can potentially accelerate the discovery and development of new drugs while significantly reducing cost.
  2. De novo therapeutic design allows for the creation of highly specific and customized therapies. Instead of modifying existing drugs, researchers can design molecules with precisely tailored properties to target specific diseases or conditions, e.g., diseases affecting their specific countries. This can lead to more effective treatments with fewer side effects.
  3. With the growing concern over drug-resistant diseases and the need for innovative solutions to address emerging health challenges (such as COVID-19 and other infectious diseases), research that explores novel approaches to drug design becomes increasingly relevant.
  4. The discovery of molecules that are effective against multidrug-resistant bacteria is particularly important. Multidrug-resistant pathogens pose a significant threat to public health, and finding new treatments for these infections is of utmost importance. This will allow labs that Ersilia works with to be at the forefront of this innovative drug-creation method.

Working with the model

boazleleina commented 9 months ago

Task 2 - Suggest a new model and document it (2)


Model name: PTransIPs
Model Link 🔗: https://github.com/StatXzy7/PTransIPs/tree/main
Model License 📑: None

This model, built on the PyTorch framework, is derived from the research paper “PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer” by Ziyang Xu and Haitian Zhong, submitted on 8 Aug 2023 and revised on 18 Aug 2023.

Paper Link 🔗: https://arxiv.org/abs/2308.05115


Overview

Why the model is relevant to Ersilia

  1. Identifying phosphorylation sites can lead to the discovery of new therapeutic targets for diseases. This knowledge can be used to develop potential drug targets.
  2. A deep understanding of phosphorylation has significant value in biomedical research, as it plays a crucial role in cellular processes and disease development.
  3. Pharmaceutical companies and researchers working on drug discovery, especially for diseases like COVID-19, can use the model’s findings to explore new drug targets related to phosphorylation.

Working with the model

boazleleina commented 9 months ago

Task 3 - Suggest a new model and document it (3)


Model name: Multitask-toxicity
Model Link 🔗: https://github.com/IBM/multitask-toxicity#accurate--clinical-toxicity-prediction-using-multi-task-deep-neural-nets-and-contrastive-molecular-explanations
Model License 📑: Apache-2.0 license

This model uses the PyTorch framework and is derived from the research paper “Accurate Clinical Toxicity Prediction using Multi-task Deep Neural Nets and Contrastive Molecular Explanations” by Bhanushee Sharma, Vijil Chenthamarakshan, Amit Dhurandhar, Shiranee Pereira, James A. Hendler, Jonathan S. Dordick, and Payel Das.

Paper Link🔗: https://arxiv.org/abs/2204.06614


Overview

Why the model is relevant to Ersilia

  1. Using predictive models to assess toxicity can reduce the reliance on animal and clinical testing. This is ethically important as it minimizes harm to animals and reduces the need for human clinical trials, making the drug development process more humane.
  2. Early toxicity prediction helps in reducing the cost of drug development. Clinical trials are expensive, and identifying toxicity issues before reaching this stage can lead to significant cost savings.
  3. The use of machine learning models provides a more efficient and rapid way to screen potential drug candidates. It accelerates the process of identifying compounds that are likely to be safe and effective.

Working with the model

DhanshreeA commented 9 months ago

Task 2 - Suggest a new model and document it (2)

Model name: PTransIPs
Model Link 🔗: https://github.com/StatXzy7/PTransIPs/tree/main
Model License 📑: None

This model, built on the Pytorch framework, is generated from the research paper “PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer” by Ziyang Xu, Haitian Zhong submitted on 8th Aug 2023 and revised on 18th Aug 2023

Paper Link 🔗: https://arxiv.org/abs/2308.05115

Overview

* PTransIPs focuses on identifying certain chemical changes in proteins. These chemical changes, called phosphorylation, play a big role in how our cells work and can affect various diseases.

* Finding and understanding these changes is important because it can help us learn more about how cells function and how diseases develop. It might even help us discover new ways to treat diseases

* The PTransIPs model works by looking at the building blocks of proteins (amino acids) and their arrangement. It also uses data from other big computer models that know a lot about proteins. All this information helps PTransIPs figure out where the phosphorylation happens in the protein.

* The research paper mentions examples of previous models like:

  * MusiteDeep2017, which employs a CNN model for the identification of kinase-specific phosphorylation sites . Following this, there was the introduction of DeepPhos, a model that also utilizes CNNs for identifying general phosphorylation sites. And the development of MusiteDeep2020, which utilizes a CapsNet for the identification of phosphorylated S/T and Y sites. Another model that is mentioned is, DeepIPs, a model that incorporates both CNN and LSTM for the identification of phosphorylated S/T and Y sites.

* However, the research paper takes care to note that these models are limited, “Due to sample size limitations. Considering that phosphorylation is a post-translational modification process on protein molecules, models learned on limited samples may not fully capture the characteristics of proteins, resulting in poor extrapolation ability. On the other hand, this can also lead to some common problems such as overfitting, thus failing to acquire the essential features for site identification.” (p1)

* The model works by treating amino acids within protein sequences as words, extracting unique encodings based on the types along with the position of amino acids in the sequence. It also incorporates embeddings from large pre-trained protein models as additional data inputs.

* PTransIPS is further trained on a combination model of convolutional neural network (CNN) with residual connections and a Transformer model equipped with multi-head attention mechanisms. At last, the model outputs classification results through a fully connected layer.

Why the model is relevant to Ersilia

1. Identifying phosphorylation sites can lead to the discovery of new therapeutic targets for diseases. This knowledge can be used to develop potential drug targets.

2. A deep understanding of phosphorylation has significant value in biomedical research, as it plays a crucial role in cellular processes and disease development.

3. Pharmaceutical companies and researchers working on drug discovery, especially for diseases like COVID-19, can use the model’s findings to explore new drug targets related to phosphorylation.

Working with the model

* The model’s data is provided in the **_‘data’_** folder in the GitHub repository.

* The model works by taking the sequence pre-trained embedding as input which is provided or that can be generated by running the provided `python model_train_test/pretrained_embedding_generate.py` file.

* The model checkpoints are provided and you can download the already generated model provided [here](https://onedrive.live.com/?authkey=%21ABQMKMz0oWnPnQ4&id=6F7A588E449ED6AC%21915&cid=6F7A588E449ED6AC)

* We are also provided with the option to generate the model again by running `./model_train_test/train.py` ,to train the PTransIPs model in `./model_train_test/PTransIPs_model.py`

* The current pre-trained model achieved AUROCs of _0.9232_ and _0.9660_ for identifying phosphorylated S/T and Y sites respectively.

* Running `./model_train_test/umap_test.py` will generate umap visualization figures. We can visualize data that we provide to the pre-trained model or the already existing data.

Hi @boazleleina, thank you for the very interesting paper suggestion! It seems from the literature that the model should work with simple SMILES strings, but upon inspection of the code, I could not identify what exactly the inputs and outputs for this model would be. Of course, I might have missed something, so I'm curious to know more about your understanding of the code. Thank you very much!

boazleleina commented 9 months ago

Hi @boazleleina thank you for very interesting paper suggestion! Although it seems from the literature that the model should work with simple SMILES strings, but upon inspection of the code, I could not identify what exactly would be the inputs and outputs for this model. Of course I might have missed something, so I'm curious to know more about your understanding of the code. Thank you very much!

Thank you for your response @DhanshreeA. From my understanding of the paper and code, the model is trained using labeled datasets of protein sequences, provided as strings, in which the locations of specific amino acids are marked. For this model they used Y sites and S/T sites, which are among the 20 standard amino acids commonly found in proteins. The sequence embeddings come from ProtTrans, which provides state-of-the-art pre-trained models for proteins that, according to its page, "were trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformers Models"; they provide a link to where they found this data here. The output of the model is the probability that amino acids at specific locations in a protein sequence are phosphorylation sites. The dataset is provided, but we can also generate it from the original location here. The model currently works with protein sequences as strings, but it is open to editing to find out whether it can work with other string inputs such as SMILES. Please let me know if this is within scope and if I need to clarify anything. Thank you.
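To make the input format concrete, here is a small sketch of the idea of treating amino acids in a sequence as word-like tokens (my own illustration of the concept, not PTransIPs' actual preprocessing code):

```python
# The 20 standard amino acids, one letter each; a protein sequence string
# is tokenized character by character, like words in a sentence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence):
    return [TOKEN_ID[aa] for aa in sequence]

# S, T and Y are the residues whose phosphorylation the model predicts.
print(encode("STY"))  # [15, 16, 19]
```

In the real pipeline these per-residue tokens are combined with the ProtTrans embeddings before being fed to the CNN/Transformer stack.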

kbatya commented 9 months ago

Task 2: Install the model in your system

IDL-PPBOPT RUNNING STEPS🏃‍♂️💨

To clone this project onto my computer, I opened the target location (my D:\ drive) in the Windows command prompt and ran: git clone https://github.com/Louchaofeng/IDL-PPBopt.git

The project has some package requirements that it needs to run including: 📦

  • python 3.7
  • pytorch 1.5.0
  • openbabel 2.4.1
  • rdkit
  • scikit learn
  • scipy
  • cairosvg
  • pandas
  • matplotlib
  • sklearn

Because most of the requirements are older versions of these packages, I created a conda virtual environment to run this project:

conda create --name IDLenv python=3.7
conda activate IDLenv

Installing Dependencies 🧩

  1. python 3.7 - Creating the conda environment running this Python version
  2. pytorch 1.5.0 - pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html

    • I used the -f flag to specify the URL where the PyTorch wheel files for version 1.5.0 are hosted, and used that URL to download the appropriate wheel file for my Windows 10 system.
  3. openbabel 2.4.1 - pip install openbabel openbableerror.log

    • After still facing issues installing openbabel with pip, I ran the command below, which fixed the issue: conda install -c openbabel openbabel
    • While researching the openbabel issue I found that running conda install -c conda-forge openbabel==2.4.1 is discouraged, as it forces updates of other modules in the environment; it is therefore safer to run the command above.
  4. rdkit - pip install rdkit
  5. scikit learn - pip install scikit-learn
  6. scipy - pip install scipy
  7. cairosvg - pip install cairosvg
  8. pandas - pip install pandas
  9. matplotlib - pip install matplotlib
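Before opening the notebook, it can help to confirm every dependency is importable in the IDLenv environment. A small sketch (the import names below are my assumptions; e.g. scikit-learn is imported as sklearn):

```python
# Sketch: list which required modules are missing from the active
# environment, without actually importing the heavy packages.
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found on this interpreter."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed import names for the packages installed above.
REQUIRED = ["torch", "openbabel", "rdkit", "sklearn", "scipy",
            "cairosvg", "pandas", "matplotlib", "numpy"]

if __name__ == "__main__":
    gaps = missing_modules(REQUIRED)
    print("Missing:", ", ".join(gaps) if gaps else "none")
```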

Running our model 🔗

The model was located on the IDL-PPBopt.ipynb notebook and this was the file I focused on:

  • The first cell raised an ipykernel package problem; I found a workaround by running: conda activate "D:\ERSILIA MODEL\.conda" followed by conda install -p "D:\ERSILIA MODEL\.conda" ipykernel --update-deps --force-reinstall

    • This command ensured that ipykernel was installed and that its dependencies were updated or reinstalled if necessary.
  • I also ran into ModuleNotFoundError: No module named 'torch'; to fix this I ran: pip install torch
  • Another error that was generated was

d:\ERSILIA MODEL.conda\lib\site-packages\torch\random.py:42: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at C:\actions- runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_numpy.cpp:77.) return default_generator.manual_seed(seed)

  • I realized that this was because I had not installed numpy, so I ran: pip install numpy
  • I also ran into a "setting module not found" error in the first cell of IDL-PPBopt.ipynb, which contains our model:
    • Commenting out the line of code importing the setting module, and the place where it is called, solved the issue, as the module is only used to raise warnings and is therefore not critical
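Instead of commenting the import out, an alternative that keeps the notebook intact is to make the import optional. This is my own suggestion, not something from the IDL-PPBopt repo:

```python
# Sketch: import a module if available, otherwise return None, so an
# optional dependency (like the notebook's warning-only 'setting'
# module) does not crash the first cell.
import importlib

def optional_import(name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

setting = optional_import("setting")
# Downstream code would then guard its use: if setting is not None: ...
```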

SOLVING CUDA ISSUE🔄

  • The biggest challenge was changing the device from 'cuda', which was the default, to 'cpu', which my computer runs on, and converting all references to avoid runtime errors.

Steps I took to change this:🌓

  1. In the AttentiveFP directory, I made changes to the files:

    • AttentiveLayers_viz.py
    • AttentiveLayers.py
  • I removed all cuda references in these files and changed them to their CPU equivalents. Example: torch.cuda.FloatTensor becomes torch.FloatTensor, and .cuda() calls are dropped
  2. In the notebook, there are several references to the cuda device; removing them enables the model to run on the CPU:
  • In the Related function cell, I made changes like: torch.cuda.LongTensor to torch.LongTensor
  • In the Load the model function cell:

    • I commented out the .cuda() calls, e.g. # model.cuda()
    • I also mapped the load location to the CPU by editing the line of code: best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')
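The edits above hard-code 'cpu'; a slightly more general pattern selects the device at runtime so the same notebook works with or without a GPU. The helper below is plain Python, and the torch calls it would feed are shown in comments (assuming the standard torch.cuda.is_available() API):

```python
# Sketch: pick the map_location string for torch.load() from a flag.
# In the notebook the flag would be torch.cuda.is_available(); it is a
# plain argument here so the logic can be shown without torch installed.

def pick_map_location(cuda_available: bool) -> str:
    """Device string for torch.load(map_location=...) / torch.device(...)."""
    return "cuda" if cuda_available else "cpu"

# Intended use in the notebook:
#   device = torch.device(pick_map_location(torch.cuda.is_available()))
#   best_model = torch.load(model_path, map_location=device)
#   model.to(device)  # replaces the unconditional model.cuda()
```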

Model Success🎉

The model successfully predicted the values:

predictedvalues.log

The graphs were generated by AttentiveFP and I could see the structure of the different substructure molecules

graphgenerated.log graphgeneratedimg.pdf

graph2generated.log graph2generatedimg.pdf

graph3generated.log graph3generatedimg.pdf

graph4generated.log graph4generatedimg.pdf

  • The model I ran locally found 8 second-level substructures, compared to the 10 second-level substructures shown on the GitHub page

substructures.log

  • The final cell also failed to run, but it was a simple syntax error: the parentheses of the write statement were placed incorrectly. I edited this and the code ran correctly, producing the expected “Results.smi” file Results_smi.log

with open('Results.smi', 'w') as f:
    f.write('SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n')
    for i in range(len(r)):
        f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' +
                str(r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')
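Since Results.smi is a tab-separated file with the header written above, it can be inspected programmatically with the csv module. A small sketch with made-up values (only the header matches the real file):

```python
# Sketch: parse a Results.smi-style TSV. The data row is invented;
# only the header line mirrors the write statement above.
import csv
import io

sample = ("SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n"
          "c1ccccc1\t3\t0.87\t0.1\t0.2\t0.3\t0.4\n")

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(rows[0]["SA_Fragment"], rows[0]["Score"])
```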

Thank you @boazleleina for your good work installing and running the IDL-PPBopt model; it helped me a lot and I really appreciate it!

boazleleina commented 9 months ago


You're very welcome, I am glad I could be of help😁

luiscamachocaballero commented 9 months ago

Hi @boazleleina! I followed your steps to overcome the CUDA problem, but I still keep having issues; I think a small thing is missing, and I'd appreciate your help. Below is the output error when I run the IDL-PPBopt.ipynb file:

AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_868072/887561112.py in <module>
     14 loss_function = nn.MSELoss()
     15 model = Fingerprint(radius, T, num_atom_features, num_bond_features,
---> 16             fingerprint_dim, output_units_num, p_dropout)
     17 #model.cuda()
     18 

~/IDL-PPBopt/Code/AttentiveFP/AttentiveLayers.py in __init__(self, radius, T, input_feature_dim, input_bond_dim, fingerprint_dim, output_units_num, p_dropout)
     10         super(Fingerprint, self).__init__()
     11         # graph attention for atom embedding
---> 12         self.atom_fc = nn.Linear(input_feature_dim, fingerprint_dim)
     13         self.neighbor_fc = nn.Linear(input_feature_dim+input_bond_dim, fingerprint_dim)
     14         self.GRUCell = nn.ModuleList([nn.GRUCell(fingerprint_dim, fingerprint_dim) for r in range(radius)])

~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/nn/modules/linear.py in __init__(self, in_features, out_features, bias)
     70         self.in_features = in_features
     71         self.out_features = out_features
---> 72         self.weight = Parameter(torch.Tensor(out_features, in_features))
     73         if bias:
     74             self.bias = Parameter(torch.Tensor(out_features))

~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/cuda/__init__.py in _lazy_init()
    147             raise RuntimeError(
    148                 "Cannot re-initialize CUDA in forked subprocess. " + msg)
--> 149         _check_driver()
    150         if _cudart is None:
    151             raise AssertionError(

~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/cuda/__init__.py in _check_driver()
     52 Found no NVIDIA driver on your system. Please check that you
     53 have an NVIDIA GPU and installed a driver from
---> 54 http://www.nvidia.com/Download/index.aspx""")
     55         else:
     56             # TODO: directly link to the alternative bin that needs install

AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
boazleleina commented 9 months ago

From the error I can see here @luiscamachocaballero, it seems the code is still trying to run on CUDA. Did you make the changes to the AttentiveFP files mentioned in my issue? Also, after commenting out model.cuda(), please replace it with model.cpu(). I have included a snapshot of the lines that seem to be giving you the error. Try copying my edited code into your file and see if it solves the issue. If all the cells before that ran without errors, then the edit should work. Feel free to reach out in case of any more errors.

loss_function = nn.MSELoss()
model = Fingerprint(radius, T, num_atom_features, num_bond_features,
                    fingerprint_dim, output_units_num, p_dropout)
model.cpu()

best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_' + '54' + '.pt',
                        map_location=torch.device('cpu'))

best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)

GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!