Closed: boazleleina closed this issue 8 months ago
After being accepted to the Outreachy program, I selected the Ersilia project after reading up on all the available projects, as I connected with their research, especially their use of ML for disease research. As a long-time supporter of ML for good, I feel this project is doing wonderful work, and I hope to be part of this amazing journey.
I also successfully created this issue🎉
As I am running Windows OS on my machine, I will be using WSL with Ubuntu 22.04.3 LTS
I am also running:
- conda version 23.9.0
- Python 3.7.16
I successfully ran ersilia and installed all the required prerequisites. I followed the instructions found here.
I installed and activated ersilia, and I also installed the Isaura data lake as part of the prerequisites.
I installed the Ersilia Python Package by running:
git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
pip install -e .
I installed Docker, ran it, and noted a container running after I fetched the model.
Running a simple model on ersilia
I checked the model catalog using:
ersilia catalog
I then ran the following commands, which completed without errors:
ersilia -v fetch eos3b5e
ersilia serve eos3b5e
However, after running the command:
ersilia -v run -i "CCCC"
I ran into the error TypeError: object of type 'NoneType' has no len()
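For context, this error class is easy to reproduce in plain Python: it appears whenever len() is applied to a value that is still None (for example, a result that was never populated). A minimal illustration, not Ersilia code:

```python
# Minimal reproduction of the error class: calling len() on a value
# that is None raises exactly this TypeError.
result = None  # e.g. an output object that was never populated

try:
    n = len(result)
except TypeError as exc:
    print(exc)  # object of type 'NoneType' has no len()
```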
Hello @boazleleina, welcome to Ersilia! For the contribution period, take into account the following:
- Do not paste images, you can attach a .log file, it is easier to review errors.
- Please provide a description of your system and development environment, Ubuntu version, Python version and Conda version.
- Do not modify the base code. Create another environment and install ersilia.
Thanks.
This is well noted @carcablop, thank you. I am making the required changes and editing my issue
Hi, The base code should not be changed. So the changes shouldn't be made. Kindly take this into account
Thank you @leilayesufu, I will take this into account when making my changes. I appreciate it.
I made the requested changes and included the log files. After reinstalling the environment and running the code again, I ran into the same error.
Hi, did you specify your Python version?
conda create -n ersilia python=3.7
Yes I did. I also reinstalled the environment and started from scratch, but it still generated the same error.
I have been facing the error: TypeError: object of type 'NoneType' has no len()
Steps to recreate the error:
conda create -n ersilia python=3.10
conda activate ersilia
python -m pip install isaura==0.1
git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
pip install -e .
ersilia -v fetch eos3b5e
ersilia serve eos3b5e
Up to this point, all the commands ran as expected without producing any errors; the main issue was with the command:
ersilia -v run -i "CCCC"
Below is the log file of the error: runmodel.log
Hi @boazleleina, thanks for your efforts. It seems other users running Ersilia on Ubuntu within WSL are also facing a similar issue. While Ersilia supports Python versions >=3.7, and it should work with a reasonably old version of conda, there may be some issues specific to WSL. Could you do the following and report your progress here?
- Reinstall conda and reinstall Ersilia with Python 3.7
- Try installing Ersilia with a version of Python greater than 3.7
I retried the steps using Miniconda and Python 3.10, and the error still persists. I deleted everything, created a new environment using Miniconda, and retried; the result is the same error.
The error seems to be persistent for wsl: TypeError: object of type 'NoneType' has no len()
I retried the steps mentioned above using Miniconda with both Python 3.10 and 3.7. Below is the log file while using Miniconda and Python 3.7: logfile.log
Hi @Boadiwaa. Try uninstalling the isaura package and running the model again. It seems that Isaura is causing that error.
After following instructions from @carcablop, the model ran successfully. Steps I took:
Uninstalled Isaura
pip uninstall isaura==0.1
Ran the fetch model
ersilia -v fetch eos3b5e
Served the model
ersilia serve eos3b5e
Ran the model
ersilia -v api run -i "CCCC"
The above steps worked on WSL running Python 3.7.
Task 4: Write a Motivation Statement to work at Ersilia
I am writing to express my excitement about the opportunity to be part of the Ersilia team. I believe in the strong alignment between my journey in AI/ML and the objectives of Ersilia. I am a graduate Software Engineer and AI/ML practitioner with experience working with machine learning and deep learning models. In my time in the field, I have been a staunch believer in AI for good, as I believe we can use this technology to bring positive change to society and take us toward a better future. I have had the privilege of working on projects aligned with this goal. Working on a machine learning project to track the travel patterns of pastoralist communities and their animals in Northern Kenya was one of my most fulfilling experiences. This effort was essential in helping authorities plan and allocate resources, such as security and medication, along their routes. This first-hand encounter made clear to me the ability of technology to significantly improve people's lives.

Growing up in a marginalized community with limited resources, I have always believed in the responsibility we all share to support those in more vulnerable positions. This principle is at the core of Ersilia's mission to support research in Low-Income Countries (LICs). It deeply resonates with me, as I have seen the challenges faced by communities in such regions. I am genuinely excited about the prospect of contributing to research efforts that can improve healthcare and create a more equitable world. I have also witnessed the devastating impact of inadequately researched diseases like Rift Valley Fever on my own family, through the loss of my grandfather, and on the greater community. The lack of sufficient research and data on these rare diseases has left many vulnerable. I therefore wholeheartedly believe in the potential of the work being done by Ersilia to combat future outbreaks and save lives.
I am eagerly looking forward to the research that will be conducted, knowing that it can make a significant difference, not only now but for generations to come. If granted the opportunity to intern at Ersilia, I am fully committed to continuing the outstanding work of creating AI/ML models focused on disease research. I will bring to the table not only my technical skills but also a deep sense of purpose and dedication. I understand that the work done here has the power to transform lives and leave a lasting impact on society. I am determined to continue my career in artificial intelligence after the internship is over; my ultimate objective is to create a setting where AI brings people together to raise the standard of living. I am really excited about the chance to work with like-minded people and truly make a difference.
Task 5: Submit your first contribution to the Outreachy site
I made my first contribution to the Outreachy website, and the contribution was recorded.
Task 1: Select a model from the suggested list
I selected Plasma Protein Binding (IDL-PPBopt)
Reason for choosing the model
The focus of this research is on understanding and manipulating how chemical compounds bind to human plasma proteins. This is a critical aspect of pharmacology and drug development because it affects the distribution and activity of drugs in the human body. In a world with so many drug options, each carrying risks and side effects, this research helps determine how each drug affects the body and guides the development of better drugs to combat diseases in the future. The research employs deep learning techniques, a subset of machine learning, to accomplish the prediction and optimization goals, which will be an interesting challenge for me. I look forward to working with the AttentiveFP algorithm, a graph attention architecture designed for the prediction of molecular properties, particularly in the field of chemoinformatics and drug discovery. It leverages a deep learning architecture to capture and analyze the structural information of chemical compounds in a way that is suitable for property prediction tasks. AttentiveFP is built on Graph Neural Networks, which is quite interesting, as chemical compounds can be represented as graphs, where atoms are nodes and chemical bonds are edges. This can be seen in some outputs in the notebook, which makes it easier to understand and visualize.
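The atoms-as-nodes, bonds-as-edges view can be illustrated without any chemistry libraries; below, butane (SMILES CCCC) is written out by hand as a small graph. This is only a sketch of the representation such GNNs consume, not AttentiveFP's actual featurization:

```python
# Hand-built graph for butane, SMILES "CCCC": 4 carbon atoms in a chain.
# Atoms become nodes, bonds become edges -- the representation GNNs consume.
atoms = ["C", "C", "C", "C"]        # node labels
bonds = [(0, 1), (1, 2), (2, 3)]    # undirected edges (single bonds)

# Adjacency list: the form most GNN message-passing steps iterate over.
adjacency = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

print(adjacency)  # {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```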
Hi @boazleleina thanks for the detailed updates! Could you comment on the difference between the model you ran locally and the one present within the hub? Specifically which substructures are different and/or missing between the two outputs?
I used the provided Essential Medicines List in the form of the eml_canonical.csv file and loaded the dataset into my model using a notebook file I created, eml_canonical.ipynb. This file had the essential packages and only the prediction code, copied from the original IDL-PPBopt.ipynb file, which I used to run the new dataset.
My model prediction for the eml_canonical also ran within the same virtual environment and inherited from the same dependencies.
I edited the file running predictions for the Essential Medicines List to point at the smiles column instead of cano_smiles, which is the appropriate column name for the eml_canonical.csv file I used to run the predictions.
The model ran successfully and made predictions for the drugs using the smiles column.
The successful log and the predictions (stored in a CSV) are attached below🔽
eml_canonical_predictions.csv
The total number of predictions is 433; this is made clear in the log file, as 9 compounds could not be featurized, as noted in the recorded output.

Explanation of the outcome

The model predicts the likelihood of various drugs binding to proteins in human blood plasma using a deep learning model. The predicted values range between 0 and 1, with values closer to 1 showing a higher probability of the drug binding to the proteins in human blood plasma and those closer to 0 showing a lower probability.
This model is a neural network used for the regression task, specifically AttentiveFP, which is built on a Graph Neural Network with feedforward propagation and uses several activation functions, including linear, LeakyReLU, ReLU, and softmax.
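The 0-to-1 outputs described above can be made concrete with a small sketch of how one might bin such predictions. The 0.5 cut-off and the labels are my own illustrative assumptions, not part of the model's output specification:

```python
# Illustrative post-processing of PPB-style binding probabilities.
# The threshold and labels are assumptions for demonstration only.
def bind_label(p, threshold=0.5):
    """Map a 0-1 binding probability to a coarse label."""
    if p is None:
        return "not featurized"  # e.g. compounds that failed featurization
    return "likely binder" if p >= threshold else "likely non-binder"

preds = [0.92, 0.11, None, 0.55]
print([bind_label(p) for p in preds])
# ['likely binder', 'likely non-binder', 'not featurized', 'likely binder']
```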
I launched docker desktop to activate docker commands within WSL. The docker I am running has WSL 2 based engine activated, thus launching the application automatically connects to WSL.
I ran the following command to pull the IDL-PPB image from the Ersilia Hub into Docker:

docker pull ersiliaos/eos22io

The model pulled successfully and began running; the logs of the process are attached below:
I copied the eml csv file to the docker container using the code:
docker cp /mnt/c/Users/Administrator/Downloads/eml_canonical.csv jovial_hodgkin:/root
After copying the csv file to the docker environment, I opened the docker interactive environment using:
docker exec -it jovial_hodgkin /bin/bash
After executing the docker container and opening the interactive environment, I ran predictions for the eml_canonical dataset successfully:
ersilia -v api run -i eml_canonical.csv -o eml_ersilia_output.csv
To copy the prediction values from Docker to my local computer, I ran:
docker cp jovial_hodgkin:/root/eml_ersilia_output.csv /mnt/c/Users/Administrator/Downloads
Differences between the original code and the Ersilia Model Hub prediction values:

The model from the Ersilia Hub ran 442 predictions. This means that the 9 compounds that could not be featurized in the original code were included in the Hub output, with their prediction values equal to NULL. The nine compounds are listed below:
key | input |
---|---|
FAQLAUHZSGTTLN-UHFFFAOYSA-N | [CaH2] |
KRHYYFGTRYWZRS-UHFFFAOYSA-M | [F-]l |
ZCYVEMRRCGMTRW-UHFFFAOYSA-N | [I] |
XLYOFNOQVPJJNP-UHFFFAOYSA-N | O |
WCUXLLCKKVVCTQ-UHFFFAOYSA-M | [Cl-].[K+] |
NLKNQRATVPKPDG-UHFFFAOYSA-M | [K+].[I-] |
RWSOTUBLDIXVET-UHFFFAOYSA-N | S |
FJKGRAZQBBWYLG-UHFFFAOYSA-M | N.N.[F-].[Ag+] |
FAPWRFPIFSIZLT-UHFFFAOYSA-M | [Na+].[Cl-] |
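Such NULL rows can be located programmatically with Python's csv module. The column name `outcome` and the NULL conventions below are my assumptions about the Hub's CSV layout, so adjust them to the real header:

```python
import csv
import io

# Sketch: find inputs whose prediction came back empty/NULL in the Hub
# output. An in-memory sample stands in for eml_ersilia_output.csv;
# column names are assumed for illustration.
sample = io.StringIO(
    "key,input,outcome\n"
    "FAQLAUHZSGTTLN-UHFFFAOYSA-N,[CaH2],\n"
    "SOME-OTHER-KEY,CCO,0.42\n"
)

null_rows = [row["key"] for row in csv.DictReader(sample)
             if row["outcome"] in ("", "NULL", "null", None)]
print(null_rows)  # ['FAQLAUHZSGTTLN-UHFFFAOYSA-N']
```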
Great job @boazleleina thank you for the updates!
Model Name: Controlled peptide generation
Model Link🔗: https://github.com/IBM/controlled-peptide-generation/tree/master
Model License 📑: Apache-2.0 license
This model uses the PyTorch framework and is generated from the research paper “Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics” by Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Sebastian Gehrmann, Flaviu Cipcigan, Vijil Chenthamarakshan, Hendrik Strobelt, Cicero dos Santos, Pin-Yu Chen, Yi Yan Yang, Jeremy Tan, James Hedrick, Jason Crain, and Aleksandra Mojsilovic, submitted on 22nd May 2020 and revised on 26th Feb 2021, in the Machine Learning section of arxiv.org.
Paper Link🔗: https://arxiv.org/abs/2005.11248
Overview
Why the model is relevant to Ersilia
Working with the model
Model name: Multitask-toxicity
Model Link🔗: https://github.com/IBM/multitask-toxicity#accurate--clinical-toxicity-prediction-using-multi-task-deep-neural-nets-and-contrastive-molecular-explanations
Model License 📑: Apache-2.0 license
This model uses the PyTorch framework and is generated from the research paper “Accurate Clinical Toxicity Prediction using Multi-task Deep Neural Nets and Contrastive Molecular Explanations” by Bhanushee Sharma, Vijil Chenthamarakshan, Amit Dhurandhar, Shiranee Pereira, James A. Hendler, Jonathan S. Dordick, and Payel Das.
Paper Link🔗: https://arxiv.org/abs/2204.06614
Overview
Why the model is relevant to Ersilia
Working with the model
Task 2 - Suggest a new model and document it (2)
Model name: PTransIPs
Model Link 🔗: https://github.com/StatXzy7/PTransIPs/tree/main
Model License 📑: None
This model, built on the PyTorch framework, is generated from the research paper “PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer” by Ziyang Xu and Haitian Zhong, submitted on 8th Aug 2023 and revised on 18th Aug 2023.
Paper Link 🔗: https://arxiv.org/abs/2308.05115
Overview
* PTransIPs focuses on identifying certain chemical changes in proteins. These chemical changes, called phosphorylation, play a big role in how our cells work and can affect various diseases.
* Finding and understanding these changes is important because it can help us learn more about how cells function and how diseases develop. It might even help us discover new ways to treat diseases.
* The PTransIPs model works by looking at the building blocks of proteins (amino acids) and their arrangement. It also uses data from other big computer models that know a lot about proteins. All this information helps PTransIPs figure out where phosphorylation happens in the protein.
* The research paper mentions examples of previous models, such as MusiteDeep2017, which employs a CNN model for the identification of kinase-specific phosphorylation sites. Following this came DeepPhos, a model that also utilizes CNNs for identifying general phosphorylation sites, and then MusiteDeep2020, which utilizes a CapsNet for the identification of phosphorylated S/T and Y sites. Another model mentioned is DeepIPs, which incorporates both a CNN and an LSTM for the identification of phosphorylated S/T and Y sites.
* However, the research paper takes care to note that these models are limited “Due to sample size limitations. Considering that phosphorylation is a post-translational modification process on protein molecules, models learned on limited samples may not fully capture the characteristics of proteins, resulting in poor extrapolation ability. On the other hand, this can also lead to some common problems such as overfitting, thus failing to acquire the essential features for site identification.” (p1)
* The model works by treating amino acids within protein sequences as words, extracting unique encodings based on the types and positions of amino acids in the sequence. It also incorporates embeddings from large pre-trained protein models as additional data inputs.
* PTransIPs is further trained on a combination of a convolutional neural network (CNN) with residual connections and a Transformer equipped with multi-head attention mechanisms. Finally, the model outputs classification results through a fully connected layer.
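The "amino acids as words" encoding described above can be sketched in a few lines. This toy (token, position) scheme is my own illustration of the idea, not PTransIPs' actual tokenizer:

```python
# Toy encoding of a peptide: each amino acid becomes a (token_id, position)
# pair, mirroring the "words + positions" idea described in the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
TOKEN = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Map a protein sequence string to (token_id, position) pairs."""
    return [(TOKEN[aa], pos) for pos, aa in enumerate(seq)]

print(encode("STY"))  # [(15, 0), (16, 1), (19, 2)]
```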
Why the model is relevant to Ersilia
1. Identifying phosphorylation sites can lead to the discovery of new therapeutic targets for diseases. This knowledge can be used to develop potential drug targets.
2. A deep understanding of phosphorylation has significant value in biomedical research, as it plays a crucial role in cellular processes and disease development.
3. Pharmaceutical companies and researchers working on drug discovery, especially for diseases like COVID-19, can use the model's findings to explore new drug targets related to phosphorylation.
Working with the model
* The model’s data is provided in the **_‘data’_** folder in the GitHub repository.
* The model takes the sequence pre-trained embedding as input, which is provided or can be generated by running the provided `python model_train_test/pretrained_embedding_generate.py` file.
* The model checkpoints are provided, and you can download the already generated model [here](https://onedrive.live.com/?authkey=%21ABQMKMz0oWnPnQ4&id=6F7A588E449ED6AC%21915&cid=6F7A588E449ED6AC).
* We also have the option to regenerate the model by running `./model_train_test/train.py`, which trains the PTransIPs model defined in `./model_train_test/PTransIPs_model.py`.
* The current pre-trained model achieved AUROCs of _0.9232_ and _0.9660_ for identifying phosphorylated S/T and Y sites respectively.
* Running `./model_train_test/umap_test.py` will generate UMAP visualization figures. We can visualize data that we provide to the pre-trained model or the already existing data.
Hi @boazleleina, thank you for the very interesting paper suggestion! Although it seems from the literature that the model should work with simple SMILES strings, upon inspection of the code I could not identify what exactly the inputs and outputs for this model would be. Of course I might have missed something, so I'm curious to know more about your understanding of the code. Thank you very much!
Thank you for your response @DhanshreeA. From my understanding of the paper and code, the model is trained using labeled datasets: locations within a protein sequence where specific amino acids are found, represented as strings. For this model, they used Y sites and S/T sites, some of the 20 standard amino acids commonly found in proteins. These amino acid sites have already been pre-trained by ProtTrans, which provides state-of-the-art pre-trained models for proteins that, according to the page, "were trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformers Models", and they provide a link to where they found this data here. The output of the model is the probability of amino acids being found at specific locations in the protein sequence. The dataset is provided, but we can also generate it from the original location here. The model currently works with protein sequences as strings, but it is open to editing to find out whether it can work with other string inputs like SMILES. Please let me know if this is within scope and if I need to clarify anything. Thank you.
Task 2: Install the model in your system
IDL-PPBopt RUNNING STEPS🏃♂️💨
To clone this project onto my computer, I opened my D:\ drive from the Windows command prompt and ran:
git clone https://github.com/Louchaofeng/IDL-PPBopt.git
The project has some package requirements that it needs to run including: 📦
- python 3.7
- pytorch 1.5.0
- openbabel 2.4.1
- rdkit
- scikit learn
- scipy
- cairosvg
- pandas
- matplotlib
- sklearn
Because most of the requirements here are older versions of these applications, I installed a virtual environment using conda to run this project:
conda create --name IDLenv python=3.7
conda activate IDLenv
Installing Dependencies 🧩
- python 3.7 - Creating the conda environment running this Python version
- pytorch 1.5.0 -
pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html
- " I used the -f flag is used to specify the URL where the PyTorch wheel files for version 1.5.0 are hosted. I used this URL to download the appropriate wheel file for my Windows 10 system."
- openbabel 2.4.1 -
pip install openbabel
openbableerror.log
- After still facing issues installing openbabel with pip, I ran the command below, which fixed the issue:
conda install -c openbabel openbabel
- In my research to fix the openbabel issue I found that it is discouraged to run:
conda install -c conda-forge openbabel==2.4.1
as it forces updates of other modules in the environment; it is thus safer to run the command above.
- rdkit -
pip install rdkit
- scikit learn -
pip install scikit-learn
- scipy -
pip install scipy
- cairosvg -
pip install cairosvg
- pandas -
pip install pandas
- matplotlib -
pip install matplotlib
Running our model 🔗
The model was located on the IDL-PPBopt.ipynb notebook and this was the file I focused on:
The first cell generated an ipykernel package problem, I found a workaround by running:
conda activate "D:\ERSILIA MODEL\.conda"
conda install -p "D:\ERSILIA MODEL\.conda" ipykernel --update-deps --force-reinstall
- This command ensured that ipykernel is installed and its dependencies are updated or reinstalled if necessary.
- I also ran into ModuleNotFoundError: No module named 'torch'. To fix this, I ran:
pip install torch
- Another error that was generated was
d:\ERSILIA MODEL.conda\lib\site-packages\torch\random.py:42: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at C:\actions- runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_numpy.cpp:77.) return default_generator.manual_seed(seed)
- I realized that was because I had not installed numpy and just ran the code:
pip install numpy
- I also ran across error "setting module not found" in the first cell of IDL-PPBopt.ipynb which contains our model:
- Commenting out the line of code importing the setting module and where it is called from solved the issue as it is only used to raise warnings and thus not critical
SOLVING CUDA ISSUE🔄
- The biggest challeng was changing the device processor 'cuda' which was the default, to 'cpu' which my computer is running and converting all references to avoid runtime error.
Steps I took to change this:🌓
In the folders in AttentiveFP directory made changes to the files:
- AttentiveLayers_viz.py
- AttentiveLayers.py
- I removed all cuda references in these files and changed them to 'cpu' Example:
torch.cuda.sum()
would becometorch.sum()
,torch.cpu.sum()
would also work
- In the notebook, there are several references to the cuda processor, removing reference to cuda enables the module run on the cpu:
- In the Related function cell, I made changes like:
torch.cuda.LongTensor
totorch.LongTensor
In Load the model function cell:
- I commented out the lines
# Remove.cuda() calls
# model.cuda()
- I also mapped location to run the model on the cpu by editing the line of code:
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')
Model Success🎉
The model successfully predicted the values:
The graphs were generated by AttentionFP and I could see the structure of different substructure molecules
graphgenerated.log graphgeneratedimg.pdf
graph2generated.log graph2generatedimg.pdf
graph3generated.log graph3generatedimg.pdf
graph4generated.log graph4generatedimg.pdf
- The model I ran locally found 8 second-level substructures as compared to the model in the GitHub page which ran 10 second-level substructures
- The final cell also had an issue running but it was a simple syntax error with the parantheses of the write statement placed incorrectly, I edited this and the code ran correctly, I got the expected “Results.smi” file Results_smi.log
with open('Results.smi', 'w') as f: f.write('SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n') for i in range(len(r)): f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' + str( r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')
Thank you @boazleleina for your good job in installing and running the IDL-PPBopt model, it helps me a lot, I am very appreciating it!
Task 2: Install the model in your system
IDL-PPBOT RUNNING STEPS🏃♂️💨
To clone this project to my computer, I opened the target location (my D:\ drive) in the Windows command prompt and ran:
git clone https://github.com/Louchaofeng/IDL-PPBopt.git
The project has some package requirements that it needs to run including: 📦
- python 3.7
- pytorch 1.5.0
- openbabel 2.4.1
- rdkit
- scikit learn
- scipy
- cairosvg
- pandas
- matplotlib
Because most of the requirements here are older versions of these packages, I created a conda virtual environment to run this project:
conda create --name IDLenv python=3.7
conda activate IDLenv
Installing Dependencies 🧩
- python 3.7 - covered by creating the conda environment with this Python version
- pytorch 1.5.0 -
pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html
- The -f flag specifies the URL where the PyTorch wheel files for version 1.5.0 are hosted; I used this URL to download the appropriate wheel file for my Windows 10 system.
- openbabel 2.4.1 -
pip install openbabel
openbableerror.log
- After pip still failed to install openbabel, I ran the command below, which fixed the issue:
conda install -c openbabel openbabel
- While researching the openbabel issue, I found that running:
conda install -c conda-forge openbabel==2.4.1
is discouraged, as it forces updates of other modules in the environment; it is therefore safer to run the command above.
- rdkit -
pip install rdkit
- scikit learn -
pip install scikit-learn
- scipy -
pip install scipy
- cairosvg -
pip install cairosvg
- pandas -
pip install pandas
- matplotlib -
pip install matplotlib
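With all of the above installed, a quick sanity check can confirm nothing is missing before opening the notebook. This is a minimal sketch; the names are the import names of the packages listed above (note that scikit-learn is imported as `sklearn`):

```python
import importlib.util

# Import names for the packages installed above
required = ["torch", "rdkit", "sklearn", "scipy",
            "cairosvg", "pandas", "matplotlib"]

# find_spec returns None when a module cannot be located
missing = [name for name in required
           if importlib.util.find_spec(name) is None]

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running this inside the activated IDLenv environment surfaces any missed install in one step, rather than discovering each ModuleNotFoundError cell by cell.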
Running our model 🔗
The model is located in the IDL-PPBopt.ipynb notebook, and this was the file I focused on.
The first cell raised an ipykernel package error; I found a workaround by running:
conda activate "D:\ERSILIA MODEL\.conda"
conda install -p "D:\ERSILIA MODEL\.conda" ipykernel --update-deps --force-reinstall
- This command ensured that ipykernel is installed and its dependencies are updated or reinstalled if necessary.
- I also ran into ModuleNotFoundError: No module named 'torch'; to fix this I ran:
pip install torch
- Another error that was generated was
d:\ERSILIA MODEL.conda\lib\site-packages\torch\random.py:42: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at C:\actions- runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_numpy.cpp:77.) return default_generator.manual_seed(seed)
- I realized this was because I had not installed numpy, so I ran:
pip install numpy
- I also ran into a "setting module not found" error in the first cell of IDL-PPBopt.ipynb, which contains our model.
- Commenting out the line importing the setting module, and the places where it is called, solved the issue, as the module is only used to raise warnings and is not critical.
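Rather than deleting the import outright, the same effect can be had by guarding it, so the notebook still works if the module is present. This is a hedged sketch: `setting` stands in for the notebook's optional module, and `warn` is a hypothetical helper, not code from the repository.

```python
# Guard the optional `setting` import instead of deleting it; the module
# is only used to raise warnings, so its absence is not critical.
try:
    import setting  # optional; may not exist in a minimal environment
except ImportError:
    setting = None

def warn(message):
    # Hypothetical helper: only emit warnings when `setting` is available
    if setting is not None:
        print("warning:", message)
```

With this guard in place, the rest of the notebook can call `warn(...)` unconditionally and it simply becomes a no-op when the module is missing.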
SOLVING CUDA ISSUE🔄
- The biggest challenge was changing the device from 'cuda', the default, to 'cpu', which my computer runs on, and converting all references to avoid runtime errors.
Steps I took to change this:🌓
In the AttentiveFP directory, I made changes to the files:
- AttentiveLayers_viz.py
- AttentiveLayers.py
- I removed all cuda references in these files and changed them to their CPU equivalents. For example, torch.cuda.sum() would become torch.sum()
- In the notebook, there are several references to cuda; removing them lets the model run on the cpu:
- In the Related function cell, I made changes such as torch.cuda.LongTensor to torch.LongTensor
In the Load the model function cell:
- I commented out the lines:
# Remove .cuda() calls
# model.cuda()
- I also mapped location to run the model on the cpu by editing the line of code:
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')
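The steps above can be collected into a single device-agnostic pattern, so the same notebook runs with or without a GPU. This is a sketch, not the notebook's code: a small `torch.nn.Linear` stands in for the real `Fingerprint` model, and the checkpoint load is shown commented out with the path from the original code.

```python
import torch

# Choose cuda only when it is actually available; otherwise fall back to cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small stand-in module instead of the notebook's Fingerprint model
model = torch.nn.Linear(4, 2)
model.to(device)  # replaces the unconditional model.cuda() call

# map_location moves all checkpoint tensors onto the chosen device, which is
# what lets a GPU-trained checkpoint load on a CPU-only machine:
# best_model = torch.load(
#     'saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_' + '54' + '.pt',
#     map_location=device)

x = torch.randn(1, 4, device=device)
y = model(x)
print(device.type, tuple(y.shape))
```

With this pattern there is nothing to comment in or out when switching machines; the one `device` variable controls both the model and the checkpoint loading.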
Model Success🎉
The model successfully predicted the values: predictedvalues.log
The graphs were generated by AttentiveFP and I could see the structure of the different substructure molecules:
graphgenerated.log graphgeneratedimg.pdf
graph2generated.log graph2generatedimg.pdf
graph3generated.log graph3generatedimg.pdf
graph4generated.log graph4generatedimg.pdf
- The model I ran locally found 8 second-level substructures, compared to the model on the GitHub page, which found 10 second-level substructures.
- The final cell also had an issue running, but it was a simple syntax error, with the parentheses of the write statement placed incorrectly. I edited this and the code ran correctly, producing the expected “Results.smi” file: Results_smi.log
with open('Results.smi', 'w') as f:
    f.write('SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n')
    for i in range(len(r)):
        f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' + str(r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')
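For reference, the same write can be expressed more readably by joining the fields per record. This sketch uses a hypothetical one-record `r` list in place of the notebook's actual results and prints instead of writing the file:

```python
# Hypothetical stand-in for the notebook's `r` results list
r = [{"SA": "c1ccccc1", "Non_SAs": 2, "score": 0.91,
      "RES": 1, "CES": 0, "ZES": 3, "NTS": 4}]

header = "SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS"
keys = ["SA", "Non_SAs", "score", "RES", "CES", "ZES", "NTS"]

lines = [header]
for record in r:
    # One tab-separated row per record, fields in header order
    lines.append("\t".join(str(record[k]) for k in keys))

output = "\n".join(lines) + "\n"
# To produce the actual file, replace the print with:
# with open("Results.smi", "w") as f: f.write(output)
print(output)
```

Keeping the column order in a single `keys` list makes it harder for the header and the data rows to drift out of sync than repeating seven concatenations.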
Thank you @boazleleina for your good work installing and running the IDL-PPBopt model; it helped me a lot and I really appreciate it!
You're very welcome, I am glad I could be of help😁
Hi @boazleleina! I followed your steps to overcome the CUDA problem, but I still keep having issues; I think a small thing is missing, and I'd appreciate your help. Below is the error output when I run the IDL-PPBopt.ipynb file:
AssertionError Traceback (most recent call last)
/tmp/ipykernel_868072/887561112.py in <module>
14 loss_function = nn.MSELoss()
15 model = Fingerprint(radius, T, num_atom_features, num_bond_features,
---> 16 fingerprint_dim, output_units_num, p_dropout)
17 #model.cuda()
18
~/IDL-PPBopt/Code/AttentiveFP/AttentiveLayers.py in __init__(self, radius, T, input_feature_dim, input_bond_dim, fingerprint_dim, output_units_num, p_dropout)
10 super(Fingerprint, self).__init__()
11 # graph attention for atom embedding
---> 12 self.atom_fc = nn.Linear(input_feature_dim, fingerprint_dim)
13 self.neighbor_fc = nn.Linear(input_feature_dim+input_bond_dim, fingerprint_dim)
14 self.GRUCell = nn.ModuleList([nn.GRUCell(fingerprint_dim, fingerprint_dim) for r in range(radius)])
~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/nn/modules/linear.py in __init__(self, in_features, out_features, bias)
70 self.in_features = in_features
71 self.out_features = out_features
---> 72 self.weight = Parameter(torch.Tensor(out_features, in_features))
73 if bias:
74 self.bias = Parameter(torch.Tensor(out_features))
~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/cuda/__init__.py in _lazy_init()
147 raise RuntimeError(
148 "Cannot re-initialize CUDA in forked subprocess. " + msg)
--> 149 _check_driver()
150 if _cudart is None:
151 raise AssertionError(
~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/cuda/__init__.py in _check_driver()
52 Found no NVIDIA driver on your system. Please check that you
53 have an NVIDIA GPU and installed a driver from
---> 54 http://www.nvidia.com/Download/index.aspx""")
55 else:
56 # TODO: directly link to the alternative bin that needs install
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
From the error I can read here @luiscamachocaballero, it seems the code is still trying to run on cuda. Did you make the changes to the AttentiveFP files mentioned in my issue? Also, after commenting out model.cuda(), please replace it with model.cpu(). I have included a snapshot of the lines that seem to be giving you the error. Try copying my edited code into your file and see if it solves the issue. If all the cells before that ran without errors, then the edit should work. Feel free to reach out in case of any more errors.
loss_function = nn.MSELoss()
model = Fingerprint(radius, T, num_atom_features, num_bond_features, fingerprint_dim, output_units_num, p_dropout)
model.cpu()
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_' + '54' + '.pt', map_location=torch.device('cpu'))
best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application