ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Boaz Leleina #830

Closed boazleleina closed 8 months ago

boazleleina commented 9 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

boazleleina commented 9 months ago

I joined the Slack channel. After being accepted to the Outreachy program, I selected the Ersilia project after reading up on all the projects, as I connected with their research, especially the application of ML to disease research. As a long-time supporter of ML for good, I feel this project is doing wonderful work, and I hope to be part of this amazing journey.

I also successfully created this issue🎉

boazleleina commented 9 months ago

As I am running Windows OS on my machine, I will be using WSL with Ubuntu 22.04.3 LTS

ubuntuversion.log

I also successfully ran ersilia and installed all the required prerequisites, following the instructions found here.

GITLFS.log

I installed and activated ersilia, and I also installed the Isaura data lake from the prerequisites.

Isaura.log

I installed the Ersilia Python Package by running:

git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
pip install -e .

developer.log

I installed Docker, ran it, and confirmed a container was running after I fetched the model.

boazleleina commented 9 months ago

Running a simple model on ersilia

I checked the model catalog using: ersilia catalog

catalog.log

I ran the following commands, and they completed without errors:

ersilia -v fetch eos3b5e
ersilia serve eos3b5e

serve.log

However, after running the command ersilia -v run -i "CCCC"

I ran into the error: TypeError: object of type 'NoneType' has no len()

runmodel.log
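For context, this is Python's generic error when `len()` is applied to `None`; a minimal sketch (not Ersilia's actual code) showing how an upstream function that returns `None` triggers exactly this message:

```python
# Minimal sketch (not Ersilia's code): the same TypeError appears whenever
# len() is called on a value that an upstream call left as None.
def fetch_result():
    return None  # stands in for a lookup that silently failed

result = fetch_result()
try:
    print(len(result))
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()
```

In other words, the error usually points at some earlier step returning nothing rather than at `len()` itself.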

leilayesufu commented 9 months ago

Hi, the base code should not be changed, so those changes shouldn't be made. Kindly take this into account.

carcablop commented 9 months ago

Hello @boazleleina, welcome to Ersilia! For the contribution period, take into account the following:

  • Do not paste images; you can attach a .log file, which makes it easier to review errors.
  • Please provide a description of your system and development environment: Ubuntu version, Python version, and Conda version.
  • Do not modify the base code. Create another environment and install ersilia there.

Thanks.

boazleleina commented 9 months ago

Hello @boazleleina Welcome to Ersilia!. For the contribution period, take into account the following:

  • Do not paste images, you can attach a .log file, it is easier to review errors.
  • Please provide a description of your system and development environment, Ubuntu version, Python version and Conda version.
  • Do not modify the base code. Create another environment and install ersilia.

Thanks.

This is well noted @carcablop, thank you. I am making the required changes and editing my issue

boazleleina commented 9 months ago

Hi, The base code should not be changed. So the changes shouldn't be made. Kindly take this into account

Thank you @leilayesufu, I will take this into account when making my changes. I appreciate it.

boazleleina commented 9 months ago

Hello @boazleleina Welcome to Ersilia!. For the contribution period, take into account the following:

  • Do not paste images, you can attach a .log file, it is easier to review errors.
  • Please provide a description of your system and development environment, Ubuntu version, Python version and Conda version.
  • Do not modify the base code. Create another environment and install ersilia.

Thanks.

I made the requested changes and included the log files. After reinstalling the environment and running the code again, I ran into the same error.

leilayesufu commented 9 months ago

Hi, did you specify your Python version? conda create -n ersilia python=3.7

boazleleina commented 9 months ago

Hi, did you specify your Python version? conda create -n ersilia python=3.7

Yes, I did. I also reinstalled the environment and started from scratch, but it still generated the same error.

boazleleina commented 9 months ago

I have been facing the issue of the error: TypeError: object of type 'NoneType' has no len()

Steps to recreate the error:

  1. conda create -n ersilia python=3.10
  2. conda activate ersilia
  3. python -m pip install isaura==0.1
  4. git clone https://github.com/ersilia-os/ersilia.git
  5. cd ersilia
  6. pip install -e .
  7. ersilia -v fetch eos3b5e
  8. ersilia serve eos3b5e

Up to this point, all the commands ran as expected without producing any errors; the main issue was with the command: ersilia -v run -i "CCCC"

Below is the log file of the error: runmodel.log

DhanshreeA commented 9 months ago

Hi @boazleleina, thanks for your efforts. It seems other users running Ersilia on Ubuntu within WSL are facing a similar issue. While Ersilia supports Python versions >=3.7, and it should work with a reasonably old version of conda, there may be some issues specific to WSL. Could you do the following and report your progress here?

  • Reinstall conda and reinstall Ersilia with Python 3.7
  • Try installing Ersilia with a version of Python greater than 3.7

boazleleina commented 9 months ago

Hi @boazleleina, thanks for your efforts. It seems other users running Ersilia on Ubuntu within WSL are facing a similar issue. While Ersilia supports Python versions >=3.7, and it should work with a reasonably old version of conda, there may be some issues specific to WSL. Could you do the following and report your progress here?

  • Reinstall conda and reinstall Ersilia with Python 3.7
  • Try installing Ersilia with a version of Python greater than 3.7

I retried the steps using Miniconda and Python 3.10, but the error persists. I first deleted everything and created a new environment with Miniconda before retrying, and I hit the same error.

boazleleina commented 9 months ago

The error seems to be persistent on WSL: TypeError: object of type 'NoneType' has no len()

I retried the steps mentioned above using Miniconda with both Python 3.10 and Python 3.7. Below is the log file from the run with Miniconda and Python 3.7: logfile.log

carcablop commented 9 months ago

Hi @boazleleina. Try uninstalling the isaura package and running the model again. It seems that isaura is causing that error.

boazleleina commented 9 months ago

After following instructions from @carcablop, the model ran successfully. Steps I took:

  1. Uninstalled Isaura: pip uninstall isaura==0.1

  2. Fetched the model: ersilia -v fetch eos3b5e

  3. Served the model: ersilia serve eos3b5e

  4. Ran the model: ersilia -v api run -i "CCCC"

The above steps worked on WSL with Python 3.7.

runmodelubuntu.log

boazleleina commented 9 months ago

Task 4: Write a Motivation Statement to work at Ersilia

MOTIVATION TO WORK AT ERSILIA

I am writing this to express my excitement about the opportunity to be a part of the Ersilia team. I believe there is a strong alignment between my journey in AI/ML and the objectives of Ersilia. I am a graduate software engineer and AI/ML practitioner with experience working with machine learning and deep learning models. Throughout my time in the field, I have been a staunch believer in AI for good, as I believe we can use this technology to bring positive change to society and move us toward a better future.

I have had the privilege of working on projects aligned with this goal. Working on a machine learning project to track the travel patterns of pastoralist communities and their animals in Northern Kenya was one of my most fulfilling experiences. This effort was essential in helping authorities plan and allocate resources, such as security and medication, along their routes. This first-hand encounter made clear to me the ability of technology to significantly improve people's lives.

Growing up in a marginalized community with limited resources, I have always believed in the responsibility we all share to support those in more vulnerable positions. This principle is at the core of Ersilia's mission to support research in Low-Income Countries (LICs). It deeply resonates with me, as I have seen the challenges faced by communities in such regions. I am genuinely excited about the prospect of contributing to research efforts that can improve healthcare and create a more equitable world. I have also witnessed the devastating impact of inadequately researched diseases like Rift Valley Fever on my own family, through the loss of my grandfather, and on the greater community. The lack of sufficient research and data on these diseases has left many vulnerable. I therefore wholeheartedly believe in the potential of the work being done by Ersilia to combat future outbreaks and save lives.

I am eagerly looking forward to the research that will be conducted, knowing that it can make a significant difference, not only now but for generations to come. If granted the opportunity to intern at Ersilia, I am fully committed to continuing the outstanding work of creating AI/ML models focused on disease research. I will bring to the table not only my technical skills but also a deep sense of purpose and dedication. I understand that the work done here has the power to transform lives and leave a lasting impact on society. I am determined to continue my career in artificial intelligence after the internship is over; my ultimate objective is to create a setting where AI brings people together to raise the standard of living. I am truly excited about the chance to work with like-minded people and make a real difference.

boazleleina commented 9 months ago

Task 5: Submit your first contribution to the Outreachy site

I made my first contribution to the Outreachy website and the contribution was recorded

boazleleina commented 9 months ago

Week 2 - Install and run an ML model

Task 1: Select a model from the suggested list

I selected Plasma Protein Binding (IDL-PPBopt)

Reason for choosing the model

The focus of this research is on understanding and manipulating how chemical compounds bind to human plasma proteins. This is a critical aspect of pharmacology and drug development because it affects the distribution and activity of drugs in the human body. In a world with so many drug options, each carrying its own risks and side effects, this research helps determine how each drug affects the body and guides the development of better drugs to combat diseases in the future.

The research employs deep learning techniques, a subset of machine learning, to accomplish the prediction and optimization goals, which is an interesting challenge for me. I look forward to working with the AttentiveFP algorithm, an attention-based model designed for the prediction of molecular properties, particularly in cheminformatics and drug discovery. It leverages a deep learning architecture to capture and analyze the structural information of chemical compounds in a way that is suitable for property prediction tasks. AttentiveFP is built on graph neural networks, which is quite interesting because chemical compounds can be represented as graphs, where atoms are nodes and chemical bonds are edges. This can be seen in some of the outputs in the notebook, which makes the model easier to understand and visualize.
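As a toy illustration of that graph view (hand-written for butane, "CCCC"; not RDKit or AttentiveFP output), atoms become nodes and bonds become edges:

```python
# Butane ("CCCC") as a tiny molecular graph: atoms are nodes, bonds are edges.
atoms = ["C", "C", "C", "C"]
bonds = [(0, 1), (1, 2), (2, 3)]

# Node degree = number of bonds per atom; the two terminal carbons have degree 1.
degree = [0] * len(atoms)
for a, b in bonds:
    degree[a] += 1
    degree[b] += 1

print(degree)  # [1, 2, 2, 1]
```

Graph neural networks like AttentiveFP pass information along exactly these edges to build up a representation of the whole molecule.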

boazleleina commented 9 months ago

Task 2: Install the model in your system

IDL-PPBOPT RUNNING STEPS🏃‍♂️💨

To clone this project onto my computer, I opened the location, which was my D:\ drive, in the Windows command prompt and ran: git clone https://github.com/Louchaofeng/IDL-PPBopt.git

The project has some package requirements that it needs to run, including: 📦

  • python 3.7
  • pytorch 1.5.0
  • openbabel 2.4.1
  • rdkit
  • scikit-learn
  • scipy
  • cairosvg
  • pandas
  • matplotlib

Because most of these requirements are older versions, I created a virtual environment using conda to run this project:

conda create --name IDLenv python=3.7
conda activate IDLenv

Installing Dependencies 🧩

  1. python 3.7 - Creating the conda environment running this Python version

  2. pytorch 1.5.0 - pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html

    • " I used the -f flag is used to specify the URL where the PyTorch wheel files for version 1.5.0 are hosted. I used this URL to download the appropriate wheel file for my Windows 10 system."
  3. openbabel 2.4.1 - pip install openbabel openbableerror.log

    • After still facing issues installing openbabel with pip, I ran the command below, which fixed the issue: conda install -c openbabel openbabel
    • While researching the openbabel issue, I found that running conda install -c conda-forge openbabel==2.4.1 is discouraged, as it forces updates of other modules in the environment, so it is safer to run the command above.
  4. rdkit - pip install rdkit

  5. scikit learn - pip install scikit-learn

  6. scipy - pip install scipy

  7. cairosvg - pip install cairosvg

  8. pandas - pip install pandas

  9. matplotlib - pip install matplotlib


Running our model 🔗

The model is located in the IDL-PPBopt.ipynb notebook, and this was the file I focused on. Running it produced the warning:

d:\ERSILIA MODEL\.conda\lib\site-packages\torch\random.py:42: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_numpy.cpp:77.) return default_generator.manual_seed(seed)

SOLVING CUDA ISSUE🔄

Steps I took to change this:🌓

  1. In the AttentiveFP directory, I made changes to the files:

    • AttentiveLayers_viz.py
    • AttentiveLayers.py
    • I removed all CUDA references in these files and changed them to CPU equivalents. For example, a call like x.cuda().sum() becomes x.sum(), so the tensors stay on the CPU.
  2. In the notebook, there are several references to the CUDA device; removing them enables the module to run on the CPU:

    • In the Related function cell, I made changes like: torch.cuda.LongTensor to torch.LongTensor
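The edits above were done by hand; a hypothetical sketch of the same mechanical replacement (the `to_cpu` helper and the sample line are my own illustration, not part of the repository):

```python
# Hypothetical helper mirroring the manual edits: retarget CUDA tensor
# constructors and drop .cuda() device moves so tensors stay on the CPU.
REPLACEMENTS = {
    "torch.cuda.LongTensor": "torch.LongTensor",
    ".cuda()": "",
}

def to_cpu(source: str) -> str:
    for old, new in REPLACEMENTS.items():
        source = source.replace(old, new)
    return source

line = "atom_index = torch.cuda.LongTensor(idx).cuda()"
print(to_cpu(line))  # atom_index = torch.LongTensor(idx)
```

A find-and-replace like this is brittle, of course; applying each substitution file by file and re-running the notebook, as described above, is what actually verified the change.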

Model Success🎉

The model successfully predicted the values:

predictedvalues.log

The graphs were generated by AttentiveFP, and I could see the structures of the different substructure molecules.

graphgenerated.log graphgeneratedimg.pdf

graph2generated.log graph2generatedimg.pdf

graph3generated.log graph3generatedimg.pdf

graph4generated.log graph4generatedimg.pdf

substructures.log

with open('Results.smi', 'w') as f:
    f.write('SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n')
    for i in range(len(r)):
        f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' +
                str(r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')

DhanshreeA commented 9 months ago

Hi @boazleleina thanks for the detailed updates! Could you comment on the difference between the model you ran locally and the one present within the hub? Specifically which substructures are different and/or missing between the two outputs?

boazleleina commented 9 months ago

Task 3: Run predictions for the EML


eml_canonical_output.log

eml_canonical_predictions.csv


Explanation of the outcome: The model predicts the likelihood of various drugs binding to proteins in human blood plasma using a deep learning model. The predicted values range between 0 and 1, with values closer to 1 indicating a higher probability of the drug binding to plasma proteins and values closer to 0 indicating a lower probability.

The model is a neural network used for a regression task, specifically AttentiveFP, which is built on a graph neural network with feedforward propagation and uses several activation functions, including linear, LeakyReLU, ReLU, and softmax.
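As a toy reading of such outputs (the drug names and values below are made up for illustration, not taken from the prediction file, and the 0.5 cutoff is arbitrary):

```python
# Made-up examples: PPB predictions near 1 mean strong plasma-protein
# binding, near 0 mean weak binding; 0.5 is an illustrative cutoff only.
predictions = {"drug_a": 0.92, "drug_b": 0.05}

for drug, p in predictions.items():
    label = "likely binds" if p >= 0.5 else "unlikely to bind"
    print(f"{drug}: {p:.2f} -> {label}")
```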

boazleleina commented 9 months ago

Install and run Docker!

pullppbmodel.log

After executing the docker container and opening the interactive environment, I ran predictions for the eml_canonical dataset successfully:

ersiliapredictionlog.log

key | input
--- | ---
FAQLAUHZSGTTLN-UHFFFAOYSA-N | [CaH2]
KRHYYFGTRYWZRS-UHFFFAOYSA-M | [F-]
ZCYVEMRRCGMTRW-UHFFFAOYSA-N | [I]
XLYOFNOQVPJJNP-UHFFFAOYSA-N | O
WCUXLLCKKVVCTQ-UHFFFAOYSA-M | [Cl-].[K+]
NLKNQRATVPKPDG-UHFFFAOYSA-M | [K+].[I-]
RWSOTUBLDIXVET-UHFFFAOYSA-N | S
FJKGRAZQBBWYLG-UHFFFAOYSA-M | N.N.[F-].[Ag+]
FAPWRFPIFSIZLT-UHFFFAOYSA-M | [Na+].[Cl-]

DhanshreeA commented 9 months ago

Great job @boazleleina thank you for the updates!

boazleleina commented 9 months ago

Week 3 - Propose new models


Task 1 - Suggest a new model and document it (1)

Model Name: Controlled peptide generation
Model Link 🔗: https://github.com/IBM/controlled-peptide-generation/tree/master
Model License 📑: Apache-2.0 license

This model uses the PyTorch framework and is derived from the research paper “Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics” by Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Sebastian Gehrmann, Flaviu Cipcigan, Vijil Chenthamarakshan, Hendrik Strobelt, Cicero dos Santos, Pin-Yu Chen, Yi Yan Yang, Jeremy Tan, James Hedrick, Jason Crain, and Aleksandra Mojsilovic, submitted on 22 May 2020 and revised on 26 Feb 2021 in the Machine Learning section of arxiv.org.

Paper Link🔗: https://arxiv.org/abs/2005.11248


Overview


Why the model is relevant to Ersilia

  1. With Ersilia’s goal of equipping laboratories in Low- and Middle-Income Countries with state-of-the-art AI/ML tools for infectious and neglected disease research, this model will enable laboratories in low-income areas to work on drug creation at low cost. Combining computational techniques with experimental validation can potentially accelerate the discovery and development of new drugs while significantly reducing cost.
  2. De novo therapeutic design allows for the creation of highly specific and customized therapies. Instead of modifying existing drugs, researchers can design molecules with precisely tailored properties to target specific diseases or conditions, e.g., diseases affecting their specific countries. This can lead to more effective treatments with fewer side effects.
  3. With the growing concern over drug-resistant diseases and the need for innovative solutions to address emerging health challenges (such as COVID-19 and other infectious diseases), research that explores novel approaches to drug design becomes increasingly relevant.
  4. The discovery of molecules that are effective against multidrug-resistant bacteria is particularly important. Multidrug-resistant pathogens pose a significant threat to public health, and finding new treatments for these infections is of utmost importance. This will allow labs that Ersilia works with to be at the forefront of this innovative drug-creation method.

Working with the model

boazleleina commented 9 months ago

Task 2 - Suggest a new model and document it (2)


Model name: PTransIPs
Model Link 🔗: https://github.com/StatXzy7/PTransIPs/tree/main
Model License 📑: None

This model, built on the PyTorch framework, is derived from the research paper “PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer” by Ziyang Xu and Haitian Zhong, submitted on 8 Aug 2023 and revised on 18 Aug 2023.

Paper Link 🔗: https://arxiv.org/abs/2308.05115


Overview

Why the model is relevant to Ersilia

  1. Identifying phosphorylation sites can lead to the discovery of new therapeutic targets for diseases. This knowledge can be used to develop potential drug targets.
  2. A deep understanding of phosphorylation has significant value in biomedical research, as it plays a crucial role in cellular processes and disease development.
  3. Pharmaceutical companies and researchers working on drug discovery, especially for diseases like COVID-19, can use the model’s findings to explore new drug targets related to phosphorylation.

Working with the model

boazleleina commented 9 months ago

Task 3 - Suggest a new model and document it (3)


Model name: Multitask-toxicity
Model Link 🔗: https://github.com/IBM/multitask-toxicity#accurate--clinical-toxicity-prediction-using-multi-task-deep-neural-nets-and-contrastive-molecular-explanations
Model License 📑: Apache-2.0 license

This model uses the PyTorch framework and is derived from the research paper “Accurate Clinical Toxicity Prediction using Multi-task Deep Neural Nets and Contrastive Molecular Explanations” by Bhanushee Sharma, Vijil Chenthamarakshan, Amit Dhurandhar, Shiranee Pereira, James A. Hendler, Jonathan S. Dordick, and Payel Das.

Paper Link🔗: https://arxiv.org/abs/2204.06614


Overview

Why the model is relevant to Ersilia

  1. Using predictive models to assess toxicity can reduce the reliance on animal and clinical testing. This is ethically important as it minimizes harm to animals and reduces the need for human clinical trials, making the drug development process more humane.
  2. Early toxicity prediction helps in reducing the cost of drug development. Clinical trials are expensive, and identifying toxicity issues before reaching this stage can lead to significant cost savings.
  3. The use of machine learning models provides a more efficient and rapid way to screen potential drug candidates. It accelerates the process of identifying compounds that are likely to be safe and effective.

Working with the model

DhanshreeA commented 9 months ago

Task 2 - Suggest a new model and document it (2)

Model name: PTransIPs
Model Link 🔗: https://github.com/StatXzy7/PTransIPs/tree/main
Model License 📑: None

This model, built on the Pytorch framework, is generated from the research paper “PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer” by Ziyang Xu, Haitian Zhong submitted on 8th Aug 2023 and revised on 18th Aug 2023

Paper Link 🔗: https://arxiv.org/abs/2308.05115

Overview

* PTransIPs focuses on identifying certain chemical changes in proteins. These chemical changes, called phosphorylation, play a big role in how our cells work and can affect various diseases.

* Finding and understanding these changes is important because it can help us learn more about how cells function and how diseases develop. It might even help us discover new ways to treat diseases

* The PTransIPs model works by looking at the building blocks of proteins (amino acids) and their arrangement. It also uses data from other big computer models that know a lot about proteins. All this information helps PTransIPs figure out where the phosphorylation happens in the protein.

* The research paper mentions examples of previous models like:

  * MusiteDeep2017, which employs a CNN model for the identification of kinase-specific phosphorylation sites . Following this, there was the introduction of DeepPhos, a model that also utilizes CNNs for identifying general phosphorylation sites. And the development of MusiteDeep2020, which utilizes a CapsNet for the identification of phosphorylated S/T and Y sites. Another model that is mentioned is, DeepIPs, a model that incorporates both CNN and LSTM for the identification of phosphorylated S/T and Y sites.

* However, the research paper takes care to note that these models are limited, “Due to sample size limitations. Considering that phosphorylation is a post-translational modification process on protein molecules, models learned on limited samples may not fully capture the characteristics of proteins, resulting in poor extrapolation ability. On the other hand, this can also lead to some common problems such as overfitting, thus failing to acquire the essential features for site identification.” (p1)

* The model works by treating amino acids within protein sequences as words, extracting unique encodings based on the types along with the position of amino acids in the sequence. It also incorporates embeddings from large pre-trained protein models as additional data inputs.

* PTransIPS is further trained on a combination model of convolutional neural network (CNN) with residual connections and a Transformer model equipped with multi-head attention mechanisms. At last, the model outputs classification results through a fully connected layer.

Why the model is relevant to Ersilia

1. Identifying phosphorylation sites can lead to the discovery of new therapeutic targets for diseases. This knowledge can be used to develop potential drug targets.

2. A deep understanding of phosphorylation has significant value in biomedical research, as it plays a crucial role in cellular processes and disease development.

3. Pharmaceutical companies and researchers working on drug discovery, especially for diseases like COVID-19, can use the model’s findings to explore new drug targets related to phosphorylation.

Working with the model

* The model’s data is provided in the **_‘data’_** folder in the GitHub repository.

* The model works by taking the sequence pre-trained embedding as input which is provided or that can be generated by running the provided `python model_train_test/pretrained_embedding_generate.py` file.

* The model checkpoints are provided and you can download the already generated model provided [here](https://onedrive.live.com/?authkey=%21ABQMKMz0oWnPnQ4&id=6F7A588E449ED6AC%21915&cid=6F7A588E449ED6AC)

* We are also provided with the option to generate the model again by running `./model_train_test/train.py` ,to train the PTransIPs model in `./model_train_test/PTransIPs_model.py`

* The current pre-trained model achieved AUROCs of _0.9232_ and _0.9660_ for identifying phosphorylated S/T and Y sites respectively.

* Running `./model_train_test/umap_test.py` will generate umap visualization figures. We can visualize data that we provide to the pre-trained model or the already existing data.

Hi @boazleleina, thank you for the very interesting paper suggestion! It seems from the literature that the model should work with simple SMILES strings, but upon inspection of the code, I could not identify what exactly the inputs and outputs for this model would be. Of course, I might have missed something, so I'm curious to know more about your understanding of the code. Thank you very much!

boazleleina commented 9 months ago

Hi @boazleleina thank you for very interesting paper suggestion! Although it seems from the literature that the model should work with simple SMILES strings, but upon inspection of the code, I could not identify what exactly would be the inputs and outputs for this model. Of course I might have missed something, so I'm curious to know more about your understanding of the code. Thank you very much!

Thank you for your response @DhanshreeA. From my understanding of the paper and code, the model is trained using labeled datasets of protein sequences, provided as strings, in which the locations of specific amino acids are marked. For this model they used Y sites and S/T sites, which are among the 20 standard amino acids commonly found in proteins. The sequence embeddings come from ProtTrans, which provides state-of-the-art pre-trained models for proteins that, according to its page, "were trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformers Models"; they provide a link to where they found this data here. The output of the model is the probability that amino acids at specific locations in a protein sequence are phosphorylation sites. The dataset is provided, but we can also generate it from the original location here. The model currently works with protein sequences as strings, but it is open to editing to find out whether it can work with other string inputs such as SMILES. Please let me know if this is within scope and if I need to clarify anything. Thank you.
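To make the input format concrete, here is a small sketch of the idea of treating amino acids in a sequence as word-like tokens (my own illustration of the concept, not PTransIPs' actual preprocessing code):

```python
# The 20 standard amino acids, one letter each; a protein sequence string
# is tokenized character by character, like words in a sentence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence):
    return [TOKEN_ID[aa] for aa in sequence]

# S, T and Y are the residues whose phosphorylation the model predicts.
print(encode("STY"))  # [15, 16, 19]
```

In the real pipeline these per-residue tokens are combined with the ProtTrans embeddings before being fed to the CNN/Transformer stack.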

kbatya commented 9 months ago

Task 2: Install the model in your system

IDL-PPBOPT RUNNING STEPS🏃‍♂️💨

To clone this project onto my computer, I opened the target location (my D:\ drive) in the Windows command prompt and ran: git clone https://github.com/Louchaofeng/IDL-PPBopt.git

The project has some package requirements that it needs to run including: 📦

  • python 3.7
  • pytorch 1.5.0
  • openbabel 2.4.1
  • rdkit
  • scikit learn
  • scipy
  • cairosvg
  • pandas
  • matplotlib
  • sklearn

Because most of the requirements are older versions of these packages, I created a conda virtual environment to run this project:

conda create --name IDLenv python=3.7
conda activate IDLenv

Installing Dependencies 🧩

  1. python 3.7 - Creating the conda environment running this Python version
  2. pytorch 1.5.0 - pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html

    • I used the -f flag to specify the URL where the PyTorch wheel files for version 1.5.0 are hosted, and used that URL to download the appropriate wheel file for my Windows 10 system.
  3. openbabel 2.4.1 - pip install openbabel openbableerror.log

    • After still facing issues installing openbabel with pip, I ran the command below, which fixed the issue: conda install -c openbabel openbabel
    • While researching the openbabel issue I found that running conda install -c conda-forge openbabel==2.4.1 is discouraged, as it forces updates of other modules in the environment; it is therefore safer to run the command above.
  4. rdkit - pip install rdkit
  5. scikit learn - pip install scikit-learn
  6. scipy - pip install scipy
  7. cairosvg - pip install cairosvg
  8. pandas - pip install pandas
  9. matplotlib - pip install matplotlib
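Before opening the notebook, it can help to confirm every dependency is importable in the IDLenv environment. A small sketch (the import names below are my assumptions; e.g. scikit-learn is imported as sklearn):

```python
# Sketch: list which required modules are missing from the active
# environment, without actually importing the heavy packages.
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found on this interpreter."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed import names for the packages installed above.
REQUIRED = ["torch", "openbabel", "rdkit", "sklearn", "scipy",
            "cairosvg", "pandas", "matplotlib", "numpy"]

if __name__ == "__main__":
    gaps = missing_modules(REQUIRED)
    print("Missing:", ", ".join(gaps) if gaps else "none")
```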

Running our model 🔗

The model was located on the IDL-PPBopt.ipynb notebook and this was the file I focused on:

  • The first cell raised an ipykernel package problem; I found a workaround by running: conda activate "D:\ERSILIA MODEL\.conda" followed by conda install -p "D:\ERSILIA MODEL\.conda" ipykernel --update-deps --force-reinstall

    • This command ensured that ipykernel was installed and that its dependencies were updated or reinstalled if necessary.
  • I also ran into ModuleNotFoundError: No module named 'torch'; to fix this I ran: pip install torch
  • Another error that was generated was

d:\ERSILIA MODEL.conda\lib\site-packages\torch\random.py:42: UserWarning: Failed to initialize NumPy: numpy.core.multiarray failed to import (Triggered internally at C:\actions- runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_numpy.cpp:77.) return default_generator.manual_seed(seed)

  • I realized that this was because I had not installed numpy, so I ran: pip install numpy
  • I also ran into a "setting module not found" error in the first cell of IDL-PPBopt.ipynb, which contains our model:
    • Commenting out the line of code importing the setting module, and the place where it is called, solved the issue, as the module is only used to raise warnings and is therefore not critical
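Instead of commenting the import out, an alternative that keeps the notebook intact is to make the import optional. This is my own suggestion, not something from the IDL-PPBopt repo:

```python
# Sketch: import a module if available, otherwise return None, so an
# optional dependency (like the notebook's warning-only 'setting'
# module) does not crash the first cell.
import importlib

def optional_import(name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

setting = optional_import("setting")
# Downstream code would then guard its use: if setting is not None: ...
```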

SOLVING CUDA ISSUE🔄

  • The biggest challenge was changing the device from 'cuda', which was the default, to 'cpu', which my computer runs on, and converting all references to avoid runtime errors.

Steps I took to change this:🌓

  1. In the AttentiveFP directory, I made changes to the files:

    • AttentiveLayers_viz.py
    • AttentiveLayers.py
  • I removed all cuda references in these files and changed them to their CPU equivalents. Example: torch.cuda.FloatTensor becomes torch.FloatTensor, and .cuda() calls are dropped
  2. In the notebook, there are several references to the cuda device; removing them enables the model to run on the CPU:
  • In the Related function cell, I made changes like: torch.cuda.LongTensor to torch.LongTensor
  • In the Load the model function cell:

    • I commented out the .cuda() calls, e.g. # model.cuda()
    • I also mapped the load location to the CPU by editing the line of code: best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')
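The edits above hard-code 'cpu'; a slightly more general pattern selects the device at runtime so the same notebook works with or without a GPU. The helper below is plain Python, and the torch calls it would feed are shown in comments (assuming the standard torch.cuda.is_available() API):

```python
# Sketch: pick the map_location string for torch.load() from a flag.
# In the notebook the flag would be torch.cuda.is_available(); it is a
# plain argument here so the logic can be shown without torch installed.

def pick_map_location(cuda_available: bool) -> str:
    """Device string for torch.load(map_location=...) / torch.device(...)."""
    return "cuda" if cuda_available else "cpu"

# Intended use in the notebook:
#   device = torch.device(pick_map_location(torch.cuda.is_available()))
#   best_model = torch.load(model_path, map_location=device)
#   model.to(device)  # replaces the unconditional model.cuda()
```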

Model Success🎉

The model successfully predicted the values:

predictedvalues.log

The graphs were generated by AttentiveFP and I could see the structure of the different substructure molecules

graphgenerated.log graphgeneratedimg.pdf

graph2generated.log graph2generatedimg.pdf

graph3generated.log graph3generatedimg.pdf

graph4generated.log graph4generatedimg.pdf

  • The model I ran locally found 8 second-level substructures, compared to the 10 second-level substructures shown on the GitHub page

substructures.log

  • The final cell also failed to run, but it was a simple syntax error: the parentheses of the write statement were placed incorrectly. I edited this and the code ran correctly, producing the expected “Results.smi” file Results_smi.log

with open('Results.smi', 'w') as f:
    f.write('SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n')
    for i in range(len(r)):
        f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' +
                str(r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')
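Since Results.smi is a tab-separated file with the header written above, it can be inspected programmatically with the csv module. A small sketch with made-up values (only the header matches the real file):

```python
# Sketch: parse a Results.smi-style TSV. The data row is invented;
# only the header line mirrors the write statement above.
import csv
import io

sample = ("SA_Fragment\tNAS\tScore\tRES\tCES\tZES\tNTS\n"
          "c1ccccc1\t3\t0.87\t0.1\t0.2\t0.3\t0.4\n")

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(rows[0]["SA_Fragment"], rows[0]["Score"])
```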

Thank you @boazleleina for your good work installing and running the IDL-PPBopt model; it helped me a lot and I really appreciate it!

boazleleina commented 9 months ago


You're very welcome, I am glad I could be of help😁

luiscamachocaballero commented 9 months ago

Hi @boazleleina! I followed your steps to overcome the CUDA problem, but I still keep having issues; I think a small thing is missing, and I'd appreciate your help. Below is the output error when I run the IDL-PPBopt.ipynb file:

AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_868072/887561112.py in <module>
     14 loss_function = nn.MSELoss()
     15 model = Fingerprint(radius, T, num_atom_features, num_bond_features,
---> 16             fingerprint_dim, output_units_num, p_dropout)
     17 #model.cuda()
     18 

~/IDL-PPBopt/Code/AttentiveFP/AttentiveLayers.py in __init__(self, radius, T, input_feature_dim, input_bond_dim, fingerprint_dim, output_units_num, p_dropout)
     10         super(Fingerprint, self).__init__()
     11         # graph attention for atom embedding
---> 12         self.atom_fc = nn.Linear(input_feature_dim, fingerprint_dim)
     13         self.neighbor_fc = nn.Linear(input_feature_dim+input_bond_dim, fingerprint_dim)
     14         self.GRUCell = nn.ModuleList([nn.GRUCell(fingerprint_dim, fingerprint_dim) for r in range(radius)])

~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/nn/modules/linear.py in __init__(self, in_features, out_features, bias)
     70         self.in_features = in_features
     71         self.out_features = out_features
---> 72         self.weight = Parameter(torch.Tensor(out_features, in_features))
     73         if bias:
     74             self.bias = Parameter(torch.Tensor(out_features))

~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/cuda/__init__.py in _lazy_init()
    147             raise RuntimeError(
    148                 "Cannot re-initialize CUDA in forked subprocess. " + msg)
--> 149         _check_driver()
    150         if _cudart is None:
    151             raise AssertionError(

~/miniconda3/envs/PPBenv/lib/python3.7/site-packages/torch/cuda/__init__.py in _check_driver()
     52 Found no NVIDIA driver on your system. Please check that you
     53 have an NVIDIA GPU and installed a driver from
---> 54 http://www.nvidia.com/Download/index.aspx""")
     55         else:
     56             # TODO: directly link to the alternative bin that needs install

AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
boazleleina commented 9 months ago

From the error I can see here @luiscamachocaballero, it seems the code is still trying to run on CUDA. Did you make the changes to the AttentiveFP files mentioned in my issue? Also, after commenting out model.cuda(), please replace it with model.cpu(). I have included a snapshot of the lines that seem to be giving you the error. Try copying my edited code into your file and see if it solves the issue. If all the cells before that ran without errors, then the edit should work. Feel free to reach out in case of any more errors.

loss_function = nn.MSELoss()
model = Fingerprint(radius, T, num_atom_features, num_bond_features,
                    fingerprint_dim, output_units_num, p_dropout)
model.cpu()

best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_' + '54' + '.pt',
                        map_location=torch.device('cpu'))

best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)

GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!