DeepRank / DeepRank-GNN-esm

Graph Network for protein-protein interface including language model features
Apache License 2.0

:bell: Archiving Note

Since DeepRank-GNN is no longer in active development, we migrated our DeepRank-GNN-esm version to our new repo at haddocking/DeepRank-GNN-esm.

For details, refer to our publication "DeepRank-GNN-esm: a graph neural network for scoring protein–protein models using protein language model" at https://academic.oup.com/bioinformaticsadvances/article/4/1/vbad191/7511844

:snowflake: This repository is now frozen. :snowflake:

DeepRank-GNN-esm

Graph Network for protein-protein interface including language model features

Installation

With Anaconda

  1. Clone the repository

    git clone https://github.com/DeepRank/DeepRank-GNN-esm.git
    cd DeepRank-GNN-esm
  2. Install either the CPU or GPU version of DeepRank-GNN-esm

    conda env create -f environment-cpu.yml && conda activate deeprank-gnn-esm-cpu-env

    OR

    conda env create -f environment-gpu.yml && conda activate deeprank-gnn-esm-gpu-env
  3. Install the command line tool

    pip install .
  4. Run the tests to make sure everything is working

    pytest tests/

Usage

As a scoring function

We provide a command-line interface for DeepRank-GNN-esm that scores protein-protein complexes. It takes a PDB file and the IDs of the two chains that form the interface:

usage: deeprank-gnn-esm-predict [-h] pdb_file chain_id_1 chain_id_2

positional arguments:
  pdb_file    Path to the PDB file.
  chain_id_1  First chain ID.
  chain_id_2  Second chain ID.

optional arguments:
  -h, --help  show this help message and exit
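The usage string above maps directly onto an argument list, so the tool can also be driven from a script. The following is a minimal sketch, not part of the package: `build_predict_command` and `score_all` are hypothetical helper names, and the sketch assumes that every PDB file in a folder uses the same two chain IDs. By default it only builds the commands (`dry_run=True`) so that they can be inspected before anything is run.

```python
import subprocess
from pathlib import Path


def build_predict_command(pdb_file, chain_id_1, chain_id_2):
    """Return the argument list for one deeprank-gnn-esm-predict call."""
    # Positional order follows the usage string: pdb_file chain_id_1 chain_id_2
    return ["deeprank-gnn-esm-predict", str(pdb_file), chain_id_1, chain_id_2]


def score_all(pdb_dir, chain_id_1, chain_id_2, dry_run=True):
    """Build (and optionally run) one command per *.pdb file in pdb_dir.

    Assumption: all files in pdb_dir share the same two chain IDs.
    """
    commands = [
        build_predict_command(pdb, chain_id_1, chain_id_2)
        for pdb in sorted(Path(pdb_dir).glob("*.pdb"))
    ]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the tool exits non-zero.
            subprocess.run(command, check=True)
    return commands
```

Calling `score_all("models/", "A", "B", dry_run=False)` with the conda environment activated would then score every model in a (hypothetical) `models/` folder, one after the other.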

Example: score the 1B6C complex

# download it
$ wget https://files.rcsb.org/view/1B6C.pdb -q

# make sure the environment is activated
$ conda activate deeprank-gnn-esm-gpu-env
(deeprank-gnn-esm-gpu-env) $ deeprank-gnn-esm-predict 1B6C.pdb A B
 2023-06-28 06:08:21,889 predict:64 INFO - Setting up workspace - /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred_A_B
 2023-06-28 06:08:21,945 predict:72 INFO - Renumbering PDB file.
 2023-06-28 06:08:22,294 predict:104 INFO - Reading sequence of PDB 1B6C.pdb
 2023-06-28 06:08:22,423 predict:131 INFO - Generating embedding for protein sequence.
 2023-06-28 06:08:22,423 predict:132 INFO - ################################################################################
 2023-06-28 06:08:32,447 predict:138 INFO - Transferred model to GPU
 2023-06-28 06:08:32,450 predict:147 INFO - Read /home/1B6C-gnn_esm_pred_A_B/all.fasta with 2 sequences
 2023-06-28 06:08:32,459 predict:157 INFO - Processing 1 of 1 batches (2 sequences)
 2023-06-28 06:08:36,462 predict:200 INFO - ################################################################################
 2023-06-28 06:08:36,470 predict:205 INFO - Generating graph, using 79 processors
 Graphs added to the HDF5 file
 Embedding added to the /home/1B6C-gnn_esm_pred_A_B/graph.hdf5 file file
 2023-06-28 06:09:03,345 predict:220 INFO - Graph file generated: /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred_A_B/graph.hdf5
 2023-06-28 06:09:03,345 predict:226 INFO - Predicting fnat of protein complex.
 2023-06-28 06:09:03,345 predict:234 INFO - Using device: cuda:0
 # ...
 2023-06-28 06:09:07,794 predict:280 INFO - Predicted fnat for 1B6C between chainA and chainB: 0.359
 2023-06-28 06:09:07,803 predict:290 INFO - Output written to /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/GNN_esm_prediction.csv

The output above shows that the predicted fnat for the 1B6C complex between chain A and chain B is 0.359. This value is also written to the GNN_esm_prediction.csv file.

The command above creates a folder in the current working directory with the following contents:

1B6C-gnn_esm_pred_A_B
├── 1B6C.pdb                   # input PDB file
├── all.fasta                  # FASTA sequences for the PDB input
├── 1B6C.A.pt                  # esm-2 embedding for chain A of 1B6C
├── 1B6C.B.pt                  # esm-2 embedding for chain B of 1B6C
├── graph.hdf5                 # input protein graph in HDF5 format
├── GNN_esm_prediction.hdf5    # prediction output in HDF5 format
└── GNN_esm_prediction.csv     # prediction output in CSV format
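When scoring many complexes, it can help to check that a run produced every file in the listing above. The sketch below is an illustration only: `expected_outputs` and `missing_outputs` are hypothetical helpers, and the folder and file naming pattern is an assumption generalised from the single 1B6C example, so adjust it if your runs are named differently.

```python
from pathlib import Path


def expected_outputs(pdb_file, chain_id_1, chain_id_2):
    """List the files from the tree above for one scored complex.

    Assumption: names follow the 1B6C example, i.e. the folder is
    <stem>-gnn_esm_pred_<chain1>_<chain2> and embeddings are <stem>.<chain>.pt
    """
    stem = Path(pdb_file).stem  # "1B6C" for "1B6C.pdb"
    workspace = Path(f"{stem}-gnn_esm_pred_{chain_id_1}_{chain_id_2}")
    names = [
        f"{stem}.pdb",
        "all.fasta",
        f"{stem}.{chain_id_1}.pt",
        f"{stem}.{chain_id_2}.pt",
        "graph.hdf5",
        "GNN_esm_prediction.hdf5",
        "GNN_esm_prediction.csv",
    ]
    return [workspace / name for name in names]


def missing_outputs(pdb_file, chain_id_1, chain_id_2, base_dir="."):
    """Return the expected files that do not exist under base_dir."""
    base = Path(base_dir)
    return [
        path
        for path in expected_outputs(pdb_file, chain_id_1, chain_id_2)
        if not (base / path).is_file()
    ]
```

An empty list from `missing_outputs("1B6C.pdb", "A", "B")` means that all seven files are present in the current working directory's workspace folder.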

As a framework

Generate esm-2 embeddings for your protein

  1. Generate FASTA sequences in bulk with the script 'get_fasta.py':

    usage: get_fasta.py [-h] pdb_dir output_fasta_name
    
    positional arguments:
      pdb_dir            Path to the directory containing PDB files
      output_fasta_name  Name of the combined output FASTA file
    
    options:
      -h, --help         show this help message and exit
  2. Generate embeddings in bulk from the combined FASTA file with the extract.py script provided in the esm-2 package:

    $ python esm_2_installation_location/scripts/extract.py \
        esm2_t33_650M_UR50D \
        all.fasta \
        tests/data/embedding/1ATN/ \
        --repr_layers 0 32 33 \
        --include mean per_tok

    Replace 'esm_2_installation_location' with the path to your esm-2 installation, 'all.fasta' with the FASTA file generated above, and 'tests/data/embedding/1ATN/' with the output folder for the esm embeddings.
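After the two steps above, a quick consistency check can confirm that every sequence in the combined FASTA file received an embedding. This is a minimal sketch with hypothetical helper names (`fasta_labels`, `labels_without_embedding`). It assumes that the extraction step writes one `<header>.pt` file per FASTA record, which matches the `1B6C.A.pt` and `1B6C.B.pt` files shown earlier but should be verified against your own output folder.

```python
from pathlib import Path


def fasta_labels(fasta_path):
    """Return the header text (without '>') of every record in a FASTA file."""
    labels = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                labels.append(line[1:].strip())
    return labels


def labels_without_embedding(fasta_path, embedding_dir):
    """Return FASTA labels that have no matching <label>.pt file.

    Assumption: embeddings are stored as <label>.pt directly in embedding_dir.
    """
    folder = Path(embedding_dir)
    return [
        label
        for label in fasta_labels(fasta_path)
        if not (folder / f"{label}.pt").is_file()
    ]
```

For the command shown in step 2, `labels_without_embedding("all.fasta", "tests/data/embedding/1ATN/")` would list any sequences that still need to be embedded.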

Generate graph

Use pre-trained models to predict