JocelynSong / SurfPro

SurfPro: Functional Protein Design Based on Continuous Surface

Model Architecture

This repository contains the code and data for the ICML 2024 paper SurfPro: Functional Protein Design Based on Continuous Surface.

The overall model architecture is shown below:

[Figure: SurfPro model architecture]

Environment

The dependencies can be set up using the following commands:

conda create -n surfpro python=3.8 -y 
conda activate surfpro 
conda install pytorch=1.10.2 cudatoolkit=11.3 -c pytorch -y 
bash setup.sh 

Download Data

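We provide all surface data for CATH 4.2, the binder design task, and the enzyme design task at SurfPro data. The commands below use Google Drive share links; if wget saves an HTML page rather than the data, download the files from those links in a browser instead.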

mkdir data 
cd data 
mkdir cath42 && cd cath42
wget "https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link"
cd .. && mkdir binder_design && cd binder_design
wget "https://drive.google.com/drive/folders/1S7fg-XWBSy6-Pq7bSG_IrlLgLi3ESoX3?usp=drive_link"
cd .. && mkdir enzyme_design && cd enzyme_design
wget "https://drive.google.com/drive/folders/13EpZ1u7l28W0aR2WfXhIBooK5LZpXqTU?usp=drive_link"

Prepare Surface for your own model

We provide surface data for all three tasks at SurfPro data.

To generate surfaces from your own PDB files, use preprocess/prepare_surface.py.

First apply the MSMS tool to your structures to generate the corresponding vert files. Then provide the fasta file together with the vert files to prepare the surfaces. To run the code:

python preprocess/prepare_surface.py --data_path fasta_file_path --split train --output_path output_data_path

By default, the vert files are placed in the fasta_file_path/msms directory.
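
Before running the script, it can help to confirm that every sequence in the fasta file has a matching vert file. The sketch below is not part of the repository. It assumes each vert file is named after the first token of its fasta header (`<id>.vert`), which is a guess you should adapt to your own MSMS output naming; the helper names `fasta_ids` and `missing_vert_files` are hypothetical.

```python
from pathlib import Path


def fasta_ids(fasta_path):
    """Return the first whitespace-delimited token of every fasta header."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.append(header.split()[0])
    return ids


def missing_vert_files(fasta_path, msms_dir):
    """List IDs that lack msms_dir/<id>.vert (assumed naming convention)."""
    msms_dir = Path(msms_dir)
    return [
        pid for pid in fasta_ids(fasta_path)
        if not (msms_dir / f"{pid}.vert").is_file()
    ]
```

An empty list from `missing_vert_files` means every header has a vert file under the assumed naming; any returned IDs still need an MSMS run.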

Inverse Folding Task Training

First download the corresponding data and decompress it:

mkdir cath42 && cd cath42
wget "https://drive.google.com/file/d/1_IUTRpQtQpoPzxUDD7cTzUn150hViN5h/view?usp=drive_link"
tar -xvzf octree_aa_surf_5k_sorted.tar.gz

Then train the model:

bash train_suface_inverse_folding.sh

Binder Design Training

First download the corresponding data:

mkdir binder_design && cd binder_design
wget "https://drive.google.com/drive/folders/1S7fg-XWBSy6-Pq7bSG_IrlLgLi3ESoX3?usp=drive_link"

Then decompress the target-binder data, which is required for pAE_interaction evaluation:

cd binder_design/Binder_Design_Data
tar -xvzf binder.pkl.tar.gz

Then train the model. Assuming the checkpoint from the inverse folding task is at cath_model_path/checkpoint_best.pt, run the training script:

bash binder_design_finetune.sh

Enzyme Design Training

First download the corresponding data:

mkdir enzyme_design && cd enzyme_design
wget "https://drive.google.com/drive/folders/13EpZ1u7l28W0aR2WfXhIBooK5LZpXqTU?usp=drive_link"

Then train the model. Assuming the checkpoint from the inverse folding task is at cath_model_path/checkpoint_best.pt, run the training script:

bash binder_design_finetune.sh

Inference

To generate protein sequences for CATH 4.2, design binders, or design enzymes, run the corresponding script:

bash generation_cath42.sh
bash generate_binder.sh
bash generate_enzyme.sh

The output directory contains two files:

  1. protein.txt contains the designed protein sequences
  2. src.seq.txt contains the ground-truth sequences
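
To inspect the results, the two files can be read side by side. The following is a minimal sketch, not code from the repository. It assumes both files hold one sequence per line and that line i of protein.txt corresponds to line i of src.seq.txt; check this against your own output before relying on it. The helper name `load_pairs` is hypothetical.

```python
from pathlib import Path


def load_pairs(output_dir):
    """Pair ground-truth and designed sequences by line index (assumed layout)."""
    out = Path(output_dir)
    designed = [s.strip() for s in (out / "protein.txt").read_text().splitlines()]
    reference = [s.strip() for s in (out / "src.seq.txt").read_text().splitlines()]
    if len(designed) != len(reference):
        raise ValueError(
            f"line count mismatch: {len(designed)} designed vs "
            f"{len(reference)} reference sequences"
        )
    # Each item is (ground_truth, designed).
    return list(zip(reference, designed))
```

The length check is there because a silent mismatch would pair every later design with the wrong ground-truth sequence.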

Evaluation

Inverse Folding Task Evaluation

We provide a script that calculates the recovery rate after pairwise alignment at evaluation/amino_acid_recovary_rate.py.

You need to provide the source sequence file and the designed sequence file.
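
For intuition, recovery after pairwise alignment can be sketched as follows. This is an illustration only and not the repository's script: it uses a plain Needleman-Wunsch global alignment with assumed scores (match +1, mismatch -1, gap -1) and assumes recovery is the number of identical aligned residues divided by the source sequence length. The provided script may use a different aligner, scoring scheme, or normalization.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def recovery_rate(source, designed):
    """Identical aligned residues divided by source length (assumed definition)."""
    if not source:
        return 0.0
    aligned_src, aligned_des = global_align(source, designed)
    matches = sum(1 for x, y in zip(aligned_src, aligned_des)
                  if x == y and x != "-")
    return matches / len(source)
```

Aligning first matters when the designed and source sequences differ in length, since a position-by-position comparison would penalize every residue after an insertion or deletion.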

Binder Design Task Evaluation

To generate the superimposed structure files of the designed binders and target proteins, use evaluation/super_impose.py.

Then apply the scripts provided at dl_binder_design to calculate pAE_interaction scores.

Enzyme Design Task Evaluation

We provide the ESP evaluation data at ESP_data_eval.

Each test case for ESP evaluation has the format (Protein_Sequence Substrate_Representation).

The evaluation code for the ESP score was developed by Alexander Kroll and can be found at link
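
As a rough illustration of that format, a test case can be split into its two fields as shown below. This sketch is not part of the repository. It assumes the two fields are separated by whitespace, that the protein sequence itself contains no whitespace, and that everything after the first separator is the substrate representation; verify these assumptions against the files in ESP_data_eval. The helper name `parse_esp_line` is hypothetical.

```python
def parse_esp_line(line):
    """Split one test case into (protein_sequence, substrate_representation)."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"expected two fields, got: {line!r}")
    return parts[0], parts[1]
```

For example, `parse_esp_line("MKV CCO")` yields the pair `("MKV", "CCO")`, where both values are made-up placeholders rather than real evaluation data.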

Citation

If you find our work helpful, please consider citing our paper.

@inproceedings{songsurfpro,
  title={SurfPro: Functional Protein Design Based on Continuous Surface},
  author={Song, Zhenqiao and Huang, Tinglin and Li, Lei and Jin, Wengong},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}