CDPred is a deep transformer tool for predicting interchain residue-residue distances of protein dimers or any two interacting chains in protein complexes. For a homodimer consiting of two identical chains, it takes the tertiary structure and multiple sequence alignment (MSA) of a chain as input to predict the residue-residue distances between the two chains. For a heterodimer consisting of two different chains, it takes the tertiary structures and MSA of the two chains as input to predict residue-residue distances between the two chains. The tertiary structure input for a chain can be generated by any tertiary structure prediction tool such as AlphaFold. The MSA can also be prepared by a third-party tool. If necessary, users can also use a custom script in this package to generate a MSA as input. The input is converted into numerical features that are used by 2D attention-based transformer networks to predict residue-residue distances between two chains in a dimer.
This package is developed on Linux. The package has been tested on the following two Linux systems:
Linux: Ubuntu 16.04
Linux: CentOS Linux release 7.9.2009
The system is developed and tested under Python 3.6.x. The main dependent packages and their versions are as follows. For more detail, please check the requirment.txt file.
fair-esm==0.3.1
Keras==2.1.6
matplotlib==3.3.4
numpy==1.16.2
tensorflow==1.9.0
(1) Download CDPred package (a short path for the package is recommended)
git clone https://github.com/BioinfoMachineLearning/CDPred.git
cd CDPred
(2) Install and activate Python 3.6.x environment on Linux (required)
The installation of Python 3.6.x may be different for different Linux systems.
mkdir env
python3.6 -m venv env/CDPred_virenv
source env/CDPred_virenv/bin/activate
pip install --upgrade pip
pip install -r requirments.txt
(3) Download Uniref90
Download the Uniref90_01_2020 database from znodo for PSSM generation
aria2c -x 10 https://zenodo.org/record/7650566/files/uniref90_01_2020.tar.xz?download=1
xz -d -T 4 uniref90_01_2020.tar.xz
tar -xvf uniref90_01_2020.tar
Modify the Uniref90 path in script ./lib/constants.py as /Download_Path/uniref90_01_2020/uniref90
The installation and configuration of the virtual environment lasts about 10 minutes (minor difference on different devices).
And the the Uniref90 database download will take about 40 minutes to 70 minutes, dependent on your network speed.
If you encounter any errors related to the shared library while generating features, please add the corresponding libraries to your system path.
Command:
python CDPred_Installation_Path/lib/Model_predict.py -n [name] -p [pdb_file_list] -a [a3m_file] -m [model_option] -o [out_path]
Parameters:
-n
– The name of the protein complex, can be protein ID or custom name.
-p
– The predicted monomer tertiary structure file or files with ".pdb" suffix. For homodimer inter-chain distance prediction, one predicted monomer structure file is enough. For heterodimer inter-chain distance prediction, both chains' predicted monomer structure files are required and needed to seperate by one space (Check the detail in Demo section).
-a
– Multiple sequence alignment (MSA) file in ".a3m" format. You can use your own or any third-party tool to generate MSA file, or you can follow the instruction in ZComplexMSA to install our custom MSA generation tool (Require large disk space and long time for dataset downloading).
-m
– Model option for different type prediction. Use "homodimer" for homodimer inter-chain distance prediction. Use "heterodimer" for heterodimer inter-chain distance prediction.
-o
– The custom output folder. It will be automaticly created if not exist.
Demo1: Run CDPred on a homodimer target.
python lib/Model_predict.py -n T1084A_T1084B -p ./example/T1084A_T1084B.pdb -a ./example/T1084A_T1084B.a3m -m homodimer -o ./output/T1084A_T1084B/
The location of the pre-generated output files is ./example/expection_output/T1084A_T1084B/, and the location of the output file generated by your run is ./output/T1084A_T1084B/. The whole prediction process will last about 5 minutis
Demo2: Run CDPred on a heterodimer target.
python lib/Model_predict.py -n H1017A_H1017B -p ./example/H1017A.pdb ./example/H1017B.pdb -a ./example/H1017A_H1017B.a3m -m heterodimer -o ./output/H1017A_H1017B/
The location of the pre-generated output files is ./example/expection_output/H1017A_H1017B/, and the location of the output file generated by your run is ./output/H1017A_H1017B/. The whole prediction process will last about 5 minutis
The outputs will be saved in directory provided via the-o
flag of Model_predict.py
. The outputs include multiple sequence alignment files, feature files, and prediction inter-chain distance/contact maps.
The --output_dir
directory will have the following structure, "name" is provided via the -n
flag of Model_predict.py
:
<custom_output_name>/
feature/
"name"_pssm.txt
"name".npy
"name".mat
"name".fasta
"name".dist
"name".aln
"name".a3m
predmap/
"name"_dist.rr
"name"_con.rr
"name".htxt
"name".dist
The contents of each output file are as follows:
name_pssm.txt
– Position-specific scoring matrix (PSSM) feature.name.npy
– Row attention map generated by ESM and used as one main co-evolutionary feture.name.mat
– Co-evolutinary score matrix generate by CCMpred.name.fasta
– Fasta sequence file of the input homodimer/heterodimer.name.dist
– A combination distance map in shape LxL (L:the length of dimer) of tow monomer's carbon alpha distance map that extract from input prediction monomer structure.name.aln
– Multiple sequence alignment in 'aln' .name.a3m
– Multiple sequence alignment.name_dist.rr
– Residue-Residue distance prediction in format i, j, dist
.
i
and j
indicate specifying pairs of residues.dist
indicates the prediction Euclidean heavy-atom distance between i
and j
.name_con.rr
– Residue-Residue contact prediction in format i, j, 0, 8, prob
.
i
and j
indicate specifying pairs of residues.0
and 8
indicate the distance limits defining a contact. Here a pair of residues is defined to be in contact when the minimum distance of heavy atoms is less then 8 Angstroms. prob
indicate the prediction contact probability under above distance limits.name.htxt
– Prediction inter-chain contact map.name.dist
– Prediction inter-chain distance map.Command:
python CDPred_Installation_Path/lib/distmap_evaluate.py -p [pred_map] -t [true_map] -f1 [fasta_file1] -f2 [fasta_file2]
Parameters:
-p
– The prediction contact map with '.htxt' suffix.
-t
– The nativate distance/contact map with '.htxt' suffix.
-f1
– The fasta sequence file of chain 1 of dimer.
-f2
– The fasta sequence file of chain 2 of dimer.
Demo1: Evaluate the homodimer target.
python ./lib/distmap_evaluate.py -p ./example/expection_output/T1084A_T1084B/predmap/T1084A_T1084B.htxt -t ./example/ground_truth/T1084A_T1084B.htxt -f1 ./example/ground_truth/T1084A.fasta -f2 ./example/ground_truth/T1084B.fasta
Expection output of Demo1:
NAME LEN_A LEN_B TOP5 TOP10 TOPL/10 TOPL/5 TOPL/2 TOPL
T1084A_T1084B 71 71 100.0000 100.0000 100.0000 100.0000 94.2857 91.5493
Demo2: Evaluate the heterodimer target.
python ./lib/distmap_evaluate.py -p ./example/expection_output/H1017A_H1017B/predmap/H1017A_H1017B.htxt -t ./example/ground_truth/H1017A_H1017B.htxt -f1 ./example/ground_truth/H1017A.fasta -f2 ./example/ground_truth/H1017B.fasta
Expection output of Demo2:
NAME LEN_A LEN_B TOP5 TOP10 TOPL/10 TOPL/5 TOPL/2 TOPL
H1017A_H1017B 110 125 60.0000 60.0000 54.5455 50.0000 41.8182 36.3636
This project is covered under the MIT License.
[Guo, Z., Liu, J., Skolnick, J., & Cheng, J. (2022). Guo, Z., Liu, J., Skolnick, J., & Cheng, J. (2022). Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks. Nature Communications, 13(1), 6963]