Bhattacharya-Lab / EquiPNAS

pLM-informed E(3) equivariant deep graph neural networks for protein-nucleic acid binding site prediction
GNU General Public License v3.0

EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks

by Rahmatullah Roche, Bernard Moussad, Md Hossain Shuvo, Sumit Tarafder, and Debswapna Bhattacharya

published in Nucleic Acids Research

Codebase for our improved protein-nucleic acid binding site prediction approach, EquiPNAS.

Workflow

Installation

1.) We recommend using a conda virtual environment to install the dependencies for EquiPNAS. The following command creates a virtual environment named 'EquiPNAS':

conda env create -f EquiPNAS_env.yml

2.) Then activate the virtual environment

conda activate EquiPNAS

3.) Download the trained models from here

That's it! EquiPNAS is ready to be used.

Usage

To see usage instructions, run python EquiPNAS.py -h

usage: EquiPNAS.py [-h] [--model_state_dict MODEL_STATE_DICT] [--indir INDIR] [--outdir OUTDIR] [--num_workers NUM_WORKERS]

options:
  -h, --help            show this help message and exit
  --model_state_dict MODEL_STATE_DICT
                        Saved model
  --indir INDIR         Path to input data containing distance maps and input features (default 'datasets/DNA_test_129_Preprocessing_using_AlphaFold2/')
  --outdir OUTDIR       Prediction output directory
  --num_workers NUM_WORKERS
                        Number of workers (default=4)
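The help text above can be read as an argparse interface. The following is a minimal sketch that rebuilds it from the documented options only; it is not the repository's actual parser. The `None` defaults for `--model_state_dict` and `--outdir` are assumptions, since the help text does not state them, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented EquiPNAS.py options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="EquiPNAS.py")
    # Default is an assumption: the help text does not state one.
    parser.add_argument("--model_state_dict", type=str, default=None,
                        help="Saved model")
    # Default taken from the help text.
    parser.add_argument(
        "--indir", type=str,
        default="datasets/DNA_test_129_Preprocessing_using_AlphaFold2/",
        help="Path to input data containing distance maps and input features")
    # Default is an assumption: the help text does not state one.
    parser.add_argument("--outdir", type=str, default=None,
                        help="Prediction output directory")
    # Default taken from the help text.
    parser.add_argument("--num_workers", type=int, default=4,
                        help="Number of workers (default=4)")
    return parser


# Parse the example invocation from this README rather than sys.argv.
args = build_parser().parse_args([
    "--model_state_dict", "models/EquiPNAS-DNA/E-l12-768.pt",
    "--indir", "Preprocessing/",
    "--outdir", "output/",
])
print(args.model_state_dict, args.indir, args.outdir, args.num_workers)
```

Because `--num_workers` is omitted in the example invocation, it falls back to the documented default of 4.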

Here is an example of running EquiPNAS:

1.) The input target list and all input files should be inside the input preprocessing directory (examples can be found here: Preprocessing/). Detailed preprocessing instructions can be found here.

2.) Make an output directory: mkdir output

3.) Run python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir Preprocessing/ --outdir output/

4.) The residue-level protein-DNA or protein-RNA binding site predictions are written to output/.
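Steps 2 and 3 above can also be scripted. The sketch below assumes the working directory is the repository root and that the model file has already been downloaded to the path used in step 3. `build_command` and `run_equipnas` are hypothetical helper names, not part of EquiPNAS. Only the command line is printed here; calling `run_equipnas` would actually launch the prediction.

```python
import shlex
import subprocess
from pathlib import Path


def build_command(model_state_dict, indir, outdir, num_workers=4):
    """Assemble the EquiPNAS.py command line from the documented options."""
    return [
        "python", "EquiPNAS.py",
        "--model_state_dict", str(model_state_dict),
        "--indir", str(indir),
        "--outdir", str(outdir),
        "--num_workers", str(num_workers),
    ]


def run_equipnas(model_state_dict, indir, outdir, num_workers=4):
    """Check the inputs, create the output directory, then run the prediction."""
    if not Path(model_state_dict).is_file():
        raise FileNotFoundError(f"model file not found: {model_state_dict}")
    if not Path(indir).is_dir():
        raise NotADirectoryError(f"input directory not found: {indir}")
    # Equivalent to step 2 (mkdir output); does nothing if it already exists.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if EquiPNAS.py exits with an error.
    subprocess.run(build_command(model_state_dict, indir, outdir, num_workers),
                   check=True)


# Build and display the command from step 3 without executing it.
cmd = build_command("models/EquiPNAS-DNA/E-l12-768.pt", "Preprocessing/", "output/")
print(shlex.join(cmd))
```

Checking the paths before launching gives a clear error message when the model download in the installation step was skipped.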

Training

For protein-DNA binding site prediction, we obtain the training targets from here, and for protein-RNA binding site prediction, we obtain the training targets from here. Our full training dataset, containing the training code, target list, and features for both protein-DNA and protein-RNA, can be found here. The training procedure is as follows:

Train scripts

Train model for protein-DNA binding site

To train a protein-DNA binding site prediction model on your own dataset, place the training target list and all input files inside the training data directory, preprocessed as described earlier here. Example training data for protein-DNA binding site prediction can be found here.

To retrain the protein-DNA binding site prediction model with our dataset, download the train features and data from here.

The trained model will be saved inside: model/DNA

Train model for protein-RNA binding site

To train a protein-RNA binding site prediction model on your own dataset, place the training target list and all input files inside the training data directory, preprocessed as described earlier here. Example training data for protein-RNA binding site prediction can be found here.

To retrain the protein-RNA binding site prediction model with our dataset, download the train features and data from here.

The trained model will be saved inside: model/RNA/

Test set benchmarking

For protein-DNA binding site prediction, we obtain the test targets for Test_129 from here, and for Test_181 from here. For protein-RNA binding site prediction, we obtain the test targets from here. Our full test dataset, containing the test list and features for all the benchmarking datasets, can be found here. The test set benchmarking procedure is as follows:

Pretrained model

Protein-DNA

Test_129

Prediction using AlphaFold2 predicted structural models

Protein-RNA

Test_117

Prediction using AlphaFold2 predicted structural models