DeeplyTough

This is the official PyTorch implementation of our paper DeeplyTough: Learning Structural Comparison of Protein Binding Sites, available from https://pubs.acs.org/doi/abs/10.1021/acs.jcim.9b00554.

DeeplyTough overview figure

Setup

Code setup

The software is ready for Docker: the image can be created from Dockerfile by running docker build -t deeplytough . (image size ~4.7GB so you may have to increase the disk space available to docker). The DeeplyTough tool is then accessible within deeplytough conda environment inside the container with source activate deeplytough.

Alternatively, environment deeplytough can be created inside local conda by executing the following steps from the root of this repository (linux only):

# create new python 3 env and activate
conda create -y -n deeplytough python=3.6
conda activate deeplytough

# install legacy version of htmd from source
curl -LO https://github.com/Acellera/htmd/archive/refs/tags/1.13.10.tar.gz && \
    tar -xvzf 1.13.10.tar.gz && rm 1.13.10.tar.gz && cd htmd-1.13.10 && \
    python setup.py install && \
    cd .. && \
    rm -rf htmd-1.13.10;

# install remaining python3 reqs
apt-get -y install openbabel
pip install --upgrade pip && pip install -r requirements.txt && pip install --ignore-installed llvmlite==0.28

# install legacy se3nn library from source
git clone https://github.com/mariogeiger/se3cnn && cd se3cnn && git reset --hard 6b976bea4ea17e1bd5655f0f030c6e2bb1637b57 && mv experiments se3cnn; sed -i "s/exclude=\['experiments\*'\]//g" setup.py && python setup.py install && cd .. && rm -rf se3cnn
git clone https://github.com/AMLab-Amsterdam/lie_learn && cd lie_learn && python setup.py install && cd .. && rm -rf lie_learn

# create python2 env used for protein structure preprocessing
conda create -y -n deeplytough_mgltools python=2.7
conda install -y -n deeplytough_mgltools -c bioconda mgltools=1.5.6

Dataset setup

Training and benchmark datasets

The tool comes with built-in support for three datasets: TOUGH-M1 (Govindaraj and Brylinski, 2018), Vertex (Chen et al., 2016), and ProSPECCTs (Ehrt et al., 2018). These datasets must be downloaded if one wishes to either retrain the network or evaluate on one of these benchmarks. The datasets can be prepared in two steps:

Set STRUCTURE_DATA_DIR environment variable to a directory that will contain the datasets (about 27 GB): export STRUCTURE_DATA_DIR=/path_to_a_dir
Run datasets_downloader.sh from the root of this repository and get yourself a coffee

This will download PDB files, extracted pockets and pre-process input features. It will also download lists of pocket pairs provided by the respective dataset authors. By downloading Prospeccts, you accept their terms of use.

Note that this is a convenience and we also provide code for data pre-processing: in case one wishes to start from the respective base datasets, pre-processing may be triggered using the --db_preprocessing 1 flag when running any of our training and evaluation scripts. For the TOUGH-M1 dataset in particular, fpocket2 is required and can be installed as follows:

curl -O -L https://netcologne.dl.sourceforge.net/project/fpocket/fpocket2.tar.gz && tar -xvzf fpocket2.tar.gz && rm fpocket2.tar.gz && cd fpocket2 && sed -i 's/\$(LFLAGS) \$\^ -o \$@/\$\^ -o \$@ \$(LFLAGS)/g' makefile && make && mv bin/fpocket bin/fpocket2 && mv bin/dpocket bin/dpocket2 && mv bin/mdpocket bin/mdpocket2 && mv bin/tpocket bin/tpocket2

Custom datasets

The tool also supports an easy way of computing pocket distances for a user-defined set of pocket pairs. This requires providing i) a set of PDB structures, ii) pockets in PDB format (extracted around bound ligands or detected using any pocket detection algorithm), iii) a CSV file defining the pairing. A toy custom dataset example is provided in datasets/custom. The CSV file contains a quadruplet on each line indicating pairs to evaluate: relative_path_to_pdbA, relative_path_to_pocketA, relative_path_to_pdbB, relative_path_to_pocketB, where paths are relative to the directory containing the CSV file and the pdb extension may be omitted. STRUCTURE_DATA_DIR environment variable must be set to the parent directory containing the custom dataset (in the example /path_to_this_repository/datasets).

Environment setup

To run the evaluation and training scripts, please first set the DEEPLYTOUGH environment variable to the directory containing this repository and then update the PYTHONPATH and PATH variables respectively:

export DEEPLYTOUGH=/path_to_this_repository
export PYTHONPATH=$DEEPLYTOUGH/deeplytough:$PYTHONPATH
export PATH=$DEEPLYTOUGH/fpocket2/bin:$PATH

Evaluation

We provide pre-trained networks in the networks directory in this repository. The following commands assume a GPU and a 4-core CPU available; use --device 'cpu' if there is no GPU and set --nworkers parameter accordingly if there are fewer cores available.

Evaluation on TOUGH-M1:

python $DEEPLYTOUGH/deeplytough/scripts/toughm1_benchmark.py --output_dir $DEEPLYTOUGH/results --device 'cuda:0' --nworkers 4 --net $DEEPLYTOUGH/networks/deeplytough_toughm1_test.pth.tar

Evaluation on Vertex:

python $DEEPLYTOUGH/deeplytough/scripts/vertex_benchmark.py --output_dir $DEEPLYTOUGH/results --device 'cuda:0' --nworkers 4 --net $DEEPLYTOUGH/networks/deeplytough_vertex.pth.tar

Evaluation on ProSPECCTs:

python $DEEPLYTOUGH/deeplytough/scripts/prospeccts_benchmark.py --output_dir $DEEPLYTOUGH/results --device 'cuda:0' --nworkers 4 --net $DEEPLYTOUGH/networks/deeplytough_prospeccts.pth.tar

Evaluation on a custom dataset, located in $STRUCTURE_DATA_DIR/some_custom_name directory:

python $DEEPLYTOUGH/deeplytough/scripts/custom_evaluation.py --dataset_subdir 'some_custom_name' --output_dir $DEEPLYTOUGH/results --device 'cuda:0' --nworkers 4 --net $DEEPLYTOUGH/networks/deeplytough_toughm1_test.pth.tar

Note that networks deeplytough_prospeccts.pth.tar and deeplytough_vertex.pth.tar may also be used, producing different results.

Each of these commands will output to $DEEPLYTOUGH/results a CSV file with the resulting similarity scores (negative distances) as well as a pickle file with more detailed results (please see the code). The CSV files are already provided in this repository for conveniency.

Training

Training requires a GPU with >=11GB of memory and takes about 1.5 days on recent hardware. In addition, at least a 4-core CPU is recommended due to volumetric input pre-processing being an expensive task.

Training for TOUGH-M1 evaluation:

python $DEEPLYTOUGH/deeplytough/scripts/train.py --output_dir $DEEPLYTOUGH/results/TTTT_forTough --device 'cuda:0' --seed 4

Training for Vertex evaluation:

python $DEEPLYTOUGH/deeplytough/scripts/train.py --output_dir $DEEPLYTOUGH/results/TTTT_forVertex --device 'cuda:0' --db_exclude_vertex 'uniprot' --db_split_strategy 'none'

Training for ProSPECCTs evaluation:

python $DEEPLYTOUGH/deeplytough/scripts/train.py --output_dir $DEEPLYTOUGH/results/TTTT_forProspeccts --device 'cuda:0' --db_exclude_prospeccts 'uniprot' --db_split_strategy 'none' --model_config 'se_4_4_4_4_7_3_2_batch_1,se_8_8_8_8_3_1_1_batch_1,se_16_16_16_16_3_1_2_batch_1,se_32_32_32_32_3_0_1_batch_1,se_256_0_0_0_3_0_2_batch_1,r,b,c_128_1'

Note that due to non-determinism inherent to the currently established process of training deep networks, it is nearly impossible to exactly reproduce the pre-trained networks in networks directory.

Also note the convenience of an output directory containing "TTTT" will afford this substring being replaced by the current datetime.

Changelog

23.02.2020: Updated code to follow our revised JCIM paper, in particular away moving from UniProt-based splitting strategy as in our BioRxiv paper to sequence-based clustering approach whereby protein structures sharing more than 30% sequence identity are always allocated to the same testing/training set. We have also made data pre-processing more robust and frozen the versions of several dependencies. The old code is kept in old_bioarxiv_version branch, though note the legacy splitting behavior can be turned on also in the current master by setting --db_split_strategy command line argument in the scripts to uniprot_folds instead of seqclust.
08.12.2020: pinned versions of requirements and updated DockerFile and README to reflect build instructions
28.09.2021: replaced conda htmd with source build in dockerfile to relieve dependency solver (patched: 2.12.2021, also added biopython fn to remove non-protein atoms instead of VMD which is deprecated)

BenevolentAI / DeeplyTough

readme