awfderry / COLLAPSE

Representation learning for protein functional site analysis
MIT License

COLLAPSE

COLLAPSE (COmpressed Latents Learned from Aligned Protein Structural Environments) is a representation learning method for protein structural and functional sites, as described in Derry et al. (2022). This repo contains all package functionality as well as scripts for functional site search and annotation, pre-training, and transfer learning on Prosite and ATOM3D datasets. For more details on COLLAPSE, please see our paper. A preprint is also freely available on bioRxiv.

The repo is organized as follows:

Requirements

We recommend installing COLLAPSE in a Conda environment (tested with GCC version 10.1.0). To create and activate your Conda environment, run the following:

conda create -n collapse python=3.9
conda activate collapse

To install the required packages on a machine with a GPU and CUDA v11.7 (recommended), run the following script:

./install_dependencies.sh

For CPU-only functionality, you can run the following:

./install_dependencies_cpu.sh

Scripts may require additional dependencies, which may be installed using conda or pip as needed.

Installation

Install the package using pip:

pip install .

Downloading datasets

Datasets are hosted on Zenodo. The following datasets are available for download depending on your use case.

To download a dataset, either fetch it directly from Zenodo or use the following script, where FILENAME is the name of the file on Zenodo (e.g., checkpoints.tar.gz):

cd data
bash download_data.sh FILENAME

Usage

Here we provide usage examples for several applications of COLLAPSE.

Embed all residues in a PDB file

To embed all residues in a single structure using COLLAPSE, use the following lines of code. Here, PDB_FILE is the path to the PDB file containing the structure to be embedded, and DEVICE specifies where the embedding runs: cpu (default) or cuda. In this example, we embed only chain "A" and include all heteroatoms (ligands, ions, and cofactors).

from collapse import process_pdb, initialize_model, embed_protein

# Create model and load default parameters (pre-trained COLLAPSE model)
model = initialize_model(device=DEVICE)
# Load PDB file and pre-process to dataframe representation
atom_df = process_pdb(PDB_FILE, chain='A')
# Embed protein, returning dictionary of embeddings and metadata
emb_data = embed_protein(atom_df, model, include_hets=True, device=DEVICE)
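If you would rather not hard-code DEVICE, one option is to choose it at runtime. The snippet below is a minimal sketch, not part of the COLLAPSE package: pick_device is a hypothetical helper name, and it assumes PyTorch is installed alongside COLLAPSE. It falls back to cpu when PyTorch is missing or no CUDA device is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # assumed to be installed alongside COLLAPSE
    except ImportError:
        # Without PyTorch there is nothing to query, so default to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


DEVICE = pick_device()
print(DEVICE)
```

The resulting string can then be passed as the device argument in the calls shown above.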

The output of embed_protein is a dictionary containing the following data:
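To check what a returned dictionary holds without assuming any particular key names, a small generic helper can help. This is a sketch only: describe is a hypothetical helper, and the stand-in dictionary uses made-up keys purely so that the example runs without COLLAPSE installed.

```python
def describe(data):
    """Map each key to (type name, shape or length) for quick inspection."""
    summary = {}
    for key, value in data.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            # Array-like values (e.g. NumPy arrays or tensors) expose .shape.
            size = tuple(shape)
        elif hasattr(value, "__len__"):
            # Lists, tuples, and strings report their length instead.
            size = len(value)
        else:
            size = None
        summary[key] = (type(value).__name__, size)
    return summary


# Stand-in data with made-up keys; with COLLAPSE installed you would
# call describe(emb_data) on the real output instead.
example = {"vectors": [[0.1, 0.2], [0.3, 0.4]], "labels": ["a", "b"], "count": 2}
for key, info in describe(example).items():
    print(key, info)
```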

Embed entire dataset of PDB files

To embed all residues of all structures in a directory of PDB files, use the following script. PDB_DIR is the root directory of all the PDB files to be processed, possibly containing subdirectories. Accepted formats include pdb, pdb.gz, and cif. OUT_DIR is the location of the processed dataset.

python embed_pdb_dataset.py PDB_DIR OUT_DIR --filetype pdb

This script produces an embedding dataset in LMDB format, a key-value store that allows compressed, fast, random access to all elements in the database. Each element of the dataset produced by embed_pdb_dataset.py has the same keys as the output of embed_protein (see above), in addition to the following data from the initial PDB file:

To load this dataset in a PyTorch-style dataset format, you can use ATOM3D:

from atom3d.datasets import load_dataset
dataset = load_dataset(OUT_DIR, 'lmdb')
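Once loaded, the dataset can be indexed like a sequence. The sketch below assumes that the object returned by load_dataset supports len() and integer indexing and that each item is a dictionary; peek is a hypothetical helper, and the stand-in list with made-up keys only serves to make the example runnable without ATOM3D.

```python
def peek(dataset, n=3):
    """Return the sorted keys of the first n items of an indexable dataset."""
    count = min(n, len(dataset))
    return [sorted(dataset[i].keys()) for i in range(count)]


# Stand-in for a loaded dataset; the keys here are made up for illustration.
toy_dataset = [
    {"name": "entry_0", "payload": [0.1, 0.2]},
    {"name": "entry_1", "payload": [0.3, 0.4]},
]
print(peek(toy_dataset))
```

With a real dataset, peek(dataset) gives a quick view of which keys each element carries before you write any downstream code.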

Additional arguments are:

If processing in chunks, each chunk of the processed dataset is stored in a tmp_ directory. To combine these into a full processed dataset, you can use the following script from ATOM3D:

python -m atom3d.datasets.scripts.combine_lmdb OUT_DIR/tmp_* OUT_DIR/full

Iterative search of functional site against PDB database

Given the structural site defined by a specific residue in a PDB file, you can search against a structure database using the following command.

python search_site.py PDB_FILE CHAIN RESID DATABASE
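Conceptually, a search of this kind compares the query site's embedding with every embedding in the database and ranks them by cosine similarity. The toy sketch below illustrates only that ranking step in plain Python. It is not the code in search_site.py and does not reproduce its iteration or quantile transform; the function names, site IDs, and two-dimensional vectors are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, database, top_k=2):
    """Return the top_k (site_id, similarity) pairs, most similar first."""
    scored = [(site_id, cosine(query, vec)) for site_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy database of hypothetical site IDs mapped to made-up embeddings.
toy_db = {"site_1": [1.0, 0.0], "site_2": [0.0, 1.0], "site_3": [0.7, 0.7]}
hits = rank_by_similarity([1.0, 0.1], toy_db)
for site_id, score in hits:
    print(site_id, round(score, 3))
```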

Additional arguments are:

The output file is a CSV listing the resulting PDBs, residue IDs, protein metadata, quantile-transformed cosine similarity, and the iteration and query in which each result first appeared. The first row contains the query structure and residue ID.

Example:

python search_site.py data/examples/1a0h.pdb B H363 data/datasets/pdb100_embeddings/pdb_embeddings.pkl --cutoff 1e-3 --verbose --num_iter 3
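Because the first row of the output holds the query, it is often convenient to separate it from the hits when post-processing. The sketch below uses only the standard csv module and assumes the file starts with a header line followed by data rows, which should be checked against an actual output file. split_query_and_hits is a hypothetical helper, and the column names in the toy data are placeholders rather than the script's real columns.

```python
import csv
import io


def split_query_and_hits(handle):
    """Read a results CSV and return (query_row, list_of_hit_rows)."""
    rows = list(csv.DictReader(handle))
    if not rows:
        return None, []
    # Per the description above, the first data row is the query itself.
    return rows[0], rows[1:]


# In-memory toy file with placeholder column names and values.
toy_csv = io.StringIO("col_a,col_b\nquery,1\nhit1,2\nhit2,3\n")
query, hits = split_query_and_hits(toy_csv)
print(query)
print(len(hits))
```

For a real run, pass an open file handle instead of the in-memory toy file, for example with open(path, newline="") as handle.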

Annotate structure using functional site database

To annotate chains A and B in the structure stored in PDB_FILE, use the following command. The output will be a printed summary of the functional sites detected and the corresponding residues. You can also supply more than one PDB file to be annotated, each separated by a space. By default, the functional site database contains conserved residues from Prosite and the Catalytic Site Atlas (CSA).

python annotate_pdb.py PDB_FILE --chains AB
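When annotating many structures from a Python workflow, the same command line can be assembled programmatically. This is a small sketch that builds the argument list only: annotate_command is a hypothetical helper, the file names are placeholders, and only the positional files and the --chains option shown above are used.

```python
import sys


def annotate_command(pdb_files, chains="AB"):
    """Build the argument list for annotating one or more PDB files."""
    # Files are passed as space-separated positional arguments, then --chains.
    return [sys.executable, "annotate_pdb.py", *pdb_files, "--chains", chains]


# Placeholder file names for illustration.
cmd = annotate_command(["first.pdb", "second.pdb"])
print(" ".join(cmd[1:]))
```

From the repository root, the list can be executed with subprocess.run(cmd, check=True).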

Additional arguments are:

License

This project is licensed under the MIT License.

References

If you use COLLAPSE, please cite our manuscript:

Derry, A., & Altman, R. B. (2022). COLLAPSE: A representation learning framework for identification and characterization of protein structural sites. Protein Science, 32(2), e4541. https://doi.org/10.1002/pro.4541