Haddox / score_monomeric_designs

Computational pipeline for scoring protein designs using a variety of biophysical metrics

Pipeline for computing an array of biophysical metrics for monomeric proteins

Summary

This pipeline computes an array of biophysical metrics for an input set of monomeric proteins. The metrics include those from Rocklin et al., 2017, Science, and much of the code was derived from that paper. Additional metrics have been added since then.

Note: There are many ways this pipeline can be improved! Please feel free to make improvements and push them to the repository. That could be anything from improving the documentation, to expanding the number of metrics computed, to fixing errors that cause the pipeline to crash.

Organization of code

Installing external dependencies

Carrying out the pipeline requires multiple external dependencies. Unfortunately, the full set of required external dependencies is currently only available on the Baker lab server. This includes:

Dependencies that are installable using Conda

Nearly all dependencies are encoded in the file called environment.yml. If you're working on the Baker lab server, these dependencies can all be installed using Conda. To do so, first clone this repository. Then, in the root directory of the repository, execute the command:

conda env create -f environment.yml -n {env_name}

where env_name is whatever name you'd like to call the environment.
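For example, a minimal sequence of commands might look like the following (the environment name score_designs is just an illustration; use whatever name you prefer):

# Clone the repository, then create the Conda environment from environment.yml
git clone https://github.com/Haddox/score_monomeric_designs.git
cd score_monomeric_designs
conda env create -f environment.yml -n score_designs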

File with paths to external dependencies that cannot be installed with Conda

Dependencies in np_aa_burial.py:

/software/rosetta/versions/v2019.01-dev60566/bin/rosetta_scripts.hdf5.linuxgccrelease

Dependencies in make_fragments.py:

fragment_tools = '/work/robetta/workspace/labFragPicker_DO_NOT_REMOVE/Rosetta/tools/fragment_tools/'
psipred = '/work/robetta/workspace/labFragPicker_DO_NOT_REMOVE/psipred3.21/'
scripts_dir = '/work/robetta/workspace/labFragPicker_DO_NOT_REMOVE/bakerlab_scripts/boinc/'
nnmake = '/work/robetta/workspace/labFragPicker_DO_NOT_REMOVE/nnmake/pNNMAKE.gnu'
csbuild = '/work/robetta/workspace/labFragPicker_DO_NOT_REMOVE/csbuild/'
cm_scripts = '/work/robetta/workspace/labFragPicker_DO_NOT_REMOVE/cm_scripts/bin/'
rosetta = '/work/robetta/workspace/labFragPicker_DO_NOT_REMOVE/Rosetta/main/source/bin/'

core_clusters, buried_np.xml

How to run the pipeline

Running the pipeline involves two commands:

First, activate the Conda environment described above using the command:

source activate {environment_name}

where environment_name is the name of the environment. This will give the pipeline access to many of the required external dependencies.

Second, use jug to execute the pipeline:

jug execute --jugdir {jugdir} scripts/score_designs.py {path_to_directory_with_input_pdbs} {path_to_output_directory}
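For example, with the placeholders filled in (the environment name, jug directory, and input/output paths below are purely illustrative):

source activate score_designs
jug execute --jugdir jugdir/ scripts/score_designs.py designs/pdbs/ results/scores/

jug stores its bookkeeping data in the --jugdir directory, so use the same jugdir for every command that refers to the same run.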

Example Jupyter notebook that runs the pipeline with a handful of test PDBs

The notebook called test_scoring.ipynb runs the pipeline on ten structures from Rocklin et al. and then checks that the results from the pipeline match those of the publication.
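To open and run the notebook interactively (assuming Jupyter is available in the activated Conda environment):

jupyter notebook test_scoring.ipynb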

How to run the pipeline using sbatch

You can use sbatch to parallelize the jobs across multiple CPUs. For an example of a file that can be used as input for sbatch, see the file called results/test_scoring/run.sbatch, which is generated by test_scoring.ipynb. To submit the job to Slurm, the notebook simply executes the command: sbatch results/test_scoring/run.sbatch.
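Below is a minimal sketch of what such an sbatch file might look like. The job name, array size, resource requests, environment name, and paths are assumptions for illustration, not the actual contents of results/test_scoring/run.sbatch:

#!/bin/bash
#SBATCH --job-name=score_designs
#SBATCH --array=0-9            # one jug worker per array task; size is illustrative
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Each array task starts its own jug worker; the workers coordinate through
# the shared --jugdir, so no task is run twice.
source activate score_designs
jug execute --jugdir jugdir/ scripts/score_designs.py designs/pdbs/ results/scores/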

Checking the status of the run

You can also use jug to check the "status" of the job, i.e., how many jobs have and have not been completed:

jug status --jugdir {jugdir} scripts/score_designs.py {path_to_folder_with_pdbs} {path_to_output_folder}

Unfortunately, this is slow: the status command still needs to initialize all of the Rosetta XML filters, which takes several minutes. I haven't figured out a way around this.
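For example, with the same illustrative paths used above (the script arguments must match the ones passed to jug execute so that jug locates the same set of tasks):

jug status --jugdir jugdir/ scripts/score_designs.py designs/pdbs/ results/scores/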

What to do if the pipeline fails

Sometimes the pipeline fails in ways that are obvious, where the script raises a clear error (e.g., KeyError). In these cases, the code needs to be changed in order for it to work correctly. Of note: any temporary files and directories should be deleted before rerunning the pipeline (see below).

However, there are other times when the pipeline fails for reasons that are still mysterious to me. In these cases, I can delete temporary files and directories, rerun the pipeline immediately, and it works.

In each of the above cases, I suggest deleting all temporary files and directories before rerunning the pipeline. Below is a list of possible directories/files to check for:

Description of metrics

Most metrics are described in the supplemental material of Rocklin et al., 2017, Science. Below are descriptions of some of the new metrics:

Metrics related to fragment quality:

Metrics related to per-residue Rosetta energies of fragments in primary sequence:

Metrics related to per-residue energies of 3D neighborhoods:

Metrics related to counts of pairwise amino-acid contacts in 3D space:

Metrics related to buried hydrophobic surface area:

Metrics related to side-chain entropy:

Metrics related to packing and cavities:

Metrics related to ABEGO types: