KosinskiLab / AlphaPulldownSnakemake

GNU General Public License v3.0

A snakemake pipeline for automated structure prediction using various backends.

Installation

Before installation, make sure your Python version is at least 3.10:

python3 --version
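
If your system Python is older, one way to obtain Python 3.10 is a dedicated conda environment; this is only a sketch, and the environment name below is arbitrary:

conda create -n alphapulldown python=3.10
conda activate alphapulldown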
  1. Install Dependencies

    pip install snakemake==7.32.4 snakedeploy==0.10.0 pulp==2.7 click==8.1 cookiecutter==2.6
  2. Snakemake Cluster Setup

    To allow Snakemake to interface with a compute cluster, we use the Snakemake profile for SLURM. If you are not working on a SLURM cluster, you can find profiles for other architectures here. The following commands create a profile that can be used with Snakemake and prompt you for some additional information.

    git clone https://github.com/Snakemake-Profiles/slurm.git
    profile_dir="${HOME}/.config/snakemake"
    mkdir -p "$profile_dir"
    template="gh:Snakemake-Profiles/slurm"
    cookiecutter --output-dir "$profile_dir" "$template"

    During the setup process, you will be prompted to answer several configuration questions. Below are the questions and the recommended responses:

    • profile_name [slurm]: slurm_noSidecar
    • Select use_singularity: 1 (False)
    • Select use_conda: 1 (False)
    • jobs [500]: (Press Enter to accept default)
    • restart_times [0]: (Press Enter to accept default)
    • max_status_checks_per_second [10]: (Press Enter to accept default)
    • max_jobs_per_second [10]: (Press Enter to accept default)
    • latency_wait [5]: 30
    • Select print_shell_commands: 1 (False)
    • sbatch_defaults []: qos=low nodes=1
    • Select cluster_sidecar: 2 (no)
    • cluster_name []: (Press Enter to leave blank)
    • cluster_jobname [%r_%w]: (Press Enter to accept default)
    • cluster_logpath [logs/slurm/%r/%j]: (Press Enter to accept default)
    • cluster_config []: (Press Enter to leave blank)

    After responding to these prompts, your Snakemake profile named slurm_noSidecar will be configured as specified.
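
    To verify that the profile was created, you can list its directory; given the output directory and profile name chosen above, it should live under your Snakemake config directory:

    ls "${HOME}/.config/snakemake/slurm_noSidecar/"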

  3. Singularity (Probably Installed Already): This pipeline uses containers for reproducibility. If you are working on the EMBL cluster, Singularity is already installed and you can skip this step. Otherwise, please install Singularity using the official Singularity guide.
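
    You can check whether Singularity is already available on your system with:

    singularity --version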

  4. Download The Pipeline: This downloads the version of the Snakemake pipeline specified by --tag and creates a directory named AlphaPulldownSnakemake, or any other name you choose.

    snakedeploy deploy-workflow \
      https://github.com/KosinskiLab/AlphaPulldownSnakemake \
      AlphaPulldownSnakemake \
      --tag 1.3.0
    cd AlphaPulldownSnakemake
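
    After deployment, the directory should look roughly like the sketch below; snakedeploy creates a thin Snakefile that pulls in the tagged workflow as a module, plus a local copy of the configuration:

    AlphaPulldownSnakemake/
    ├── workflow/
    │   └── Snakefile      # declares the deployed workflow as a module
    └── config/
        └── config.yaml    # pipeline configuration, edited in the next section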

Configuration

Adjust config/config.yaml for your particular use case.

input_files

This variable holds the path to your sample sheet, where each line corresponds to a folding job. For this pipeline we use the following format specification:

protein:N:start-stop[_protein:N:start-stop]*

where protein is a path to a file with a '.fasta' extension or a UniProt ID, N is the number of monomers of this particular protein, and start and stop are the first and last residues to be predicted. Only protein is required; N, start, and stop can be omitted. Hence, assuming the protein example is 50 residues long, the following folding jobs are all equivalent:

example:2
example_example
example:2:1-50
example:1-50_example:1-50
example:1:1-50_example:1:1-50

This format similarly extends for the folding of heteromers:

example1_example2
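
Putting this together, a sample sheet is a plain-text file with one folding job per line. The snippet below is a hypothetical sheet mixing UniProt IDs and a local FASTA file; all identifiers and paths are placeholders:

P01234
P01234:2:1-50
P01234_P56789
data/bait.fasta:2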

Assuming you have two sample sheets config/sample_sheet1.csv and config/sample_sheet2.csv, the following would be equivalent to computing all versus all within sample_sheet1.csv:

input_files:
  - config/sample_sheet1.csv
  - config/sample_sheet1.csv

while the snippet below would be equivalent to computing the pulldown between sample_sheet1.csv and sample_sheet2.csv:

input_files:
  - config/sample_sheet1.csv
  - config/sample_sheet2.csv

This format can be extended to as many files as you like, but keep in mind that the number of folds grows multiplicatively with each additional file.

input_files:
  - config/sample_sheet1.csv
  - config/sample_sheet2.csv
  - ...

alphafold_data_directory

This is the path to your AlphaFold database.

output_directory

Snakemake will write the pipeline output to this directory. If it does not exist, it will be created.

save_msa, use_precomputed_msa, predictions_per_model, number_of_recycles, report_cutoff

Command-line arguments that were previously passed to AlphaPulldown's run_multimer_jobs.py and create_notebook.py (report_cutoff).

alphafold_inference_threads, alphafold_inference

Slurm-specific parameters that do not need to be modified by non-expert users.

only_generate_features

If set to True, the pipeline stops after generating features and performs neither structure prediction nor reporting.
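
Putting the options above together, a minimal config/config.yaml might look like the following sketch; the keys are those described in this section, but every value shown is a placeholder to adapt to your setup:

input_files:
  - config/sample_sheet1.csv
alphafold_data_directory: /path/to/alphafold_database
output_directory: /path/to/output
save_msa: False
use_precomputed_msa: False
predictions_per_model: 1
number_of_recycles: 3
report_cutoff: 50
only_generate_features: False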

Execution

After following the Installation and Configuration steps, you are ready to run the Snakemake pipeline. To do so, navigate into the deployed pipeline directory and run:

snakemake \
  --use-singularity \
  --singularity-args "-B /scratch:/scratch \
    -B /g/kosinski:/g/kosinski \
    --nv " \
  --jobs 200 \
  --restart-times 5 \
  --profile slurm_noSidecar \
  --rerun-incomplete \
  --rerun-triggers mtime \
  --latency-wait 30 \
  -n

Here's a breakdown of what each argument does:

• --use-singularity: Runs jobs inside the Singularity containers defined by the workflow.
• --singularity-args: Arguments passed through to Singularity; here -B bind-mounts /scratch and /g/kosinski into the container (adjust these paths to your cluster's filesystems), and --nv enables GPU support.
• --jobs 200: Submits up to 200 jobs to the cluster at a time.
• --restart-times 5: Retries failed jobs up to 5 times.
• --profile slurm_noSidecar: Uses the Slurm profile created during installation.
• --rerun-incomplete: Reruns jobs whose previous output is incomplete.
• --rerun-triggers mtime: Considers only file modification times (not, e.g., code or parameter changes) when deciding whether a job must be rerun.
• --latency-wait 30: Waits up to 30 seconds for output files to appear after a job finishes, which accommodates slow shared filesystems.
• -n: Performs a dry run in which Snakemake only prints what would be done. Remove this flag to actually execute the pipeline.

After removing the -n flag, executing the command above will submit the following jobs to the cluster:

Snakemake rulegraph