HighlanderLab / tree_seq_pipeline

Pipeline to infer tree sequences with different datasets
MIT License

Snakemake Tree sequence pipeline

How to run this pipeline on Eddie

  1. (If not done yet) Configure conda by adding the environment directory (read more and check for updates at https://www.wiki.ed.ac.uk/display/ResearchServices/Anaconda):

    module load anaconda
    conda config --add envs_dirs /exports/cmvm/eddie/eb/groups/HighlanderLab/anaconda/envs

    and add the pkgs directory:

    conda config --add pkgs_dirs /exports/cmvm/eddie/eb/groups/HighlanderLab/anaconda/pkg

    Make sure your ~/.condarc contains the following lines:

    envs_dirs:
    - /exports/cmvm/eddie/eb/groups/HighlanderLab/anaconda/envs
    pkgs_dirs:
    - /exports/cmvm/eddie/eb/groups/HighlanderLab/anaconda/pkg
  2. Add the DRMAA library path to your ~/.bash_profile or equivalent:

    export DRMAA_LIBRARY_PATH=/exports/applications/gridengine/ge-8.6.5/lib/lx-amd64/libdrmaa.so

On Eddie do

# Initiate an interactive login
qlogin -l h_vmem=32G

# load anaconda and activate snakemake environment
module load anaconda/5.3.1
conda deactivate # there should be no (env) in your prompt!
conda activate snakemake

# Go to your workspace
cd path/to/your/workspace

# clone this
git clone https://github.com/HighlanderLab/tree_seq_pipeline.git
cd tree_seq_pipeline

# Create a config file for your dataset. You can use `Snakemake/config/beetest.yaml` as a template; a minimal sketch follows below.
- change o_dir to a location in your workspace (this is the place where the output folder is going to be created)
- change vcf_dir to the folder where your input vcfs are stored (and ancestral inference input, if required)
- all other paths are relative to vcf_dir
- change meta to the (relative) path where your metafile is stored
- change ancestralAllele to the (relative) path where your ancestral allele file is stored
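
A minimal sketch of such a config file (key names as in the bullets above; all values are placeholders you must adapt):

o_dir: ../../MyOutput                   # the output folder will be created here
vcf_dir: /path/to/your/vcfs             # folder with the input VCFs
meta: meta.csv                          # relative to vcf_dir
ancestralAllele: ancestral_alleles.txt  # relative to vcf_dir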

# pipeline is executed from the workflow folder
cd Snakemake/workflow

# for interactive use
snakemake -j 1 --use-conda -n --configfile path/to/config.yaml # for a dry run
snakemake -j 1 --use-conda --configfile path/to/config.yaml # real run
snakemake -j 1 --use-conda -F --configfile path/to/config.yaml # force run (overwrites existing outputs)

# to submit to cluster use
# N = number of chromosomes (so they run in parallel)
snakemake -j N --use-conda --configfile path/to/config.yaml --drmaa " -l h_vmem=32G" --jobscript jobscript.sh &
snakemake -j N --use-conda --configfile path/to/config.yaml --drmaa " -l h_vmem=32G" --jobscript jobscript.sh -F &

# the commands listed here are only examples.
# the --drmaa flag takes the same options as qsub, so other options can be used in addition to -l h_vmem.
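# for example, requesting a runtime limit in addition to memory (values are only illustrative):
snakemake -j N --use-conda --configfile path/to/config.yaml --drmaa " -l h_vmem=32G -l h_rt=24:00:00" --jobscript jobscript.sh &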

Test data

A test data set is now included within the repo in: TestDataBee/

The folder contains the input files needed for the bee test run.

Running the pipeline on the bee test data

Important notes

About config files

Snakemake config files are in YAML format. You can specify multiple config files for a run if desired. This can be hardcoded inside the Snakefile, but it is more flexible to specify config files on the command line via --configfile. You can also supply individual key-value pairs on the command line using --config; these take priority over what is stated in the config file(s). This is useful, e.g., to change the output directory: --config o_dir="../../myOutputDir".
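
For example, assuming you are running from Snakemake/workflow (so the bee config sits at ../config/beetest.yaml):

snakemake -j 1 --use-conda --configfile ../config/beetest.yaml --config o_dir="../../myOutputDir"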

Description of the config files

We used to have two config files, the contents of which are described below; both are now merged in the bee example (see Snakemake/config/beetest.yaml). Important settings are:

All the following file paths are relative to the vcf_dir set above:

Tsinfer parameters (the values used here work for the bee data):

Description of rules and workflow

  1. Split snakemake file

INPUT: You can start with files that are already split by chromosome or with files that are still combined (meaning, the whole genome in one VCF). We do require the files to be named a certain way:
    • files that are already split: Chr{N}_{whatever}.vcf.gz
    • files that are still combined: Combined_{whatever}.vcf.gz

The input VCF files must also be compressed and indexed!
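
If your inputs are plain-text VCFs, they can be compressed and indexed with htslib/bcftools (assumed to be available; the file name is only an example):

bgzip Combined_mydata.vcf                  # produces Combined_mydata.vcf.gz
bcftools index -t Combined_mydata.vcf.gz   # creates the accompanying .tbi index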

RULES:

OUTPUT: The rule creates a folder for each chromosome under the config['vcfdir'] directory. Within each folder, there are split files from each of the input files.

  2. Merge_samples_vcf snakemake file

INPUT: The split VCF files created by the split snakemake file. That means there is one folder per chromosome under config['vcfdir'] that includes all the VCFs for that chromosome.

RULES: If there is more than one file to start with, then the files need to be merged into one VCF:

If there is only one file to start with, there is nothing to merge and this file just needs to be renamed.
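
Conceptually, the merge step does something like this (a sketch with made-up file names, not the pipeline's exact rule):

bcftools merge -O z -o Chr1_final.vcf.gz Chr1_datasetA.vcf.gz Chr1_datasetB.vcf.gz
bcftools index -t Chr1_final.vcf.gz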

OUTPUT: A single VCF per chromosome with all the samples, named Chr_final.vcf.gz, in the config['vcfdir'] directory, plus its index file.

  3. Prepare_files_for_tsinfer snakemake file

INPUT: The final files split by chromosome from the merge rule AND a file with ancestral allele information. The ancestral file has to have two tab-separated columns: chr_pos and AA.
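
A hypothetical excerpt of such an ancestral allele file (tab-separated; positions and alleles made up):

# chr_pos<TAB>AA
1_12345	A
1_20987	G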

RULES:

OUTPUT: One VCF per chromosome, Chr_ancestral.vcf.gz, in the config['vcfdir'] directory with the added ancestral information, plus its index file.

  4. Phasing snakemake file

INPUT: The VCF with the ancestral information from the previous snakemake file, and a genetic (recombination) map.

RULES: If the ploidy is 1, then the file does not have to be phased:

If the ploidy is more than 1, then the files need to be phased (see the sketch below):
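
Purely to illustrate that branching (hypothetical commands and file names, not the pipeline's actual rules):

# ploidy 1: nothing to phase, the file is simply renamed
cp Chr1_ancestral.vcf.gz Chr1_phased.vcf.gz

# ploidy > 1: run a phasing tool of choice (placeholder command)
your_phasing_tool --in Chr1_ancestral.vcf.gz --map chr1.gmap --out Chr1_phased.vcf.gz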

OUTPUT: One VCF per chromosome, Chr_phased.vcf.gz, in the config['vcfdir'] directory with the added ancestral information, plus its index file.

  5. Infer_trees snakemake file

INPUT: One phased VCF file per chromosome with ancestral information AND the file with the meta information about the samples. The meta file must be a comma-separated file with four columns (some can be empty): "ID", "Pop", "SubPop", and "Time".
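
A hypothetical excerpt of such a meta file (values made up; columns in the order listed above):

ID,Pop,SubPop,Time
Sample1,PopA,SubPopX,0
Sample2,PopB,,10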

RULES:

OUTPUT: The output is one tree sequence for each chromosome in the ../Project/Tsinfer/trees directory.
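
The resulting tree sequences can be inspected with tskit, which is installed alongside tsinfer (the file name here is an assumption):

python -c "import tskit; ts = tskit.load('../Project/Tsinfer/trees/Chr1.trees'); print(ts)"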

NOTE: We chose to work with conda environments. Currently, we are using the following environments:

An alternative to using the conda environments is to load envmodules (Eddie modules only), e.g.:

envmodules:
    config['bcftoolsModule']

DAG

The specific workflow DAG will depend on the input files and config. Here is one example, starting with a combined (i.e., multi-chromosome) VCF file:

[exampleDag: example workflow DAG image]