taxonomic_profiling_pipeline

This pipeline was developed using the Snakemake workflow management system.

To run it, you need the Snakefile, the env folder and its contents (YAML files with environment definitions), and a table with the absolute paths of the forward and reverse read files. The table is specified in config.yaml.

To run on your computer:

snakemake --use-conda

To run in a High Performance Computing cluster with the SGE job scheduler:

snakemake --cluster "qsub -V -cwd -pe smp {threads}" --use-conda -j <# of jobs>

Config options:

The following attributes can be changed/specified in the config.yaml file:

Introduction:

Requirements:

This workflow requires the Conda package manager, which handles the installation of tools and their dependencies.

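This workflow is written in, and therefore requires, Snakemake, which can be installed using Conda. Once Conda is installed, a command of the following form creates a Conda environment with Snakemake (and an additional dependency, Mamba); the channel options shown here are an assumption and may need adjusting for your setup:

    conda create -c conda-forge -c bioconda -n <env-name> snakemake mamba

Replace <env-name> with a name of your choice.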

Optional rarefaction subworkflow

An optional portion of the workflow performs rarefaction on Kraken2 output using a tool called Krakefaction, producing taxa discovery rate tables. To enable it, set the perform_rarefaction flag in the config file. This subworkflow requires Krakefaction, which must be installed as follows:

Tools:

Instructions:

  1. Navigate to the directory containing the read files (names ending in .fastq.gz).
  2. Ensure you have a Kraken2-formatted database against which your samples will be classified, and that the path to the database is specified in config.yaml.
  3. Ensure there is a .tab file (e.g. samples_new.tab) that lists the filenames of all the read files.
  4. Clone the Snakemake pipeline into the current directory
    • git clone https://github.com/BeeCSI-Microbiome/taxonomic_profiling_pipeline.git
  5. Ensure the .tab file (containing the sample names) is specified in the config.yaml file
  6. Either update samples_new.tab to point to the raw data files (e.g. add ../ before all the file names), or copy the contents of the repository to the folder where the samples are, e.g. cp -r taxonomic_profiling_pipeline/* .
  7. During sequencing, the phiX bacteriophage is commonly included for quality control and calibration. The sequences of this phage's small genome must be removed from your samples. The pipeline does this by mapping your sample reads to an indexed copy of the phiX genome. You can retrieve a copy of the phiX genome from NCBI and index it with BWA. Copy the indexed phiX genome into the directory where Snakemake will be run.
  8. Activate the conda environment containing Snakemake
    • conda activate Snakemake
  9. Perform a dry run:
    • snakemake -nr
    • Green messages indicate that all is well; errors are shown in red
  10. Run the workflow: snakemake --cluster "qsub -V -cwd -pe smp {threads}" --use-conda -j <number_of_jobs> [--latency-wait <seconds>]
    • Replace <number_of_jobs> with the number of .fastq.gz files divided by 2
    • The --latency-wait option is optional. Rules sometimes raise a false error reporting that an output file has not been produced when it actually has. A wait of 60 seconds has prevented this error.
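
The job count suggested in step 10 can be computed rather than counted by hand. The following is a minimal sketch, assuming the read files sit directly in one directory and all end in .fastq.gz; count_jobs is a hypothetical helper name and is not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): print a suggested -j value,
# i.e. the number of .fastq.gz files in a directory divided by 2.
count_jobs() {
    local dir="${1:-.}"   # directory holding the reads; defaults to the current one
    local n
    # Count the read files directly inside $dir (subdirectories are not searched)
    n=$(find "$dir" -maxdepth 1 -name '*.fastq.gz' | wc -l)
    echo $(( n / 2 ))
}

# Example, commented out so that sourcing this file does not submit any jobs:
# snakemake --cluster "qsub -V -cwd -pe smp {threads}" --use-conda \
#     -j "$(count_jobs .)" --latency-wait 60
```

The division is integer division, so an odd number of files (for example, a missing reverse read) rounds down rather than raising an error.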

APPENDIX: Conda installation on Biocluster:

  1. Connect to the Biocluster server (VPN link)
  2. Run the following commands from your home directory:
  3. wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
  4. sh Miniconda3-latest-Linux-x86_64.sh
  5. You will be guided through a command line installation wizard:
    • Press <Enter> to continue
    • Press q to close the license or <Enter> to scroll through it until you've read it all
    • Type yes to accept the license, then press <Enter>
    • Prepending the miniconda install location is no longer the recommended way to enable Conda, so type no and press <Enter>
  6. At this point, Conda is installed in your home directory, but if you try running it now, you'll get an error. This is because you have not told the command line interpreter that you want Conda enabled in your path. You can do this by running the following command:
    • source ~/miniconda3/etc/profile.d/conda.sh
    • This command tells the command line interpreter to run the Conda source script, which enables the conda command for your environment. You will likely want to run this command every time you connect to the Biocluster, so you are encouraged to edit your ~/.bashrc file, and add that command to the end of the file.
  7. Installation is complete, but you should ensure that Conda is up to date. You can do this by running conda update conda
  8. Finally, configure some Conda channels by running the following commands:

    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
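
If you add the source command from step 6 to your ~/.bashrc, it may be worth guarding it so that a missing or relocated install does not produce an error at every login. The following is a minimal sketch, assuming the default ~/miniconda3 install location; enable_conda is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: source Conda's shell setup script if it exists.
# The optional argument is the install prefix (assumed default: ~/miniconda3).
enable_conda() {
    local prefix="${1:-$HOME/miniconda3}"
    local script="$prefix/etc/profile.d/conda.sh"
    if [ -f "$script" ]; then
        # Load the setup script into the current shell
        . "$script"
    else
        echo "conda.sh not found under $prefix" >&2
        return 1
    fi
}
```

Calling enable_conda with no arguments at the end of ~/.bashrc has the same effect as the source command above, but it prints a short message instead of failing when the script is not found.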


FAQ:

My sample file is not reading correctly:

1) Check that the file is formatted correctly: <Sample name><tab><sample-R1 filepath><tab><sample-R2 filepath>

- See `samples.tab` for an example. 
- Note that some editors replace tabs with spaces, which may be the cause of this error.
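
A quick way to spot lines where tabs were replaced by spaces is to count the tab-separated fields on each line. The following is a minimal sketch, assuming the three-column format described above; check_samples_tab is a hypothetical helper and is not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every non-empty line of a samples table that
# does not have exactly three tab-separated fields
# (sample name, R1 file path, R2 file path).
check_samples_tab() {
    awk -F'\t' '
        NF == 0 { next }    # skip completely empty lines
        NF != 3 {
            printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }   # non-zero exit status if any line failed
    ' "$1"
}

# Example: check_samples_tab samples.tab
```

A line whose columns are separated by spaces is reported as having one field, which is the typical symptom of an editor that converted the tabs.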

How do I add Krakefaction to my PATH?

1) Below is one of several ways to add Krakefaction to your PATH.

a) Install Krakefaction according to the guidelines above.

b) Find your `.bashrc` file (it should be in your home directory; contact your IT department if you cannot find it).

c) Add or append the following in parentheses to your `.bashrc` file (`export PATH="<absolute path to the directory in which krakefaction was installed>/krakefaction/bin:$PATH"`)
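
If you prefer to make this change from the command line, the line can be appended so that repeating the step does not add it twice. The following is a minimal sketch; append_once is a hypothetical helper, and the install path in the example is a placeholder that you would replace with your own.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append LINE to FILE unless an identical line is already there.
append_once() {
    local line="$1" file="$2"
    touch "$file"    # create the file if it does not exist yet
    # -x: match whole lines only, -F: treat the line as a fixed string, not a regex
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (placeholder path; commented out so that sourcing this file changes nothing):
# append_once 'export PATH="$HOME/tools/krakefaction/bin:$PATH"' "$HOME/.bashrc"
```

The single quotes in the example keep $PATH unexpanded, so it is evaluated each time `.bashrc` is read rather than once when the line is written.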