f-heeger / two_marker_metabarcoding

Pipeline for metabarcoding of fungi with more than one marker (5.8S and ITS2)
GNU Lesser General Public License v3.0

Two marker metabarcoding pipeline

Snakemake pipeline for analysis of metabarcoding data of fungi with more than one marker (5.8S and ITS2).

The pipeline (v1.1) is described in our paper Combining the 5.8S and ITS2 to improve classification of fungi.

The current version (v1.3.1) can be found on the release page. It differs from the version in the paper in that it uses Snakemake's Conda integration for better portability and easier installation. Additionally, input files are handled in a way that is less reliant on default Illumina file names and allows for more complicated experimental setups (e.g. the same sample in multiple runs). I also changed some external dependencies. See the release notes for further details.

There is also a pre-print available on BioRxiv describing version 1.0 of this pipeline: https://doi.org/10.1101/532358

Prerequisites

Software you need to install

You need to install a version of Conda, which will be used to automatically install all other software except Snakemake and Python. You can use Conda to install Snakemake (which will automatically install Python).

Software that will be installed automatically

The following tools will be installed automatically by Snakemake through Conda. Make sure that Conda is in your PATH when you run Snakemake.

Reference Data

Reference data will be downloaded and processed automatically.

Installation

Preparing your working directory

You can download the files directly into your working directory or clone the repository with git. Your working folder should contain the following files besides this readme:

Setting up your configuration file

The config file is in JSON format and consists of keys (or names) and values separated by a ":".

The configuration file gives the paths to all necessary data as well as some information about your run and additional configuration. You need to change the information about your run, the software paths and your e-mail address (for querying the NCBI database).

Your e-mail address:

To build the database, the pipeline will query the NCBI taxonomy database. NCBI requires automated API calls like this to also send an e-mail address. Your e-mail address will not be used for anything else and will not be sent anywhere except to NCBI.
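As a rough sketch, the relevant part of the config file could look like the following. Note that the key names shown here are only illustrative placeholders; use the keys that are actually present in the config file shipped with the pipeline.

```json
{
    "email": "your.name@example.com",
    "rawFolder": "raw_reads",
    "dbFolder": "databases"
}
```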

Run the pipeline with test data:

After setting your e-mail address, you can run the pipeline on the provided test data to check that it is working correctly. This will also download the necessary reference databases (see Reference Data below on how to set the database version) and install the needed software via Conda.

To do a test run, open a terminal, navigate to the pipeline working directory and type the following command: `snakemake --use-conda -s metabarcoding.snakemake.py`. After everything is done, you should see the message: `72 of 72 steps (100%) done`.

Information about your run:

Configuration of pipeline behavior

Reference Data

Setting up your samples file

The samples file is a table in tab-separated-values format providing information about your input files. Lines starting with a # will be ignored. Each row describes one file with the following columns:
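The comment and column conventions above can be sketched with a few lines of Python. The sample names, run names, and file names in this snippet are made up for illustration; the actual columns are the ones described in your samples file.

```python
# Minimal sketch of reading the tab-separated samples file,
# skipping lines that start with '#'. All values are invented.
samples_tsv = (
    "# this comment line is ignored\n"
    "sampleA\trun1\tsampleA_R1.fastq.gz\tsampleA_R2.fastq.gz\n"
    "sampleB\trun1\tsampleB_R1.fastq.gz\tsampleB_R2.fastq.gz\n"
)

rows = []
for line in samples_tsv.splitlines():
    if line.startswith("#"):  # comment lines are ignored
        continue
    rows.append(line.split("\t"))  # one input file per row

print(len(rows))  # 2 data rows
```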

Running the pipeline

The repository also contains a bash script called run.sh. I recommend using the above command unless you are familiar with bash scripting and conda. If you want to use it, you have to

All parameters you give to the run.sh script (e.g. -j 6) will be passed on to snakemake.

Output Files

The pipeline creates different folders for the intermediate files after each step and the final result files. The most important result is otu_table.tsv, which is created in the working directory. It contains the number of reads that were assigned to the OTUs in each sample as well as the taxonomic classification of the OTUs. Each row represents one OTU. The first column is the OTU ID; the second, third, and fourth columns are the taxonomic classifications by 5.8S, ITS2, and both combined, respectively. The following columns contain the number of reads per sample (see the header for the sample order).
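The layout described above can be parsed with plain Python. The OTU IDs, taxon strings, header labels, and sample names in this sketch are invented for illustration; only the column order (OTU ID, 5.8S, ITS2, combined, then one column of read counts per sample) follows the description.

```python
import csv
import io

# Invented miniature otu_table.tsv following the described column order.
otu_table = (
    "OTU\t5.8S\tITS2\tcombined\tsample1\tsample2\n"
    "otu1\tFungi;Asco\tFungi;Asco;Dothideo\tFungi;Asco;Dothideo\t10\t0\n"
    "otu2\tFungi\tunknown\tFungi\t3\t7\n"
)

reader = csv.reader(io.StringIO(otu_table), delimiter="\t")
header = next(reader)
samples = header[4:]  # sample order comes from the header

# Sum the reads assigned to all OTUs per sample.
total_per_sample = {s: 0 for s in samples}
for row in reader:
    otu_id, tax_58s, tax_its2, tax_combined = row[:4]
    for sample, count in zip(samples, row[4:]):
        total_per_sample[sample] += int(count)

print(total_per_sample)  # {'sample1': 13, 'sample2': 7}
```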

Other interesting outputs are the rarefaction plot (All.rarefactions.pdf), the plot of read numbers after each filtering step during the initial read processing (readNumbers/readNumbers.pdf) and the Krona plots of the read assignments per sample (krona/All.krona.html).

Hidden Features

There are some non-standard features that you will not need for a typical analysis, but that I will document here: