Eco-Flow / pollen-metabarcoding

nf-pollen-metabarcoding

A pipeline developed in collaboration with Exeter University

Installation

Nextflow pipelines require a few prerequisites. Further documentation on how to install Nextflow is available on the nf-core webpage.

Prerequisites

Install

To install the pipeline, run one of the following commands, replacing VERSION with a release tag.

wget https://github.com/Eco-Flow/pollen-metabarcoding/archive/refs/tags/VERSION.tar.gz -O - | tar -xzvf -

or

curl -L https://github.com/Eco-Flow/pollen-metabarcoding/archive/refs/tags/VERSION.tar.gz --output - | tar -xzvf -

This creates a directory called pollen-metabarcoding-VERSION in the current directory, which contains the pipeline.

Inputs

Required

Optional

Module parameters (check software docs for more details)

AWS parameters (ensure these match the infrastructure you have access to if using AWS)

Creating an input sample sheet

To create an input sample sheet in the correct format, you can use the Python script fastq_dir_to_samplesheet.py.

This has been edited from an nf-core rnaseq script.

You can use it when all your fastq files are in a single folder and end with _R1_001.fastq.gz and/or _R2_001.fastq.gz.

Usage:

python3 fastq_dir_to_samplesheet.py /path/to/fastq/files Input.csv
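As an illustration of the pairing logic such a script relies on, here is a minimal sketch that groups files by the _R1_001.fastq.gz / _R2_001.fastq.gz suffixes and writes one row per sample. It is not the script shipped with the pipeline: the function names and the column names (sample, fastq_1, fastq_2) are assumptions made for the example, so check the real script's output before relying on them.

```python
import csv
import re
from pathlib import Path

# Suffix convention taken from the text above.
READ_SUFFIX = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])_001\.fastq\.gz$")


def pair_fastqs(filenames):
    """Group fastq file names by sample, keyed on the _R1/_R2 suffix."""
    samples = {}
    for name in sorted(filenames):
        match = READ_SUFFIX.match(name)
        if match is None:
            continue  # ignore files that do not follow the naming convention
        reads = samples.setdefault(match["sample"], {})
        reads[match["read"]] = name
    return samples


def write_samplesheet(fastq_dir, out_csv):
    """Write one row per sample; the column names are hypothetical."""
    fastq_dir = Path(fastq_dir)
    samples = pair_fastqs(p.name for p in fastq_dir.iterdir() if p.is_file())
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])  # assumed header
        for sample, reads in samples.items():
            # Leave a column empty when only one read file exists.
            r1 = str(fastq_dir / reads["1"]) if "1" in reads else ""
            r2 = str(fastq_dir / reads["2"]) if "2" in reads else ""
            writer.writerow([sample, r1, r2])
```

Calling write_samplesheet("/path/to/fastq/files", "Input.csv") mirrors the command-line usage shown above. Files that do not match the suffix convention are skipped rather than reported.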

Results

Once completed, your output directory should be called results (unless you specified another name) and should contain the following directory structure:

results
├── cut_tsvs
├── cutadapt
│   ├── fastqs
│   └── logs
├── pear
│   ├── assembled
│   ├── discarded
│   └── unassembled
├── pipeline_info
│   ├── co2_emissions
│   │   ├── co2footprint_report.html
│   │   ├── co2footprint_summary.html
│   │   └── co2footprint_trace.txt
│   ├── execution_report.html
│   ├── execution_timeline.html
│   ├── execution_trace.txt
│   ├── pipeline_dag.html
│   └── software_versions.yml
├── r-processing
│   └── sample
│       ├── classified.tsv
│       ├── pie_charts
│       │   ├── family.pdf
│       │   ├── genus.pdf
│       │   └── order.pdf
│       └── summary.tsv
├── sratools_fasterq-dump
│   └── sample
├── usearch
│   └── sintax_summary
│       └── sample
│           ├── class_summary.txt
│           ├── domain_summary.txt
│           ├── family_summary.txt
│           ├── genus_summary.txt
│           ├── kingdom_summary.txt
│           ├── order_summary.txt
│           ├── phylum_summary.txt
│           └── species_summary.txt
└── vsearch
    ├── derep
    │   ├── clusterings
    │   ├── fastas
    │   └── logs
    ├── fastq_filter
    │   ├── fastas
    │   └── logs
    └── sintax

cut_tsvs - directory containing tsvs of the first two columns of the sintax output.

cutadapt

  1. fastqs - directory containing adapter-trimmed fastq files for each sample.
  2. logs - directory containing cutadapt trimming statistics for each sample.

pear

  1. assembled - directory containing fastqs of successfully merged reads for each sample.
  2. discarded - directory containing fastqs of reads discarded due to quality for each sample.
  3. unassembled - directory containing fastqs of reads unable to be merged for each sample.

pipeline_info - directory containing pipeline statistics, including CO2 emissions.

r-processing

  1. classified.tsv - tsv containing taxonomy prediction information.
  2. pie_charts - pdfs of the top predicted taxa at different taxonomic levels (order, family and genus).
  3. summary.tsv - tsv containing summary statistics.

sratools_fasterq-dump - fastqs obtained from SRA ID.

usearch - text files containing the name, number of reads, percentage of reads and cumulative percentage of reads for each taxonomic level.

vsearch

  1. derep
    • clusterings - directory containing dereplicated clusterings for each sample.
    • fastas - directory containing dereplicated fastas for each sample.
    • logs - directory containing vsearch dereplicate statistics for each sample.
  2. fastq_filter
    • fastas - directory containing filtered fastas for each sample.
    • logs - directory containing vsearch fastq_filter statistics for each sample.
  3. sintax - directory containing vsearch sintax taxonomy prediction output files.
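To confirm that a finished run produced the layout described above, a small check along these lines can help. This is a sketch, not part of the pipeline: the function name missing_outputs is hypothetical, the directory list is copied from the tree above, and sratools_fasterq-dump is left out on the assumption that it only appears when input comes from SRA IDs.

```python
from pathlib import Path

# Directories taken from the tree shown above.
# sratools_fasterq-dump is omitted (assumed to exist only for SRA input).
EXPECTED_DIRS = [
    "cut_tsvs",
    "cutadapt/fastqs",
    "cutadapt/logs",
    "pear/assembled",
    "pear/discarded",
    "pear/unassembled",
    "pipeline_info",
    "r-processing",
    "usearch/sintax_summary",
    "vsearch/derep",
    "vsearch/fastq_filter",
    "vsearch/sintax",
]


def missing_outputs(results_dir="results"):
    """Return the expected sub-directories that are absent from results_dir."""
    root = Path(results_dir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty list from missing_outputs("results") means every listed directory is present. It says nothing about whether the files inside are complete.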

Configuration

Basics

The basic configuration of processes using labels can be found in conf/base.config.

Module specific configuration using process names can be found in conf/modules.config.

Please note: The nf-core CUTADAPT module is labelled as process_medium in the module main.nf. However, for pollen metabarcoding data the fastqs are significantly smaller, so this resource requirement has been overridden in conf/modules.config to match the process_single resource requirements.

Profiles

This pipeline is designed to run in various modes that can be supplied as a comma-separated list, e.g. -profile profile1,profile2.

Container Profiles

Please select one of the following profiles when running the pipeline.

Optional Profiles

Custom Configuration

If you want to run this pipeline on your institute's on-premise HPC or specific cloud infrastructure then please contact us and we will help you build and test a custom config file. This config file will be published to our configs repository.

Running the Pipeline

Please note: The -resume flag uses previously cached successful runs of the pipeline.

(The example database was obtained from molbiodiv/meta-barcoding-dual-indexing).

The database should be a list of fasta sequences in which each header contains kingdom (k), phylum (p), class (c), order (o), family (f), genus (g) and species (s) identifiers, separated by commas. If your database does not contain all of these identifiers, the pipeline will fail. We currently have a branch called 'kingdom_fix' that will work with k (kingdom). To use this, clone the repo with the --branch kingdom_fix flag.
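Because a database with incomplete headers causes the pipeline to fail, it can be worth checking the headers before a run. The sketch below is one way to do that and is not part of the pipeline. It assumes each rank is written as a letter, a colon and a name (for example k:Viridiplantae), as in SINTAX-style annotations. The function names and the example header are hypothetical.

```python
import re

REQUIRED_RANKS = ("k", "p", "c", "o", "f", "g", "s")

# Assumed annotation style: rank letter, colon, name (e.g. "k:Viridiplantae"),
# preceded by the start of the line or one of , ; = or whitespace.
RANK_PATTERN = re.compile(r"(?:^|[,;=\s])([a-z]):[^,;\s]+")


def missing_ranks(header):
    """Return the required rank letters not found in one fasta header line."""
    found = set(RANK_PATTERN.findall(header))
    return [rank for rank in REQUIRED_RANKS if rank not in found]


def check_database(fasta_lines):
    """Map each incomplete header to the ranks it lacks."""
    problems = {}
    for line in fasta_lines:
        if line.startswith(">"):
            header = line.strip()
            missing = missing_ranks(header)
            if missing:
                problems[header] = missing
    return problems


# Made-up header for illustration only.
example = ">seq1;tax=k:Viridiplantae,f:Rosaceae,g:Rosa;"
print(missing_ranks(example))  # ['p', 'c', 'o', 's']
```

Passing an open file handle to check_database works as well as a list of lines, since only lines starting with ">" are inspected. An empty dictionary means every header carries all seven ranks.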

Test Data

The data used to test this pipeline is available under the ENA ID PRJEB26439. There are two test profiles using this data:

test_small - contains 3 samples for small, fast testing.

test_full - contains 47 samples (the entire dataset) for large, real-world replication testing.

Contact Us

If you need any support do not hesitate to contact us at any of:

simon.murray [at] ucl.ac.uk

c.wyatt [at] ucl.ac.uk

ecoflow.ucl [at] gmail.com