Eco-Flow / pollen-metabarcoding

A pipeline developed in collaboration with Exeter University


Nextflow pipelines require a few prerequisites. Further documentation on how to install Nextflow is available on the nf-core webpage here.



To install the pipeline please use the following commands but replace VERSION with a release.

wget -O - | tar -xvf -


curl -L --output - | tar -xvf -

This will produce a directory in the current directory called pollen-metabarcoding-VERSION which contains the pipeline.




Module parameters (check software docs for more details)

AWS parameters (ensure these match the infrastructure you have access to if using AWS)

Creating an input sample sheet

To create an input sample sheet in the correct format, you can use the Python script here.

This has been edited from an nf-core rnaseq script.

You can use it when all your fastq files are in a single folder and end with _R1_001.fastq.gz and/or _R2_001.fastq.gz.


python3 /path/to/fastq/files Input.csv
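
The pairing logic described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual script: the function name `build_sample_sheet` and the column names `sample`, `fastq_1` and `fastq_2` are assumptions borrowed from the nf-core rnaseq sample sheet convention, so check them against the format the pipeline expects.

```python
import csv
from pathlib import Path

R1_SUFFIX = "_R1_001.fastq.gz"
R2_SUFFIX = "_R2_001.fastq.gz"


def build_sample_sheet(fastq_dir, output_csv):
    """Pair R1/R2 files in fastq_dir by sample name and write them to output_csv.

    The column names (sample, fastq_1, fastq_2) are an assumption and are
    not confirmed by the pipeline documentation.
    """
    fastq_dir = Path(fastq_dir)
    samples = {}
    for path in sorted(fastq_dir.iterdir()):
        for suffix, column in ((R1_SUFFIX, "fastq_1"), (R2_SUFFIX, "fastq_2")):
            if path.name.endswith(suffix):
                # The sample name is the file name with the read suffix removed.
                sample = path.name[: -len(suffix)]
                samples.setdefault(sample, {"fastq_1": "", "fastq_2": ""})
                samples[sample][column] = str(path.resolve())

    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        for sample in sorted(samples):
            row = samples[sample]
            writer.writerow([sample, row["fastq_1"], row["fastq_2"]])
    return samples
```

Samples with only an R1 or only an R2 file keep an empty value in the other column, which mirrors the "and/or" wording above.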


Once completed, your output directory should be called results (unless you specified another name) and should contain the following directory structure:

├── cut_tsvs
├── cutadapt
│   ├── fastqs
│   └── logs
├── pear
│   ├── assembled
│   ├── discarded
│   └── unassembled
├── pipeline_info
│   ├── co2_emissions
│   │   ├── co2footprint_report.html
│   │   ├── co2footprint_summary.html
│   │   └── co2footprint_trace.txt
│   ├── execution_report.html
│   ├── execution_timeline.html
│   ├── execution_trace.txt
│   ├── pipeline_dag.html
│   └── software_versions.yml
├── r-processing
│   └── sample
│       ├── classified.tsv
│       ├── pie_charts
│       │   ├── family.pdf
│       │   ├── genus.pdf
│       │   └── order.pdf
│       └── summary.tsv
├── sratools_fasterq-dump
│   └── sample
├── usearch
│   └── sintax_summary
│       └── sample
│           ├── class_summary.txt
│           ├── domain_summary.txt
│           ├── family_summary.txt
│           ├── genus_summary.txt
│           ├── kingdom_summary.txt
│           ├── order_summary.txt
│           ├── phylum_summary.txt
│           └── species_summary.txt
└── vsearch
    ├── derep
    │   ├── clusterings
    │   ├── fastas
    │   └── logs
    ├── fastq_filter
    │   ├── fastas
    │   └── logs
    └── sintax

cut_tsvs - directory containing tsvs of first 2 columns of sintax data


cutadapt

  1. fastqs - directory containing adapter-trimmed fastq files for each sample.
  2. logs - directory containing cutadapt trimming statistics for each sample.


pear

  1. assembled - directory containing fastqs of successfully merged reads for each sample.
  2. discarded - directory containing fastqs of reads discarded due to quality for each sample.
  3. unassembled - directory containing fastqs of reads that could not be merged for each sample.

pipeline_info - directory containing pipeline statistics including co2 emissions.


r-processing

  1. classified.tsv - tsv containing taxonomy prediction information.
  2. pie_charts - pdfs of the top predicted taxa at different taxonomic levels (order, family and genus).
  3. summary.tsv - tsv containing summary statistics.

sratools_fasterq-dump - fastqs obtained from SRA ID.

usearch - text files containing the name, number of reads, percentage of reads and cumulative percentage of reads for each taxonomic level.
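
To make these four columns concrete, the sketch below computes them from per-taxon read counts. It only illustrates the arithmetic: the function name `summarise_counts`, the sort order (most abundant first) and the example counts are assumptions, not taken from the usearch output itself.

```python
def summarise_counts(counts):
    """Return (name, reads, percent, cumulative_percent) rows, most abundant first."""
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    # Sort by descending read count, breaking ties alphabetically by name.
    for name, reads in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        percent = 100.0 * reads / total if total else 0.0
        cumulative += percent
        rows.append((name, reads, round(percent, 2), round(cumulative, 2)))
    return rows


# Hypothetical family-level read counts for one sample.
for row in summarise_counts({"Asteraceae": 60, "Rosaceae": 30, "Fabaceae": 10}):
    print(*row, sep="\t")
```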


vsearch

  1. derep
    • clusterings - directory containing dereplicated clusterings for each sample.
    • fastas - directory containing dereplicated fastas for each sample.
    • logs - directory containing vsearch dereplicate statistics for each sample.
  2. fastq_filter
    • fastas - directory containing filtered fastas for each sample.
    • logs - directory containing vsearch fastq_filter statistics for each sample.
  3. sintax - directory containing vsearch sintax taxonomy prediction output files.



The basic configuration of processes using labels can be found in conf/base.config.

Module specific configuration using process names can be found in conf/modules.config.

Please note: The nf-core CUTADAPT module is labelled as process_medium in the module. However, for pollen metabarcoding data the fastqs are significantly smaller, so this resource requirement has been overridden inside conf/modules.config to match the process_single resource requirements.


This pipeline is designed to run in various modes that can be supplied as a comma-separated list, e.g. -profile profile1,profile2.

Container Profiles

Please select one of the following profiles when running the pipeline.

Optional Profiles

Custom Configuration

If you want to run this pipeline on your institute's on-premise HPC or specific cloud infrastructure then please contact us and we will help you build and test a custom config file. This config file will be published to our configs repository.

Running the Pipeline

Please note: The -resume flag uses previously cached successful runs of the pipeline.

(The example database was obtained from molbiodiv/meta-barcoding-dual-indexing).

The database should be a list of fasta sequences, where each header contains kingdom (k), phylum (p), class (c), order (o), family (f), genus (g) and species (s) identifiers, separated by commas. If your database does not contain all of these definitions, the pipeline will fail. We currently have a branch called 'kingdom_fix' that will work with k (kingdom). To use this, clone the repo with the --branch kingdom_fix flag.
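
Because a database with missing ranks makes the pipeline fail, it can help to check the headers before a run. The sketch below is one way to do that, under stated assumptions: it assumes each header carries a SINTAX-style `tax=` annotation made of comma-separated `rank:name` pairs, and the names `missing_ranks` and `check_database` are hypothetical helpers that are not part of the pipeline.

```python
import re

REQUIRED_RANKS = ("k", "p", "c", "o", "f", "g", "s")


def missing_ranks(header):
    """Return the required rank letters that are absent from a fasta header.

    Assumes a 'tax=' annotation of comma-separated 'rank:name' pairs.
    """
    match = re.search(r"tax=([^;]*)", header)
    annotation = match.group(1) if match else ""
    present = {
        field.split(":", 1)[0].strip()
        for field in annotation.split(",")
        if ":" in field
    }
    return [rank for rank in REQUIRED_RANKS if rank not in present]


def check_database(fasta_path):
    """Map each incomplete header in a fasta file to its missing ranks."""
    problems = {}
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line.strip()
                missing = missing_ranks(header)
                if missing:
                    problems[header] = missing
    return problems
```

An empty dictionary from `check_database` means every header lists all seven ranks under the assumed layout. It does not confirm that the rest of the header matches what the pipeline expects.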

Test Data

The data used to test this pipeline are available via the ENA ID: PRJEB26439. There are two test profiles using this data:

  1. test_small - contains 3 samples for small, fast testing.
  2. test_full - contains 47 samples (the entire dataset) for large, real-world replication testing.

Contact Us

If you need any support do not hesitate to contact us at any of:

simon.murray [at]

c.wyatt [at]

ecoflow.ucl [at]