BeatsonLab-MicrobialGenomics / micropipe

A pipeline for high-quality bacterial genome construction using ONT sequencing
GNU General Public License v3.0

microPIPE: a pipeline for high-quality bacterial genome construction using ONT and Illumina sequencing

Description

microPIPE was developed to automate high-quality complete bacterial genome assembly using Oxford Nanopore Sequencing in combination with Illumina sequencing.

To build microPIPE we evaluated the performance of several tools at each step of bacterial genome assembly, including basecalling, assembly, and polishing. Results at each step were validated using the high-quality ST131 Escherichia coli strain EC958 (GenBank: HG941718.1). After appraisal of each step, we selected the best combination of tools to achieve the most consistent and best quality bacterial genome assemblies.

Please note that this pipeline does not perform extensive quality assessment of the input sequencing data. Contamination and sequencing read quality should be assessed independently to avoid problems with assembly.

microPIPE is written in Nextflow and uses Singularity containers. It can use both GPU and CPU resources.

For more information please see our publication here: https://doi.org/10.1186/s12864-021-07767-z.


Diagram

The diagram below summarises the different steps of the pipeline (with each selected tool) and the approximate run time (using GPU basecalling, averaged over 12 E. coli isolates sequenced on an R9.4 MinION flow cell). Dashed boxes correspond to optional steps in the pipeline.

[Figure: microPIPE workflow diagram]


User guide

Quick start guide

  1. Basecalling, demultiplexing and assembly workflow

nextflow main.nf --basecalling --demultiplexing --samplesheet /path/to/samples.csv --fast5 /path/to/fast5/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

  2. Demultiplexing and assembly workflow (basecalling already complete)

nextflow main.nf --demultiplexing --samplesheet /path/to/samples.csv --fastq /path/to/fastq/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

  3. Assembly-only workflow (basecalling and demultiplexing already complete)

nextflow main.nf --samplesheet /path/to/samples.csv --fastq /path/to/fastq/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

An infrastructure-specific guide for Zeus @ Pawsey Supercomputing Centre (Perth, Western Australia) is provided here.

Step by step user guide

0. Requirements

microPIPE has been built using Nextflow and Singularity to enable ease of use and installation across different platforms.

Nextflow can be used on any POSIX compatible system (Linux, OS X, etc). It requires Bash 3.2 (or later) and Java 8 (or later, up to 15) to be installed.

To install Nextflow, run the command:

wget -qO- https://get.nextflow.io | bash

or

curl -s https://get.nextflow.io | bash

It will create the nextflow main executable file in the current directory. Optionally, move the nextflow file to a directory accessible by your $PATH variable.

Due to the Oxford Nanopore Technologies terms and conditions, we are not allowed to redistribute the Guppy software either in its binary form or packaged form e.g. Docker or Singularity images. Therefore users will have to either install Guppy, provide a container image or start the pipeline from the basecalled fastq files.
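
Before launching the pipeline it can help to confirm that the executables named above are on your $PATH. The following is a minimal Python sketch of such a check, not part of microPIPE itself. The helper name `missing_tools` and the executable name `singularity` are assumptions for illustration; Guppy is left out because it is not needed when starting from basecalled fastq files.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executables named in the requirements above. The executable name
# "singularity" is an assumption; adjust it to match your installation.
REQUIRED_TOOLS = ("bash", "java", "nextflow", "singularity")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the entries of `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```

The check only looks for the executables; it does not verify the Bash or Java versions.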

1. Installing microPIPE

Download the microPIPE repository using the command:

git clone https://github.com/BeatsonLab-MicrobialGenomics/micropipe.git

microPIPE requires the files main.nf, nextflow.config and a samplesheet file to run.

2. Prepare the Nextflow configuration file

When a Nextflow pipeline script is launched, Nextflow looks for a file named nextflow.config in the current directory. The configuration file defines default parameter values for the pipeline and cluster settings such as the executor (e.g. "slurm", "local") and queues to be used (https://www.nextflow.io/docs/latest/config.html).

The pipeline uses a separate Singularity container for each process. Nextflow will automatically pull the Singularity images required to run the pipeline and cache them, by default in the singularity directory in the pipeline work directory, or in the singularity.cacheDir specified in the nextflow.config file:

singularity {
  enabled = true
  // directory where pulled Singularity images are cached
  cacheDir = '/path/to/cachedir'
}

The nextflow.config file should be modified to specify the location of Guppy using one of the following options:

An example configuration file can be found in this repository.

Two versions of the configuration file are available and correspond to microPIPE v0.8 (utilizing Guppy v3.4.3) and v0.9 (utilizing Guppy v3.6.1), as referenced in the paper.

3. Prepare the samplesheet file (csv)

The samplesheet file (comma-separated values) defines the input fastq files (Illumina [short] and Nanopore [long], and their directory path), barcode number, sample ID, and the estimated genome size (for Flye assembly). The header line should match the header line in the examples below:

  1. If using demultiplexing:
barcode_id,sample_id,short_fastq_1,short_fastq_2,genome_size
barcode01,S24,S24EC.filtered_1P.fastq.gz,S24EC.filtered_2P.fastq.gz,5.5m
barcode02,S34,S34EC.filtered_1P.fastq.gz,S34EC.filtered_2P.fastq.gz,5.5m
  2. If not using demultiplexing (single isolate):
sample_id,short_fastq_1,short_fastq_2,genome_size
S24,S24EC.filtered_1P.fastq.gz,S24EC.filtered_2P.fastq.gz,5.5m
  3. If using assembly only:
barcode_id,sample_id,long_fastq,short_fastq_1,short_fastq_2,genome_size
barcode01,S24,barcode01.fastq.gz,S24EC.filtered_1P.fastq.gz,S24EC.filtered_2P.fastq.gz,5.5m
barcode02,S34,barcode02.fastq.gz,S34EC.filtered_1P.fastq.gz,S34EC.filtered_2P.fastq.gz,5.5m
  4. If Illumina reads are not available (--skip_illumina), do not include the two columns with the Illumina files:
barcode_id,sample_id,long_fastq,genome_size
barcode01,S24,barcode01.fastq.gz,5.5m
barcode02,S34,barcode02.fastq.gz,5.5m
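
Because the header line has to match one of the examples above, a quick check before launching a run can catch typos early. The following Python sketch is an illustration only and is not shipped with microPIPE. The function name `check_samplesheet` and the layout labels are hypothetical, and it recognises only the four header layouts shown above (other valid combinations, if any, would be reported as unrecognised).

```python
import csv
import io

# Header layouts copied from the four samplesheet examples above.
# The dictionary keys are labels chosen for this sketch.
KNOWN_HEADERS = {
    "demultiplexing": ["barcode_id", "sample_id", "short_fastq_1", "short_fastq_2", "genome_size"],
    "single isolate": ["sample_id", "short_fastq_1", "short_fastq_2", "genome_size"],
    "assembly only": ["barcode_id", "sample_id", "long_fastq", "short_fastq_1", "short_fastq_2", "genome_size"],
    "assembly only, no Illumina": ["barcode_id", "sample_id", "long_fastq", "genome_size"],
}


def check_samplesheet(text):
    """Return (layout_name, problems) for samplesheet content given as a string."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return None, ["samplesheet is empty"]
    header = [cell.strip() for cell in rows[0]]
    layout = next((name for name, cols in KNOWN_HEADERS.items() if cols == header), None)
    if layout is None:
        return None, ["unrecognised header: " + ",".join(header)]
    problems = []
    # Data rows are numbered from 1, not counting the header or blank lines.
    for row_no, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            problems.append(f"data row {row_no}: expected {len(header)} columns, found {len(row)}")
        elif any(not cell.strip() for cell in row):
            problems.append(f"data row {row_no}: empty field")
    return layout, problems
```

For example, `check_samplesheet(open("samples.csv").read())` returns the matched layout and a list of problems, which is empty when every row has the expected number of non-empty fields. The sketch does not check that the listed fastq files exist.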

4. Run the pipeline

The pipeline can be used to run:

The entire workflow from basecalling to polishing will be run. The input files will be the ONT fast5 files and the Illumina fastq files.

nextflow main.nf --basecalling --demultiplexing --samplesheet /path/to/samples.csv --fast5 /path/to/fast5/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

--samplesheet: path to the samplesheet file
--basecalling: flag to run the basecalling step
--demultiplexing: flag to run the demultiplexing step
--fast5: path to the directory containing the ONT fast5 files
--outdir: path to the output directory to be created
--datadir: path to the directory containing the Illumina fastq files
--guppy_config_gpu: Guppy configuration file name for basecalling using GPU resources (default=dna_r9.4.1_450bps_hac.cfg suitable if the Flow Cell Type = FLO-MIN106 and Kit = SQK-RBK004)
--guppy_config_cpu: Guppy configuration file name for basecalling using CPU resources (default=dna_r9.4.1_450bps_fast.cfg)
--medaka_model: name of the Medaka model (default=r941_min_high, available models: r941_min_fast, r941_min_high, r941_prom_fast, r941_prom_high, r10_min_high, r941_min_diploid_snp), see [details](https://github.com/nanoporetech/medaka#models)

NOTE: to use GPU resources for basecalling and demultiplexing, use the --gpu flag.

Example of samplesheet file:

barcode_id,sample_id,short_fastq_1,short_fastq_2,genome_size
barcode01,S24,S24EC.filtered_1P.fastq.gz,S24EC.filtered_2P.fastq.gz,5.5m
barcode02,S34,S34EC.filtered_1P.fastq.gz,S34EC.filtered_2P.fastq.gz,5.5m

The entire workflow from basecalling to polishing will be run (excluding demultiplexing). The input files will be the ONT fast5 files and the Illumina fastq files.

nextflow main.nf --basecalling --samplesheet /path/to/samples.csv --fast5 /path/to/fast5/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

--samplesheet: path to the samplesheet file
--basecalling: flag to run the basecalling step
--fast5: path to the directory containing the ONT fast5 files
--outdir: path to the output directory to be created
--datadir: path to the directory containing the Illumina fastq files
--guppy_config_gpu: Guppy configuration file name for basecalling using GPU resources (default=dna_r9.4.1_450bps_hac.cfg suitable if the Flow Cell Type = FLO-MIN106 and Kit = SQK-LSK109)
--guppy_config_cpu: Guppy configuration file name for basecalling using CPU resources (default=dna_r9.4.1_450bps_fast.cfg)
--medaka_model: name of the Medaka model (default=r941_min_high, Available models: r941_min_fast, r941_min_high, r941_prom_fast, r941_prom_high, r10_min_high, r941_min_diploid_snp), see [details](https://github.com/nanoporetech/medaka#models)

Example of samplesheet file (containing only one sample):

sample_id,short_fastq_1,short_fastq_2,genome_size
S24,S24EC.filtered_1P.fastq.gz,S24EC.filtered_2P.fastq.gz,5.5m

The entire workflow from demultiplexing to polishing will be run. The input files will be the ONT fastq files and the Illumina fastq files.

nextflow main.nf --demultiplexing --samplesheet /path/to/samples.csv --fastq /path/to/fastq/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

--samplesheet: path to the samplesheet file
--demultiplexing: flag to run the demultiplexing step
--fastq: path to the directory containing the ONT fastq files (gzip compressed)
--outdir: path to the output directory to be created
--datadir: path to the directory containing the Illumina fastq files
--guppy_config_gpu: Guppy configuration file name for basecalling using GPU resources (default=dna_r9.4.1_450bps_hac.cfg suitable if the Flow Cell Type = FLO-MIN106 and Kit = SQK-LSK109)
--guppy_config_cpu: Guppy configuration file name for basecalling using CPU resources (default=dna_r9.4.1_450bps_fast.cfg)
--medaka_model: name of the Medaka model (default=r941_min_high, available models: r941_min_fast, r941_min_high, r941_prom_fast, r941_prom_high, r10_min_high, r941_min_diploid_snp), see [details](https://github.com/nanoporetech/medaka#models)

Example of samplesheet file:

barcode_id,sample_id,short_fastq_1,short_fastq_2,genome_size
barcode01,S24,S24EC.filtered_1P.fastq.gz,S24EC.filtered_2P.fastq.gz,5.5m
barcode02,S34,S34EC.filtered_1P.fastq.gz,S34EC.filtered_2P.fastq.gz,5.5m

The assembly workflow from adapter trimming to polishing will be run. The input files will be the ONT fastq files and the Illumina fastq files.

nextflow main.nf --samplesheet /path/to/samples.csv --fastq /path/to/fastq/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

--samplesheet: path to the samplesheet file
--fastq: path to the directory containing the ONT fastq files (gzip compressed)
--outdir: path to the output directory to be created
--datadir: path to the directory containing the Illumina fastq files
--medaka_model: name of the Medaka model (default=r941_min_high, Available models: r941_min_fast, r941_min_high, r941_prom_fast, r941_prom_high, r10_min_high, r941_min_diploid_snp), see [details](https://github.com/nanoporetech/medaka#models)

Example of samplesheet file:

barcode_id,sample_id,long_fastq,short_fastq_1,short_fastq_2,genome_size
barcode01,S24,barcode01.fastq.gz,S24EC.filtered_1P.fastq.gz,S24EC.filtered_2P.fastq.gz,5.5m
barcode02,S34,barcode02.fastq.gz,S34EC.filtered_1P.fastq.gz,S34EC.filtered_2P.fastq.gz,5.5m
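
The four invocations above differ only in whether --basecalling and --demultiplexing are set and whether the ONT input is given with --fast5 or --fastq. The Python sketch below assembles the matching command line from those choices. It is an illustration, not part of microPIPE: the function name `build_command` is hypothetical, and the rule that --basecalling requires a fast5 directory is an assumption drawn from the examples rather than from the pipeline code.

```python
import shlex


def build_command(samplesheet, outdir, datadir=None, fast5=None, fastq=None,
                  basecalling=False, demultiplexing=False, gpu=False):
    """Assemble a microPIPE command line as a list of arguments."""
    # Assumption: basecalling starts from fast5 files, as in the examples above.
    if basecalling and fast5 is None:
        raise ValueError("--basecalling needs a fast5 directory")
    cmd = ["nextflow", "main.nf"]
    if gpu:
        cmd += ["--gpu", "true"]
    if basecalling:
        cmd.append("--basecalling")
    if demultiplexing:
        cmd.append("--demultiplexing")
    cmd += ["--samplesheet", samplesheet]
    if basecalling:
        cmd += ["--fast5", fast5]
    elif fastq is not None:
        cmd += ["--fastq", fastq]
    if datadir is not None:
        cmd += ["--datadir", datadir]
    cmd += ["--outdir", outdir]
    return cmd


if __name__ == "__main__":
    # Assembly-only workflow, using the placeholder paths from this guide.
    print(shlex.join(build_command(
        "/path/to/samples.csv", "/path/to/outdir/",
        datadir="/path/to/datadir/", fastq="/path/to/fastq/directory/",
    )))
```

Returning a list rather than a string means the result can be passed straight to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.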

Optional parameters

Some parameters can be added to the command line in order to include or skip some steps and modify some parameters:

Basecalling:

Quality control:

Demultiplexing:

Adapter trimming:

Filtering:

Assembly:

Polishing:

Assembly evaluation:

Structure of the output folders

The pipeline will create several folders corresponding to the different steps of the pipeline. The main output folder (--outdir) will contain the following folders:

Each sample folder will contain the following folders:

Example data

To test the pipeline, we have provided some test data. In this directory you will find:

| File | Description |
|---|---|
| S24EC_1P_test.fastq.gz | Illumina reads 1st pair |
| S24EC_2P_test.fastq.gz | Illumina reads 2nd pair |
| barcode01.fastq.gz | ONT fastq reads |
| samples_1.csv | sample sheet for running assembly-only pipeline |

To test the assembly-only pipeline, edit the samples_1.csv samplesheet to point to the correct test files. Then run:

nextflow main.nf --samplesheet /path/to/samples_1.csv --outdir /path/to/test_outdir/

Infrastructure usage and recommendations

General recommendations for using microPIPE

When using microPIPE to basecall and demultiplex Oxford Nanopore data, we recommend using GPU resources. With GPUs, basecalling is performed using the high-accuracy model (instead of the fast model) and the workflow completes faster than with CPU resources alone.

To use GPU resources for basecalling and demultiplexing, use the --gpu flag in the main nextflow command:

nextflow main.nf --gpu true --basecalling --demultiplexing --samplesheet /path/to/samples.csv --fast5 /path/to/fast5/directory/ --datadir /path/to/datadir/ --outdir /path/to/outdir/

Compute resource usage across tested infrastructures

The table below summarises the basecalling run time depending on the resources used at the Pawsey Supercomputing Centre.

| Resources (Cluster) | Basecalling model | Guppy configuration file | Run time |
|---|---|---|---|
| GPU (Pawsey Topaz) | high-accuracy | dna_r9.4.1_450bps_hac.cfg | 10h 17m 17s |
| CPU (Pawsey Zeus) | fast | dna_r9.4.1_450bps_fast.cfg | 3d 19h 21m 31s |
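
To put the two run times on a common scale, they can be converted to seconds and compared. The short Python sketch below does this arithmetic; the helper name `to_seconds` is hypothetical. Note that the two runs used different basecalling models (high-accuracy on GPU, fast on CPU), so the ratio is not a like-for-like hardware comparison.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(duration):
    """Convert a string such as '3d 19h 21m 31s' to a number of seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", duration))


gpu_seconds = to_seconds("10h 17m 17s")     # 37037
cpu_seconds = to_seconds("3d 19h 21m 31s")  # 328891
print(gpu_seconds, cpu_seconds, round(cpu_seconds / gpu_seconds, 1))
# prints: 37037 328891 8.9
```

In this benchmark the CPU run therefore took roughly nine times as long as the GPU run, despite using the faster, less accurate model.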

Benchmarking

Summary

Exemplar 1: Assembly of 12 E. coli ST131 samples using GPU and CPU resources @ Pawsey

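| Strain | Chromosome/plasmid | Size (bp) | Circularised? |
|---|---|---|---|
| S24EC | Chromosome | 5078304 | Yes |
| S24EC | Plasmid A | 114708 | Yes |
| S34EC | Chromosome | 5050427 | Yes |
| S34EC | Plasmid A | 153321 | Yes |
| S34EC | Plasmid B | 108135 | Yes |
| S37EC | Chromosome | 4981928 | Yes |
| S37EC | Plasmid A | 157642 | Yes |
| S37EC | Plasmid B | 61072 | Yes |
| S39EC | Chromosome | 5054402 | Yes |
| S39EC | Plasmid A | 141007 | Yes |
| S39EC | Plasmid B | 94979 | Yes |
| S39EC | Plasmid C | 68049 | Yes |
| S39EC | Plasmid D | 62085 | Yes |
| S39EC | Plasmid E | 2070 | Yes |
| S39EC | Plasmid F | 1846 | Yes |
| S65EC | Chromosome | 5205011 | Yes |
| S65EC | Plasmid A | 147412 | Yes |
| S96EC | Chromosome | 5069496 | Yes |
| S96EC | Plasmid A | 164355 | Yes |
| S96EC | Plasmid B | 115965 | Yes |
| S96EC | Plasmid C | 14479 | Yes |
| S96EC | Plasmid D | 4184 | Yes |
| S97EC | Chromosome | 5178868 | Yes |
| S97EC | Plasmid A | 166099 | Yes |
| S97EC | Plasmid B | 96788 | Yes |
| S97EC | Plasmid C | 4092 | Yes |
| S97EC | Plasmid D | 3209 | Yes |
| S112EC | Chromosome | 5020013 | Yes |
| S112EC | Plasmid A | 161028 | Yes |
| S112EC | Plasmid B | 68847 | Yes |
| S112EC | Plasmid C | 5338 | Yes |
| S112EC | Plasmid D | 4136 | Yes |
| S116EC | Chromosome | 4989207 | Yes |
| S116EC | Plasmid A | 66792 | Yes |
| S116EC | Plasmid B | 5263 | Yes |
| S116EC | Plasmid C | 4257 | Yes |
| S116EC | Plasmid D | 4104 | Yes |
| S129EC | Chromosome | 5193964 | Yes |
| S129EC | Plasmid A | 163681 | Yes |
| S129EC | Plasmid B | 93505 | Yes |
| S129EC | Plasmid C | 33344 | Yes |
| S129EC | Plasmid D | 4087 | Yes |
| S129EC | Plasmid E | 2401 | Yes |
| S129EC | Plasmid F | 2121 | Yes |
| S129EC | Plasmid G | 1571 | Yes |
| EC958 | Chromosome | 5126816 | Yes |
| EC958 | Plasmid A | 136157 | Yes |
| EC958 | Plasmid B | 4145 | Yes |
| EC958 | Plasmid C | 1830 | Yes |
| HVM2044 | Chromosome | 5003288 | Yes |
| HVM2044 | Plasmid A | 142959 | Yes |
| HVM2044 | Plasmid B | 18716 | Yes |
| HVM2044 | Plasmid C | 18345 | Yes |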

Exemplar 2: Assembly of 12 E. coli ST131 samples using CPU resources @ Pawsey


Workflow summaries

Metadata

| metadata field | workflow_name / workflow_version |
|---|---|
| Version | v0.9 |
| Maturity | stable |
| Creators | Valentine Murigneux, Leah W Roberts, Brian M Forde, Minh-Duy Phan, Nguyen Thi Khanh Nhu, Adam D Irwin, Patrick N A Harris, David L Paterson, Mark A Schembri, David M Whiley, Scott A Beatson |
| Source | https://github.com/BeatsonLab-MicrobialGenomics/micropipe |
| License | https://github.com/BeatsonLab-MicrobialGenomics/micropipe/blob/main/LICENSE |
| Workflow manager | Nextflow |
| Container | Singularity |
| Install method | Manual |
| GitHub | https://github.com/BeatsonLab-MicrobialGenomics/micropipe |
| bio.tools | NA |
| BioContainers | NA |
| bioconda | NA |

Component tools

| Workflow element | Workflow element version | Workflow title |
|---|---|---|
| Guppy | v3.6.1 | microPIPE v0.9 |
| qcat | v1.0.1 | microPIPE v0.9 |
| rasusa | v0.3.0 | microPIPE v0.9 |
| pycoQC | v2.5.0.23 | microPIPE v0.9 |
| Porechop | v0.2.3 | microPIPE v0.9 |
| Filtlong | v0.2.0 | microPIPE v0.9 |
| Japsa | v1.9-01a | microPIPE v0.9 |
| Flye | v2.5 | microPIPE v0.9 |
| Racon | v1.4.9 | microPIPE v0.9 |
| Medaka | v0.10.0 | microPIPE v0.9 |
| NextPolish | v1.1.0 | microPIPE v0.9 |
| Circlator | v1.5.5 | microPIPE v0.9 |
| QUAST | v5.0.2 | microPIPE v0.9 |

Third party tools / dependencies

Nextflow can be used on any POSIX compatible system (Linux, OS X, etc). It requires Bash 3.2 (or later) and Java 8 (or later, up to 15) to be installed.

To install Nextflow, run the command:

wget -qO- https://get.nextflow.io | bash

or

curl -s https://get.nextflow.io | bash

It will create the nextflow main executable file in the current directory. Optionally, move the nextflow file to a directory accessible by your $PATH variable.

Due to the Oxford Nanopore Technologies terms and conditions, we are not allowed to redistribute the Guppy software either in its binary form or packaged form e.g. Docker or Singularity images. Therefore users will have to either install Guppy, provide a container image or start the pipeline from the basecalled fastq files.


Additional notes


Help / FAQ / Troubleshooting


Licence(s)

https://github.com/BeatsonLab-MicrobialGenomics/micropipe/blob/main/LICENSE


Acknowledgements / citations / credits

Citations

Acknowledgements

This work was supported by funding from the Queensland Genomics Health Alliance (now Queensland Genomics), Queensland Health, the Queensland Government.

The deployment of the workflow at the Pawsey Supercomputing Centre was supported by the Australian BioCommons via funding from Bioplatforms Australia, the Australian Research Data Commons (https://doi.org/10.47486/PL105) and the Queensland Government RICF programme. Bioplatforms Australia and the Australian Research Data Commons are funded by the National Collaborative Research Infrastructure Strategy (NCRIS).