emilytrybulec / argonaut

pipeline for genome assembly, accepting ONT and/or PacBio HiFi long reads as well as Illumina short reads
MIT License

Introduction

Argonaut performs automated reads to genome operations for de novo assemblies; it is a bioinformatics pipeline that assembles genomes from long and short read data. Fastq files and input information are fed to the pipeline, resulting in final assemblies with completeness, contiguity, and correctness quality checking at each step. The pipeline accepts short reads, long reads, or both.

Table of Contents

Pipeline Summary

Illumina Short Read

  1. Read QC, Adaptor Trimming, Contaminant Filtering (FastQC v0.11.9, FastP v0.23.4, GenomeScope2 v2.0, Jellyfish v2.2.6, Kraken2 v2.1.2, Recentrifuge v1.9.1)

PacBio HiFi Long Read (CCS format)

  1. Read QC, Adaptor Trimming, Contaminant Filtering (Nanoplot v1.41.0, CutAdapt v3.4, GenomeScope2 v2.0, Jellyfish v2.2.6, Kraken2 v2.1.2, Recentrifuge v1.9.1)
  2. Length Filtering (optional) (Seqkit v2.4.0, Nanoplot v1.41.0)

ONT Long Read

  1. Read QC and Contaminant Filtering (Nanoplot v1.41.0, KmerFreq, GCE, Centrifuge v1.0.4, Recentrifuge v1.9.1)
  2. Length Filtering (optional) (Seqkit v2.4.0, Nanoplot v1.41.0)

All reads are used for the following steps:

Argonaut Hybrid Workflow
  1. Assembly
  2. Polish
  3. Purge
  4. Scaffolding
  5. Quality Checking
  6. Assembly Visualization

A figure in the repository details the major workflow steps involved in hybrid assembly.

If you enable long read polishing, the pipeline will default to polishing with PacBio HiFi reads, falling back to ONT reads if no PacBio HiFi data are available. If short reads are also available, they will automatically be used to polish the assemblies after long read polishing (or directly after assembly if long read polishing is off).
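
For illustration, these toggles might look like the following in params.yaml; the key names below are hypothetical placeholders, not Argonaut's actual parameters, so check nextflow.config for the real option names:

# hypothetical keys for illustration only -- see nextflow.config for the real names
longread_polish: true    # polish with PacBio HiFi reads (or ONT if no HiFi)
shortread_polish: true   # polish with short reads after long read polishing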

Purge Haplotigs is the first step requiring manual curation: it produces a coverage histogram that must be inspected to choose values for the -l (low), -m (mid), and -h (high) cutoff flags. If purging is activated in your configuration, the pipeline will pause at the purge step and wait for manual input of these parameters according to the histogram of your assembly, which can be found in your out directory.
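
For reference, the cutoff selection mirrors the standalone Purge Haplotigs workflow. A minimal sketch, with placeholder file names and example cutoff values (read the real values off your own histogram):

# -l / -m / -h are the low, mid, and high read-depth cutoffs that bracket
# the haploid and diploid coverage peaks; the values below are examples only
purge_haplotigs contigcov -i aligned.bam.gencov -l 5 -m 30 -h 145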

Quick Start

Installation

Only Nextflow and Singularity need to be installed to run Argonaut. Users who would like to run Centrifuge and/or Kraken2 will need to provide a database; likewise, Recentrifuge and Blobtools require BLAST and the NCBI taxdump. Follow the links provided for database download directions. Xanadu users running Argonaut at the University of Connecticut may use the database paths provided in the example params.yaml.
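
As one possible setup (not specific to Argonaut), a standard Kraken2 database can be built and the NCBI taxdump fetched as follows:

# build the standard Kraken2 database (a large download requiring substantial disk space)
kraken2-build --standard --threads 8 --db k2_standard

# fetch and unpack the NCBI taxonomy dump used by Recentrifuge and Blobtools
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir taxdump && tar -xzf taxdump.tar.gz -C taxdump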

Samplesheets

To get started setting up your run, prepare a samplesheet with your input data as follows:

ont_samplesheet.csv:

sample,fastq_1,fastq_2,single_end
maca_jans_ont,SRR11191910.fastq.gz,,TRUE

If more than one read input type is available, prepare a second (and third) samplesheet with your other input data as follows:

illumina_samplesheet.csv:

sample,fastq_1,fastq_2,single_end
maca_jans_ill,SRR11191912_1.fastq.gz,SRR11191912_2.fastq.gz,FALSE

pb_hifi_samplesheet.csv:

sample,fastq_1,fastq_2,single_end
maca_jans_pb,SRR11191909.fastq.gz,,TRUE

!!! PLEASE ADD "ont", "pb", AND/OR "ill" TO YOUR SAMPLE NAMES !!! Failure to do so will result in assemblers not recognizing your read type.

Additionally, the sample name in your samplesheet will serve as the prefix for your output files. Please indicate which kind of read is being input in each sample name; failure to do so may result in outputs being overwritten.

After you have your samplesheet(s), create a params.yaml file to specify the paths to your samplesheet(s), contaminant databases, etc. Most likely, a config file will also need to be made to modify the default settings of the pipeline. Please look through the nextflow.config file to browse the defaults and specify which you would like to change in your my_config file. More information is located in the usage section.
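
A minimal params.yaml sketch is shown below; the key names are illustrative guesses following nf-core conventions, so confirm the actual parameter names in nextflow.config before running:

params.yaml:

# key names are illustrative -- confirm against nextflow.config
input: "ont_samplesheet.csv"   # path to your samplesheet
outdir: "results"              # where pipeline output should be written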

Now, you can run the pipeline using:

nextflow run emilytrybulec/argonaut \
  -r main \
  -params-file params.yaml \
  -c my_config \
  -profile singularity,xanadu

Pipeline output

All of the output from the programs run in the pipeline will be located in the out directory specified in params.yaml. The pipeline produces the following labeled directories depending on configurations:

├── 01 READ QC
│   ├── centrifuge
│   ├── fastp
│   ├── fastqc
│   ├── genome size est
│   │   ├── genomescope2
│   │   ├── jellyfish
│   │   ├── ont gce 
│   │   ├── ont kmerfreq
│   ├── kraken2
│   ├── nanoplot
│   ├── pacbio cutadapt
├── 02 ASSEMBLY
│   ├── hybrid
│   ├── long read
│   ├── short read
├── 03 POLISH
│   ├── hybrid
│   │   ├── polca
│   ├── long read
│   │   ├── medaka
│   │   ├── racon
├── 04 PURGE
│   ├── align
│   ├── histogram
│   ├── purge haplotigs
│   ├── short read redundans
├── 05 SCAFFOLD
├── ASSEMBLY QC
│   ├── busco
│   ├── bwamem2
│   ├── merqury
│   ├── minimap2
│   ├── quast
│   ├── samtools
├── OUTPUT
│   ├── blobtools visualization
│   ├── coverage
│   ├── genome size estimation
│   ├── *assemblyStats.txt
├── PIPELINE INFO
    └── execution_trace_*.txt

Some output files have labels such as "dc", indicating that the reads have been decontaminated, or "lf", indicating that reads have been length filtered.

Information about interpreting output is located in the output section.

Credits

emilytrybulec/genomeassembly was originally written by Emily Trybulec.

I thank the following people for their extensive assistance in the development of this pipeline:

University of Connecticut:

Contributions and Support

Development of this pipeline was funded by the University of Connecticut Office of Undergraduate Research through the Summer Undergraduate Research Fund (SURF) Grant.

The Biodiversity and Conservation Genomics Center is a part of the Earth Biogenome Project, working towards capturing the genetic diversity of life on Earth.

Citations

Argonaut is currently unpublished. For now, please use the GitHub URL when referencing.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.