gbouras13 / hybracter

Automated long-read first bacterial genome assembly tool implemented in Snakemake using Snaketool.
MIT License

Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies

hybracter is an automated long-read first bacterial genome assembly tool implemented in Snakemake using Snaketool.

Quick Start

Google Colab Notebooks

If you don't want to install hybracter locally, you can run it without any code using the colab notebook https://colab.research.google.com/github/gbouras13/hybracter/blob/main/run_hybracter.ipynb

This is only recommended if you have one or a few samples to assemble. Each sample takes a while (probably an hour or two) because Google Colab resources are limited. If you have more than this, a local install as described below is suggested.

Mamba/Conda

hybracter is available to install with pip or conda.

You will need conda or mamba available so hybracter can install all the required dependencies.

Therefore, it is recommended to install hybracter into a conda environment as follows.

mamba create -n hybracterENV -c bioconda -c conda-forge  hybracter
conda activate hybracterENV
hybracter --help
hybracter install

Mamba is highly recommended. Please see the documentation for more details on how to install mamba.

When you run hybracter for the first time, all the required dependencies will be installed, so it will take longer than usual (usually a few minutes). Every run afterwards will be a lot faster, as the dependencies will already be installed.

If you intend to run hybracter offline (e.g. on HPC nodes with no access to the internet), I highly recommend running hybracter test-hybrid and/or hybracter test-long on a node with internet access so hybracter can download the required dependencies. It should take 5-10 minutes. If your computer/node has internet access, please skip this step.

hybracter test-hybrid --threads 8
hybracter test-long --threads 8

Container

Alternatively, a Docker/Singularity Linux container image is available for Hybracter (starting from v0.7.1) here. This will likely be useful for running Hybracter in HPC environments.

To install and run v0.7.3 with Singularity:

IMAGE_DIR="<the directory you want the .sif file to be in>"
singularity pull --dir "$IMAGE_DIR" docker://quay.io/gbouras13/hybracter:0.7.3

containerImage="$IMAGE_DIR/hybracter_0.7.3.sif"

# example command with test fastqs
singularity exec "$containerImage" hybracter hybrid-single -l test_data/Fastqs/test_long_reads.fastq.gz \
  -1 test_data/Fastqs/test_short_reads_R1.fastq.gz -2 test_data/Fastqs/test_short_reads_R2.fastq.gz \
  -o output_test_singularity -t 4 -c 50000

Documentation

Documentation for hybracter is available here.

Manuscript

hybracter has recently been published in Microbial Genomics.

Description

hybracter is designed for assembling bacterial isolate genomes using a long-read first assembly approach. It scales massively using the embarrassingly parallel power of HPC and Snakemake profiles. It is designed for applications where you have isolates with Oxford Nanopore Technologies (ONT) long reads and optionally matched paired-end short reads for polishing.

hybracter is designed to straddle the fine line between being as feature-rich as possible, with as much information as you need to decide upon the best assembly, while also being a one-line automated program. In other words, as awesome as Unicycler, but updated for 2023. Perfect for lazy people like myself.

hybracter is largely based off Ryan Wick's magnificent tutorial and associated paper. hybracter differs in that it adds some additional steps regarding targeted plasmid assembly with plassembler, contig reorientation with dnaapler and extra polishing and statistical summaries.

Note: if you have PacBio reads, as of 2023, you can run hybracter long with --no_medaka to turn off polishing, and --flyeModel pacbio-hifi. You can also probably just run Flye or Dragonflye (or of course Trycycler) and reorient the contigs with dnaapler without polishing. See Ryan Wick's blogpost for more details.

Pipeline

Hybracter

Benchmarking

hybracter was benchmarked in both hybrid and long modes (specifically using the hybrid-single and long-single commands) against Unicycler v0.5.0 and Dragonflye v1.1.2.

30 samples from 5 studies with available reference genomes were benchmarked. You can see the full explanation and results here. You can find all the output here.

To summarise the conclusions:

v0.7.0 Updates (04 March 2024)

Changes to short read polishing

--logic changes

Changes for chromosome contigs and circularity

Adds --depth_filter

v0.5.0 Updates (08 January 2024)

Ryan Wick recently ran hybracter long on the latest Dorado v0.5.0 basecalled Nanopore reads (his blog post). You can read a write-up of the results here. As a result, subsampling has been added to Hybracter.

v0.4.0 Updates (14 November 2023)

v0.2.0 Updates (26 October 2023) - Medaka, Polishing and --no_medaka

Ryan Wick's blogpost on 24 October 2023 suggests that if you have new 5 kHz SUP or Res (bacterial model specific) ONT reads, Medaka polishing often makes things worse! It also implies that Nanopore reads are almost good enough to assemble perfect bacterial genomes (at least with Trycycler), which is pretty awesome.

Combined with the difficulty and randomness in installing Medaka from Nanopore, I have therefore decided to add a --no_medaka flag into v0.2.0.

I have also set Medaka to be v1.8.0 and I do not intend to upgrade this going forward, as this is the most recent stable bioconda version that doesn't seem to cause too much grief.

If you have trouble with Medaka installation, I'd therefore suggest using --no_medaka.

hybracter should still handle cases where Medaka makes assemblies worse. If Medaka makes your assembly appreciably worse, hybracter should select the unpolished assembly as the most accurate one in long mode.

Why Would You Run Hybracter?

Other Options

Trycycler

If you are looking for the best possible (manual) bacterial assembly for a single isolate, please use Trycycler.

Dragonflye

Dragonflye by the awesome @rpetit3 is a good alternative for automated assembly if hybracter doesn't fit your needs, particularly if you are familiar with Shovill. Some pros and cons between hybracter and dragonflye are listed below.

Installation

You will need conda (or mamba, which is highly recommended) to run hybracter, because it is required to install each compartmentalised environment (e.g. Flye will have its own environment). Please see the documentation for more details on how to install mamba.

Conda

hybracter is available to install with conda. To install hybracter into a conda environment called hybracterENV:

mamba create -n hybracterENV -c bioconda -c conda-forge hybracter
conda activate hybracterENV
hybracter --help
hybracter install

Pip

hybracter is available to install with pip.

You will also need conda or mamba available so hybracter can install all the required dependencies. Therefore, it is recommended to install hybracter into a conda environment as follows.

mamba create -n hybracterENV pip
conda activate hybracterENV
pip install hybracter
hybracter --help
hybracter install

Source

Alternatively, the development version of hybracter (which may include new, untested features) can be installed manually from GitHub.

git clone https://github.com/gbouras13/hybracter.git
cd hybracter
pip install -e .
hybracter --help

Main Commands

 _           _                    _            
| |__  _   _| |__  _ __ __ _  ___| |_ ___ _ __ 
| '_ \| | | | '_ \| '__/ _` |/ __| __/ _ \ '__|
| | | | |_| | |_) | | | (_| | (__| ||  __/ |   
|_| |_|\__, |_.__/|_|  \__,_|\___|\__\___|_|   
       |___/

Usage: hybracter [OPTIONS] COMMAND [ARGS]...

  For more options, run: hybracter command --help

Options:
  -h, --help  Show this message and exit.

Commands:
  install        Downloads and installs the plassembler database
  hybrid         Run hybracter with hybrid long and paired end short reads
  hybrid-single  Run hybracter hybrid on 1 isolate
  long           Run hybracter with only long reads
  long-single    Run hybracter long on 1 isolate
  test-hybrid    Test hybracter hybrid
  test-long      Test hybracter long
  config         Copy the system default config file
  citation       Print the citation(s) for hybracter
  version        Print the version for hybracter

Input csv

hybracter hybrid and hybracter long require an input csv file to be specified with --input. No other inputs are required.

hybracter hybrid

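hybracter hybrid requires an input csv with no headers and 5 columns: sample name, long-read FASTQ, chromosome size, R1 short-read FASTQ and R2 short-read FASTQ.

e.g.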

s_aureus_sample1,sample1_long_read.fastq.gz,2500000,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz
p_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz
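With many isolates, writing this sheet by hand gets tedious. The following is a minimal sketch of generating it from a directory of reads. The helper name make_hybrid_csv, the file-naming scheme (<sample>_long.fastq.gz, <sample>_R1.fastq.gz, <sample>_R2.fastq.gz) and the use of a single chromosome size for every sample are all assumptions made for illustration; none of them is part of hybracter.

```shell
# Hypothetical helper, not part of hybracter: print one hybrid sample-sheet row
# per sample found in a reads directory.
# Assumed layout: <sample>_long.fastq.gz, <sample>_R1.fastq.gz, <sample>_R2.fastq.gz
# Usage: make_hybrid_csv <reads_dir> <chromosome_size>
make_hybrid_csv() {
    reads_dir=$1
    chrom_size=$2
    for long_reads in "$reads_dir"/*_long.fastq.gz; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$long_reads" ] || continue
        sample=$(basename "$long_reads" _long.fastq.gz)
        r1="$reads_dir/${sample}_R1.fastq.gz"
        r2="$reads_dir/${sample}_R2.fastq.gz"
        if [ -e "$r1" ] && [ -e "$r2" ]; then
            printf '%s,%s,%s,%s,%s\n' "$sample" "$long_reads" "$chrom_size" "$r1" "$r2"
        else
            echo "skipping $sample: short reads not found" >&2
        fi
    done
}

# Demo with empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample1_long.fastq.gz" "$demo_dir/sample1_R1.fastq.gz" "$demo_dir/sample1_R2.fastq.gz"
touch "$demo_dir/sample2_long.fastq.gz"    # no short reads, so this one is skipped
make_hybrid_csv "$demo_dir" 2500000 > "$demo_dir/input.csv"
cat "$demo_dir/input.csv"
```

If your isolates differ in genome size, as in the example rows above, edit the third column per sample after generating the sheet.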

hybracter long

hybracter long also requires an input csv with no headers, but only 3 columns.

e.g.

s_aureus_sample1,sample1_long_read.fastq.gz,2500000
p_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000
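Because the sheet has no header, a stray header row or a missing column is easy to overlook. As a rough sketch, the shape of a sheet can be checked before launching a run. The helper name check_input_csv is hypothetical and not part of hybracter, and it only checks the column count and that the third column is a whole number; it does not check that the FASTQ files exist.

```shell
# Hypothetical helper, not part of hybracter: check the shape of an input csv.
# Usage: check_input_csv <file> <expected_columns>   (5 for hybrid, 3 for long)
check_input_csv() {
    awk -F',' -v expected="$2" '
        NF != expected {
            printf "line %d: expected %d columns, found %d\n", NR, expected, NF
            bad = 1
            next
        }
        $3 !~ /^[0-9]+$/ {
            printf "line %d: chromosome size \"%s\" is not a whole number\n", NR, $3
            bad = 1
        }
        END { exit bad + 0 }    # exit status 1 if any line was reported
    ' "$1"
}

# Demo: write the two-sample long-read sheet from the example and check it.
sheet=$(mktemp)
printf '%s\n' \
    's_aureus_sample1,sample1_long_read.fastq.gz,2500000' \
    'p_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000' > "$sheet"
check_input_csv "$sheet" 3 && echo "sample sheet looks OK"
```

A header row would be reported, because its third field is not a number.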

Usage

hybracter install

You will first need to install the hybracter databases.

hybracter install

Alternatively, you can specify a particular directory to store them. You will then need to pass it with -d <databases directory> when you run hybracter.

hybracter install -d <databases directory>

Installing Dependencies

If you have internet access on the machine or node where you are running hybracter, you can skip this step.

When you run hybracter for the first time, all the required dependencies will be installed, so it will take longer than usual (usually a few minutes). Every run afterwards will be a lot faster, as the dependencies will already be installed.

If you intend to run hybracter offline (e.g. on HPC nodes with no access to the internet), I highly recommend running hybracter test-hybrid and/or hybracter test-long on a node with internet access so hybracter can download the required dependencies. It should take 5-10 minutes.

hybracter test-hybrid 
hybracter test-long
hybracter --help

Once that is done, run hybracter hybrid or hybracter long as follows.

hybracter hybrid

hybracter hybrid -i <input.csv> -o <output_dir> -t <threads> 

hybracter hybrid-single

hybracter hybrid-single -l <longread FASTQ> -1 <R1 short reads FASTQ> -2 <R2 short reads FASTQ> -s <sample name> -c <chromosome size> -o <output_dir> -t <threads>  [other arguments]

hybracter long

hybracter long -i <input.csv> -o <output_dir> -t <threads> [other arguments]

hybracter long-single

hybracter long-single -l <longread FASTQ> -s <sample name> -c <chromosome size>  -o <output_dir> -t <threads>  [other arguments]

Outputs

hybracter creates a number of output files in different formats.

For more information about all possible file outputs, please see the documentation here.

Main Output Files

The main outputs are in the FINAL_OUTPUT directory.

This directory will include:

  1. hybracter_summary.tsv file. This gives the summary statistics for your assemblies with the following columns:
     Sample, Complete (True or False), Total_assembly_length, Number_of_contigs, Most_accurate_polishing_round, Longest_contig_length, Longest_contig_coverage, Number_circular_plasmids
  2. complete and incomplete directories.

Each sample that hybracter denotes as complete will have 5 outputs in the complete directory.

Each sample that hybracter denotes as incomplete will have 3 outputs in the incomplete directory.
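The summary table is convenient for scripting, for example to find which isolates need a closer look. Below is a minimal sketch that lists the samples marked incomplete. It assumes hybracter_summary.tsv is tab-separated with a header row using the column names listed above and the literal values True and False. The helper name list_incomplete and every value in the demo rows are made up for illustration.

```shell
# Hypothetical helper, not part of hybracter: print the names of samples whose
# Complete column is False. Columns are located by header name, not position.
list_incomplete() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Sample" in col) || !("Complete" in col)) {
                print "missing Sample or Complete column" > "/dev/stderr"
                exit 2
            }
            next
        }
        $(col["Complete"]) == "False" { print $(col["Sample"]) }
    ' "$1"
}

# Demo on a made-up summary table (all values below are placeholders).
summary=$(mktemp)
printf 'Sample\tComplete\tTotal_assembly_length\tNumber_of_contigs\tMost_accurate_polishing_round\tLongest_contig_length\tLongest_contig_coverage\tNumber_circular_plasmids\n' > "$summary"
printf 'sample1\tTrue\t2800000\t2\tround_a\t2750000\t60.5\t1\n' >> "$summary"
printf 'sample2\tFalse\t5400000\t14\tround_b\t1200000\t35.2\t0\n' >> "$summary"
list_incomplete "$summary"    # prints: sample2
```

On a real run you would point it at the summary file inside FINAL_OUTPUT in your output directory.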

Snakemake Profiles

I would highly recommend running hybracter using a Snakemake profile. Please see this blog post for more details. I have included an example slurm profile in the profile directory, but check out this link for more detail on other HPC job scheduler profiles.

hybracter hybrid --input <input.csv> --output <output_dir> --threads <threads> --profile profiles/hybracter

Advanced Configuration

Thanks to its Snakemake backend, you can modify resource requirements for each job contained within hybracter using the configuration file. A default configuration file can be created using the hybracter config command. Tuning it can make hybracter even more efficient in server environments, as many jobs can be parallelised more efficiently than with the default settings. For more information, please see the documentation.

Version Log

A brief description of what is new in each update of hybracter can be found in the HISTORY.md file.

System

hybracter is tested on Linux and on macOS.

Bugs and Suggestions

If you come across bugs with hybracter, or would like to make any suggestions to improve the program, please open an issue or email george.bouras@adelaide.edu.au.

Citation

If you use Hybracter, please cite the manuscript along with core dependencies (they are also our tools!):

Hybracter Manuscript

Plassembler:

Dnaapler:

Ryan Wick et al's Assembling the perfect bacterial genome paper, which provided the intellectual framework for hybracter:

I would also recommend citing Hybracter's other dependencies if you can where they are used:

Flye:

Snaketool:

Trimnami:

Filtlong:

Porechop and Porechop_abi:

fastp:

ALE:

Medaka:

Pyrodigal:

Polypolish:

Pypolca:

Snakemake: