snakemake-ont-bacterial-variants

A Snakemake workflow for the identification of variants in bacterial genomes using nanopore long-read sequencing.

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog.

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).

Workflow overview

This workflow provides a simple and easy-to-use framework for the identification of structural and small nucleotide variants in bacterial genomes using nanopore long-read sequencing data. The snakemake-ont-bacterial-variants workflow is built using snakemake and consists of the following steps:

Quality check of sequencing data (FastQC)
Filtering of input sequencing data by read length and quality (Filtlong)
Mapping to reference genome (NGMLR)
Calling of structural and single nucleotide variants (SVs: cuteSV and Sniffles2; SNVs: Clair3 and Medaka)
Filtering of identified variants (e.g., by variant quality or genomic regions; BCFtools and VCFtools)
Generate report with final results (R markdown, igv-reports, and MultiQC)

If you would like to contribute, report issues, or suggest features, please get in touch on GitHub.

Installation

Snakemake

Step 1: Install snakemake with conda in a new conda environment.

conda create -n <ENV> snakemake pandas

Step 2: Activate conda environment with snakemake

conda activate <ENV>

Additional tools

Important note:

All other dependencies for the workflow are automatically pulled as conda environments by snakemake, when running the workflow with the --use-conda parameter (recommended).

When run without automatically built conda environments, all packages need to be installed manually:

NanoPlot
MultiQC
Filtlong
NGMLR
minimap2
samtools
bedtools
Medaka
Clair3
cuteSV
Sniffles2
bcftools
VCFtools
r-tidyverse
r-rmarkdown
r-dt
igv-reports

Running the workflow

Input data

The workflow requires the following files to be located in the data directory:

Whole-genome sequencing data in *.fastq.gz format in data/fastq
Reference genome(s) in *.fa format in data/reference

Optionally, users can provide:

Reference genome annotation in *.gff format in data/annotation (for feature annotation in IGV report)
A *.bed file with genomic regions to ignore for variant calling in data/masked_region

Please ensure that the chromosome names in *.gff and *.bed files are the same as in the reference genome.

Input data files are provided in the samples.tsv table, whose location is inidcated in the config.yml file. The samplesheet must comply with the following structure:

sample defines the sample name that will be used throughout the workflow and thus needs to be unique.
fastq provides the path to the sample's *.fastq.gz file.
reference provides the path to the reference genome *.fa file (may be the same for several / all samples).
annotation provides the path to the optional reference genome annotation in *.gff file (may be the same for several / all samples). If no annotation is provided, you must enter n/a!
masked_regions provides the path to the optional *.bed file for filtering genomic regions (may be the same for several / all samples). If no *.bed file is provided, you must enter n/a!

sample	fastq	reference	annotation	masked_regions
\	data/fastq/\.fastq.gz	data/reference/\.fa	data/annotation/\.gff	data/masked_region/\.bed
\	data/fastq/\.fastq.gz	data/reference/\.fa	data/annotation/\.gff	data/masked_region/\.bed
...	...	...	...	...
\	data/fastq/\.fastq.gz	data/reference/\.fa	data/annotation/\.gff	data/masked_region/\.bed

Configuration and parameters

Before executing the workflow, you may want to adjust several options and parameters in the default config file config/config.yml:

Directories:
- indir: Input directory for all input files, data by default (see above)
- outdir: Output directory (relative to working directory), results by default
Sample information:
- samples: Path to samplesheet (relative to working directory), samplesheet/samples.tsv by default
- libprepkit: Kit from ONT used for library preparation, e.g. SQK-NBD114.24
- basecalling_model: Model used for basecalling of raw sequencing data (required for variant calling using Medaka), currently supported models are:
  - r1041_e82_400bps_sup_v4.2.0
  - r1041_e82_400bps_sup_v4.3.0
Tool parameters:
- The number of cores can be adjusted here for the following tools: NGMLR, NanoPlot, MultiQC, Medaka, Clair3, Sniffles2, and cuteSV
- You may further adjust the run parameters for the following tools (please refer to the reference provided for more details on run parameters):
  - Filtlong: By default, reads are filtered for a minimum length of 500 bp and a mean accuracy of at least 90% (Q10), with 90% of the longest and highest-quailty reads to be kept.
  - Clair3: Variants are called on all contigs in a haploid-sensitive, ONT-specific mode using --include_all_ctgs --haploid_sensitive --platform ont.
  - cuteSV: Variants are called with the suggested parameters for ONT data (--max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3) and the genotyping option enabled (--genotype).
Filtering of variants:
- The variant quality thresholds can be adjusted here for all four variant callers
- remove_common_variants: If True, variants which have been identified in all samples with the same reference genome by one tool are filtered out. This is helpful in case all samples derive from a strain, whose genome sequence already differs from the used reference sequence. If False, all variants are reported.
Reporting options:
- igv_region_length: Neighboring variants with a maximum bp distance indicated here [1 by default] are reported in one region in the IGV variant report. Increasing this parameter will reduce the file size of the resulting IGV HTML report, if hotspots / regions with many variants exist in a sample.

Execution

To run the workflow from the command line, change the working directory:

cd /path/to/snakemake-ont-bacterial-variants

Before running the entire workflow, you may perform a dry run using:

snakemake --dry-run

To run the complete workflow with your files using conda, execute the following command (the definition of the number of cores is mandatory):

snakemake --cores 10 --use-conda

You may also run the workflow on the provided test data using:

snakemake --cores 10 --use-conda --directory .test

Output

Main output

The most important output files and reports are found in variant_reports for each sample in a separate sub-directory variant_reports/<sample>/:

<sample>_overview.html: Custom overview HTML report comprising read and mapping statistics as well as identified variants.
<sample>_IGV.html: HTML report with alignment data for all variants listed in <sample>_overview.html (generated with igv-reports).
<sample>.<tool>.vcf: File with all variants for each tool used (after filtering).

Further, variants_reports contains MultiQC.html, an interactive HTML report generated by MultiQC with read statistics on raw, filtered and aligned reads.

Log files from the generation of above reports can be found in variant_reports/logs/:

<sample>.collect_vcfs.log: Copying of variant files to above destinations
<sample>.igv_reports.log: Generation of IGV HTML report
<sample>.prepare_vcfs.log: Merging of neighboring variants in genomic regions for igv-reports
<sample>.report.log: Generation of final overview HTML report using R Markdown

Additional output files

In addition, the workflow generates the following output files in the corresponding directories:

filtered_reads

- `.fastq.gz`: Length- and quality-filtered reads - `logs/.log`: Filtering log file(s)

mapping

- `.bam`: Alignment file of filtered reads to sample-specific reference genome (produced by `NGMLR`) - `.bam.bai`: Alignment index file - `logs/.*.log`: Log files for `NGMLR`, `samtools`, and `bedtools` (calculation of genome coverage and read end distributions [as temporary files only])

Read statistics on all raw, filtered and aligned reads are found here (generated with `NanoPlot`): - `raw_reads//`: Directory with output files with read statistics on all **raw reads** (log files in `filtered_reads/logs/.log`). - `filtered_reads//`: Directory with output files with read statistics on all **filtered reads** (log files in `filtered_reads/logs/.log`). - `aligned_reads//`: Directory with output files with read statistics on all **aligned reads** (log files in `aligned_reads/logs/.log`). The raw output from `MultiQC` - based on above `NanoPlot` results - is located in the directory `multiqc`, containing the interactive HTML report `multiqc/multiqc_report.html`.

SNV

For both tools (`medaka` and `clair3`): - `//`: Directory with all **raw** output files from variant calling tool - `/.vcf`: File with identified variants - `/.filtered.vcf`: File with variants filtered by variant quality, overlap with other samples from same reference genome (only if `remove_common_variants: True` in `config.yml`), and by genomic region if provided (compare `masked_region` in samplesheet) - `/logs/.*`: Standard error (`stderr`) and standard output (`stdout`) files for the identification of variants using either `medaka` or `clair3` - `/logs/_filtering.log`: Log file produced by `VCFtools` for final filtering step to generate `/.filtered.vcf` If `remove_common_variants: True` in `config.yml`, the following files are additionally produced: - `/common_variants.vcf`: File with variants called by the tool which are shared by all samples with the same reference genome - `/common_variants.txt`: Tab-delimited file deduced from `/common_variants.vcf` listing contig and variant start position only - `/logs/_indexing.*`: Standard error (`stderr`) and standard output (`stdout`) for indexing of `*.vcf` files using `bcftools` - `/logs/bcftools.*`: Standard error (`stderr`) and standard output (`stdout`) for identification of shared variants across samples uisng `bcftools` For `clair3`, data from the downloaded model is additionally found in the directory `clair3/model/`.

For both tools (`cutesv` and `sniffles2`): - `/.vcf`: File with identified variants - `/.filtered.vcf`: File with variants filtered by variant quality, overlap with other samples from same reference genome (only if `remove_common_variants: True` in `config.yml`), and by genomic region if provided (compare `masked_region` in samplesheet) - `/logs/.log`: Log file for the identification of variants using either `cutesv` or `sniffles2` - `/logs/_filtering.log`: Log file produced by `VCFtools` for final filtering step to generate `/.filtered.vcf` If `remove_common_variants: True` in `config.yml`, the following files are additionally produced: - `/common_variants.vcf`: File with variants called by the tool which are shared by all samples with the same reference genome - `/common_variants.txt`: Tab-delimited file deduced from `/common_variants.vcf` listing contig and variant start position only - `/logs/_indexing.*`: Standard error (`stderr`) and standard output (`stdout`) for indexing of `*.vcf` files using `bcftools` - `/logs/bcftools.*`: Standard error (`stderr`) and standard output (`stdout`) for identification of shared variants across samples uisng `bcftools` For `cutesv`, all **raw** output files from the initial variant calling can be found in the directory `cutesv//`.

Authors

Dr. Thomas Fabian Wulff

Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
ORCID: https://orcid.org/0000-0001-7166-0899

Please also visit the MPUSP GitHub page at https://github.com/MPUSP for further information on this workflow and other projects!

References

The following software and tools have been used in this workflow:

BCFtools

> Danecek, P., Bonfield, J.K., Liddle, J. et al. _Twelve years of SAMtools and BCFtools_. GigaScience 10(2), giab008, 2021. (https://doi.org/10.1093/gigascience/giab008)

bedtools

> Quinlan, A.R., Hall, I.M., _BEDTools: a flexible suite of utilities for comparing genomic features_. Bioinformatics 26(6), 841-842, 2010. ( https://doi.org/10.1093/bioinformatics/btq033)

Clair3

> Zheng, Z., Li, S., Su, J. et al. _Symphonizing pileup and full-alignment for deep learning-based long-read variant calling_. Nature Computational Science 2, 797–803, 2022. (https://doi.org/10.1038/s43588-022-00387-x)

cuteSV

> Jiang, T., Liu, Y., Jiang, Y. et al. _Long-read-based human genomic structural variation detection with cuteSV_. Genome Biology 21, 189, 2020. (https://doi.org/10.1186/s13059-020-02107-y)

Filtlong

> https://github.com/rrwick/Filtlong

igv-reports

> https://github.com/igvteam/igv-reports

Medaka

> https://github.com/nanoporetech/medaka

MultiQC

> Ewels, P., Magnusson, M., Lundin, S., et al. _MultiQC: summarize analysis results for multiple tools and samples in a single report_. Bioinformatics 32(19) 3047–3048, 2016. (https://doi.org/10.1093/bioinformatics/btw354)

NanoPlot

> Coster, W.D., Rademakers, R. _NanoPack2: population-scale evaluation of long-read sequencing data_. Bioinformatics 39(5), btad311, 2023. (https://doi.org/10.1093/bioinformatics/btad311)

NGMLR

> Sedlazeck, F.J., Rescheneder, P., Smolka, M. et al. _Accurate detection of complex structural variations using single-molecule sequencing_. Nature Methods 15, 461–468, 2018. (https://doi.org/10.1038/s41592-018-0001-7)

SAMtools

> Li, H., Handsaker, B., Wysoker, A. et al. _The Sequence Alignment/Map format and SAMtools_. Bioinformatics 25(16), 2078–2079, 2009. (https://doi.org/10.1093/bioinformatics/btp352)

Snakemake

> Mölder, F., Jablonski, K.P., Letcher, B. et al. _Sustainable data analysis with Snakemake_. F1000Research 10:33, 2021. (https://doi.org/10.12688/f1000research.29032.2)

Sniffles2

> Smolka, M., Paulin, L.F., Grochowski, C.M. et al. _Detection of mosaic and population-level structural variants with Sniffles2_. Nature Biotechnology 2024. (https://doi.org/10.1038/s41587-023-02024-y)

VCFtools

> Danecek, P., Auton, A., Abecasis, G. et al. _The variant call format and VCFtools_. Bioinformatics 27(15), 2156–2158, 2011. (https://doi.org/10.1093/bioinformatics/btr330)

MPUSP / snakemake-ont-bacterial-variants

readme

snakemake-ont-bacterial-variants

Usage

Workflow overview

Installation

Snakemake

Additional tools

Running the workflow

Input data

Configuration and parameters

Execution

Output

Main output

Additional output files

Authors

References