snakemake-ont-bacterial-variants
A Snakemake workflow for the identification of variants in bacterial genomes using nanopore long-read sequencing.
Usage
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).
Workflow overview
This workflow provides a simple and easy-to-use framework for the identification of structural and small nucleotide variants in bacterial genomes using nanopore long-read sequencing data.
The snakemake-ont-bacterial-variants
workflow is built using snakemake and consists of the following steps:
- Quality check of sequencing data (
FastQC
)
- Filtering of input sequencing data by read length and quality (
Filtlong
)
- Mapping to reference genome (
NGMLR
)
- Calling of structural and single nucleotide variants (SVs:
cuteSV
and Sniffles2
; SNVs: Clair3
and Medaka
)
- Filtering of identified variants (e.g., by variant quality or genomic regions;
BCFtools
and VCFtools
)
- Generate report with final results (
R markdown
, igv-reports
, and MultiQC
)
If you would like to contribute, report issues, or suggest features, please get in touch on GitHub.
Installation
Snakemake
Step 1: Install snakemake with conda
in a new conda environment.
conda create -n <ENV> snakemake pandas
Step 2: Activate conda environment with snakemake
conda activate <ENV>
Additional tools
Important note:
All other dependencies for the workflow are automatically pulled as conda
environments by snakemake, when running the workflow with the --use-conda
parameter (recommended).
When run without automatically built conda
environments, all packages need to be installed manually:
NanoPlot
MultiQC
Filtlong
NGMLR
minimap2
samtools
bedtools
Medaka
Clair3
cuteSV
Sniffles2
bcftools
VCFtools
r-tidyverse
r-rmarkdown
r-dt
igv-reports
Running the workflow
Input data
The workflow requires the following files to be located in the data
directory:
- Whole-genome sequencing data in
*.fastq.gz
format in data/fastq
- Reference genome(s) in
*.fa
format in data/reference
Optionally, users can provide:
- Reference genome annotation in
*.gff
format in data/annotation
(for feature annotation in IGV report)
- A
*.bed
file with genomic regions to ignore for variant calling in data/masked_region
Please ensure that the chromosome names in *.gff
and *.bed
files are the same as in the reference genome.
Input data files are provided in the samples.tsv
table, whose location is inidcated in the config.yml
file. The samplesheet must comply with the following structure:
sample
defines the sample name that will be used throughout the workflow and thus needs to be unique.
fastq
provides the path to the sample's *.fastq.gz
file.
reference
provides the path to the reference genome *.fa
file (may be the same for several / all samples).
annotation
provides the path to the optional reference genome annotation in *.gff
file (may be the same for several / all samples). If no annotation is provided, you must enter n/a
!
masked_regions
provides the path to the optional *.bed
file for filtering genomic regions (may be the same for several / all samples). If no *.bed
file is provided, you must enter n/a
!
sample |
fastq |
reference |
annotation |
masked_regions |
\ |
data/fastq/\.fastq.gz |
data/reference/\.fa |
data/annotation/\.gff |
data/masked_region/\.bed |
\ |
data/fastq/\.fastq.gz |
data/reference/\.fa |
data/annotation/\.gff |
data/masked_region/\.bed |
... |
... |
... |
... |
... |
\ |
data/fastq/\.fastq.gz |
data/reference/\.fa |
data/annotation/\.gff |
data/masked_region/\.bed |
Configuration and parameters
Before executing the workflow, you may want to adjust several options and parameters in the default config file config/config.yml
:
- Directories:
indir
: Input directory for all input files, data
by default (see above)
outdir
: Output directory (relative to working directory), results
by default
- Sample information:
samples
: Path to samplesheet (relative to working directory), samplesheet/samples.tsv
by default
libprepkit
: Kit from ONT used for library preparation, e.g. SQK-NBD114.24
basecalling_model
: Model used for basecalling of raw sequencing data (required for variant calling using Medaka
), currently supported models are:
r1041_e82_400bps_sup_v4.2.0
r1041_e82_400bps_sup_v4.3.0
- Tool parameters:
- The number of cores can be adjusted here for the following tools:
NGMLR
, NanoPlot
, MultiQC
, Medaka
, Clair3
, Sniffles2
, and cuteSV
- You may further adjust the run parameters for the following tools (please refer to the reference provided for more details on run parameters):
Filtlong
: By default, reads are filtered for a minimum length of 500 bp and a mean accuracy of at least 90% (Q10), with 90% of the longest and highest-quailty reads to be kept.
Clair3
: Variants are called on all contigs in a haploid-sensitive, ONT-specific mode using --include_all_ctgs --haploid_sensitive --platform ont
.
cuteSV
: Variants are called with the suggested parameters for ONT data (--max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3
) and the genotyping option enabled (--genotype
).
- Filtering of variants:
- The variant quality thresholds can be adjusted here for all four variant callers
remove_common_variants
: If True
, variants which have been identified in all samples with the same reference genome by one tool are filtered out. This is helpful in case all samples derive from a strain, whose genome sequence already differs from the used reference sequence. If False
, all variants are reported.
- Reporting options:
igv_region_length
: Neighboring variants with a maximum bp distance indicated here [1 by default] are reported in one region in the IGV variant report. Increasing this parameter will reduce the file size of the resulting IGV HTML report, if hotspots / regions with many variants exist in a sample.
Execution
To run the workflow from the command line, change the working directory:
cd /path/to/snakemake-ont-bacterial-variants
Before running the entire workflow, you may perform a dry run using:
snakemake --dry-run
To run the complete workflow with your files using conda
, execute the following command (the definition of the number of cores is mandatory):
snakemake --cores 10 --use-conda
You may also run the workflow on the provided test data using:
snakemake --cores 10 --use-conda --directory .test
Output
Main output
The most important output files and reports are found in variant_reports
for each sample in a separate sub-directory variant_reports/<sample>/
:
<sample>_overview.html
: Custom overview HTML report comprising read and mapping statistics as well as identified variants.
<sample>_IGV.html
: HTML report with alignment data for all variants listed in <sample>_overview.html
(generated with igv-reports
).
<sample>.<tool>.vcf
: File with all variants for each tool used (after filtering).
Further, variants_reports
contains MultiQC.html
, an interactive HTML report generated by MultiQC
with read statistics on raw, filtered and aligned reads.
Log files from the generation of above reports can be found in variant_reports/logs/
:
<sample>.collect_vcfs.log
: Copying of variant files to above destinations
<sample>.igv_reports.log
: Generation of IGV HTML report
<sample>.prepare_vcfs.log
: Merging of neighboring variants in genomic regions for igv-reports
<sample>.report.log
: Generation of final overview HTML report using R Markdown
Additional output files
In addition, the workflow generates the following output files in the corresponding directories:
filtered_reads
- `.fastq.gz`: Length- and quality-filtered reads
- `logs/.log`: Filtering log file(s)
mapping
- `.bam`: Alignment file of filtered reads to sample-specific reference genome (produced by `NGMLR`)
- `.bam.bai`: Alignment index file
- `logs/.*.log`: Log files for `NGMLR`, `samtools`, and `bedtools` (calculation of genome coverage and read end distributions [as temporary files only])
qc
Read statistics on all raw, filtered and aligned reads are found here (generated with `NanoPlot`):
- `raw_reads//`: Directory with output files with read statistics on all **raw reads** (log files in `filtered_reads/logs/.log`).
- `filtered_reads//`: Directory with output files with read statistics on all **filtered reads** (log files in `filtered_reads/logs/.log`).
- `aligned_reads//`: Directory with output files with read statistics on all **aligned reads** (log files in `aligned_reads/logs/.log`).
The raw output from `MultiQC` - based on above `NanoPlot` results - is located in the directory `multiqc`, containing the interactive HTML report `multiqc/multiqc_report.html`.
SNV
For both tools (`medaka` and `clair3`):
- `//`: Directory with all **raw** output files from variant calling tool
- `/.vcf`: File with identified variants
- `/.filtered.vcf`: File with variants filtered by variant quality, overlap with other samples from same reference genome (only if `remove_common_variants: True` in `config.yml`), and by genomic region if provided (compare `masked_region` in samplesheet)
- `/logs/.*`: Standard error (`stderr`) and standard output (`stdout`) files for the identification of variants using either `medaka` or `clair3`
- `/logs/_filtering.log`: Log file produced by `VCFtools` for final filtering step to generate `/.filtered.vcf`
If `remove_common_variants: True` in `config.yml`, the following files are additionally produced:
- `/common_variants.vcf`: File with variants called by the tool which are shared by all samples with the same reference genome
- `/common_variants.txt`: Tab-delimited file deduced from `/common_variants.vcf` listing contig and variant start position only
- `/logs/_indexing.*`: Standard error (`stderr`) and standard output (`stdout`) for indexing of `*.vcf` files using `bcftools`
- `/logs/bcftools.*`: Standard error (`stderr`) and standard output (`stdout`) for identification of shared variants across samples uisng `bcftools`
For `clair3`, data from the downloaded model is additionally found in the directory `clair3/model/`.
SV
For both tools (`cutesv` and `sniffles2`):
- `/.vcf`: File with identified variants
- `/.filtered.vcf`: File with variants filtered by variant quality, overlap with other samples from same reference genome (only if `remove_common_variants: True` in `config.yml`), and by genomic region if provided (compare `masked_region` in samplesheet)
- `/logs/.log`: Log file for the identification of variants using either `cutesv` or `sniffles2`
- `/logs/_filtering.log`: Log file produced by `VCFtools` for final filtering step to generate `/.filtered.vcf`
If `remove_common_variants: True` in `config.yml`, the following files are additionally produced:
- `/common_variants.vcf`: File with variants called by the tool which are shared by all samples with the same reference genome
- `/common_variants.txt`: Tab-delimited file deduced from `/common_variants.vcf` listing contig and variant start position only
- `/logs/_indexing.*`: Standard error (`stderr`) and standard output (`stdout`) for indexing of `*.vcf` files using `bcftools`
- `/logs/bcftools.*`: Standard error (`stderr`) and standard output (`stdout`) for identification of shared variants across samples uisng `bcftools`
For `cutesv`, all **raw** output files from the initial variant calling can be found in the directory `cutesv//`.
Authors
Dr. Thomas Fabian Wulff
Please also visit the MPUSP GitHub page at https://github.com/MPUSP for further information on this workflow and other projects!
References
The following software and tools have been used in this workflow:
BCFtools
> Danecek, P., Bonfield, J.K., Liddle, J. et al. _Twelve years of SAMtools and BCFtools_. GigaScience 10(2), giab008, 2021. (https://doi.org/10.1093/gigascience/giab008)
bedtools
> Quinlan, A.R., Hall, I.M., _BEDTools: a flexible suite of utilities for comparing genomic features_. Bioinformatics 26(6), 841-842, 2010. ( https://doi.org/10.1093/bioinformatics/btq033)
Clair3
> Zheng, Z., Li, S., Su, J. et al. _Symphonizing pileup and full-alignment for deep learning-based long-read variant calling_. Nature Computational Science 2, 797–803, 2022. (https://doi.org/10.1038/s43588-022-00387-x)
cuteSV
> Jiang, T., Liu, Y., Jiang, Y. et al. _Long-read-based human genomic structural variation detection with cuteSV_. Genome Biology 21, 189, 2020. (https://doi.org/10.1186/s13059-020-02107-y)
Filtlong
> https://github.com/rrwick/Filtlong
igv-reports
> https://github.com/igvteam/igv-reports
Medaka
> https://github.com/nanoporetech/medaka
MultiQC
> Ewels, P., Magnusson, M., Lundin, S., et al. _MultiQC: summarize analysis results for multiple tools and samples in a single report_. Bioinformatics 32(19) 3047–3048, 2016. (https://doi.org/10.1093/bioinformatics/btw354)
NanoPlot
> Coster, W.D., Rademakers, R. _NanoPack2: population-scale evaluation of long-read sequencing data_. Bioinformatics 39(5), btad311, 2023. (https://doi.org/10.1093/bioinformatics/btad311)
NGMLR
> Sedlazeck, F.J., Rescheneder, P., Smolka, M. et al. _Accurate detection of complex structural variations using single-molecule sequencing_. Nature Methods 15, 461–468, 2018. (https://doi.org/10.1038/s41592-018-0001-7)
SAMtools
> Li, H., Handsaker, B., Wysoker, A. et al. _The Sequence Alignment/Map format and SAMtools_. Bioinformatics 25(16), 2078–2079, 2009. (https://doi.org/10.1093/bioinformatics/btp352)
Snakemake
> Mölder, F., Jablonski, K.P., Letcher, B. et al. _Sustainable data analysis with Snakemake_. F1000Research 10:33, 2021. (https://doi.org/10.12688/f1000research.29032.2)
Sniffles2
> Smolka, M., Paulin, L.F., Grochowski, C.M. et al. _Detection of mosaic and population-level structural variants with Sniffles2_. Nature Biotechnology 2024. (https://doi.org/10.1038/s41587-023-02024-y)
VCFtools
> Danecek, P., Auton, A., Abecasis, G. et al. _The variant call format and VCFtools_. Bioinformatics 27(15), 2156–2158, 2011. (https://doi.org/10.1093/bioinformatics/btr330)