Snakemake workflow to map and quantify prevalence and abundance of genomes in metagenomes.
Paired metagenomic reads are mapped to a genome database with BWA and filtered based on nucleotide identity, mapping score and alignment fraction with SAMtools. Read counts, coverage and depth are calculated per genome for each metagenome and final summary files are generated across samples. Some of these concepts are discussed in great detail by Matt Olm (developer of dRep and inStrain) here.
config.yml
file to point to the input, output and database directories/files. Input file should contain tab-separated columns with the paths to the forward and reverse reads of each metagenome to map (ending in _1.fastq.gz
and _2.fastq.gz
, respectively). The database folder should point to a multi-FASTA file containing the reference genomes, indexed with BWA (bwa index
). If mapping to a reference database where each sequence represents its own genome (i.e, when mapping against a viral sequence catalog) change the option type
in the config.yml
from "contigs" to "complete". Note: If using the "contigs" option with a combined set of reference genomes make sure the names of the contigs in the final FASTA file are structured as [genome_name]_1, [genome_name]_2, etc. You can use the rename_multifasta_prefix.py
script here provided in the tools/
folder to rename each genome FASTA before concatenating the files into one.
-j
based on the number of available cores)
snakemake --use-conda -k -j 4
snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A {cluster.project} -p {cluster.queue} --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output} --time={cluster.time}'
The main output is located in the directory summary/
which contains four files:
bwa_counts-total.csv
: read counts per genome across all samples including multi-mapped reads.bwa_counts-unique.csv
: read counts per genome across all samples excluding multi-mapped reads.bwa_cov-est.csv
: breadth of coverage per genome across all samples.bwa_cov-exp.csv
: expected breadth of coverage per genome across all samples based on their level of read depth. This calculation is further discussed here.