metaMap - quantifying genomes in metagenomes

Snakemake workflow to map and quantify prevalence and abundance of genomes in metagenomes.

Paired metagenomic reads are mapped to a genome database with BWA and filtered based on nucleotide identity, mapping score and alignment fraction with SAMtools. Read counts, coverage and depth are calculated per genome for each metagenome and final summary files are generated across samples. Some of these concepts are discussed in great detail by Matt Olm (developer of dRep and inStrain) here.

Installation

Install conda and snakemake

Clone repository

git clone https://github.com/alexmsalmeida/metamap.git

How to run

Edit config.yml file to point to the input, output and database directories/files. Input file should contain tab-separated columns with the paths to the forward and reverse reads of each metagenome to map (ending in _1.fastq.gz and _2.fastq.gz, respectively). The database folder should point to a multi-FASTA file containing the reference genomes, indexed with BWA (bwa index). If mapping to a reference database where each sequence represents its own genome (i.e, when mapping against a viral sequence catalog) change the option type in the config.yml from "contigs" to "complete".

Note: If using the "contigs" option with a combined set of reference genomes make sure the names of the contigs in the final FASTA file are structured as [genome_name]_1, [genome_name]_2, etc. You can use the rename_multifasta_prefix.py script here provided in the tools/ folder to rename each genome FASTA before concatenating the files into one.

(option 1) Run the pipeline locally (adjust -j based on the number of available cores)
```
snakemake --use-conda -k -j 4
```

(option 2) Run the pipeline on a cluster (e.g., SLURM)

snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A {cluster.project} -p {cluster.queue} --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output} --time={cluster.time}'

Output

The main output is located in the directory summary/ which contains four files:

bwa_counts-total.csv: read counts per genome across all samples including multi-mapped reads.
bwa_counts-unique.csv: read counts per genome across all samples excluding multi-mapped reads.
bwa_cov-est.csv: breadth of coverage per genome across all samples.
bwa_cov-exp.csv: expected breadth of coverage per genome across all samples based on their level of read depth. This calculation is further discussed here.

alexmsalmeida / metamap

readme

metaMap - quantifying genomes in metagenomes

Installation

How to run

Output