metaMap - quantifying genomes in metagenomes

Snakemake workflow to map and quantify prevalence and abundance of genomes in metagenomes.

Paired metagenomic reads are mapped to a genome database with BWA and filtered based on nucleotide identity, mapping score and alignment fraction with SAMtools. Read counts, coverage and depth are calculated per genome for each metagenome and final summary files are generated across samples. Some of these concepts are discussed in great detail by Matt Olm (developer of dRep and inStrain) here.
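For orientation, the per-sample logic is roughly equivalent to the sketch below. File names, the thread count and output paths are placeholders, and the workflow's own rules apply the nucleotide identity and alignment-fraction filters, which are not reproduced here.

    # Illustrative per-sample sketch: map reads, keep mapped alignments, sort, then
    # summarise read counts, coverage and mean depth per reference with samtools coverage
    bwa mem -t 8 db/ref_genomes.fa sample_1.fastq.gz sample_2.fastq.gz |
        samtools view -b -F 4 - |
        samtools sort -o sample.sorted.bam -
    samtools index sample.sorted.bam
    samtools coverage sample.sorted.bam > sample.coverage.tsv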

Installation

  1. Install conda and snakemake (see the setup sketch after this list)

  2. Clone repository

    git clone https://github.com/alexmsalmeida/metamap.git
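For step 1, a common route is to install Snakemake into its own conda environment once conda itself is available. This is a minimal sketch; the environment name, channels and version pinning are up to you.

    # Create and activate a dedicated environment with Snakemake from Bioconda
    conda create -n snakemake -c conda-forge -c bioconda snakemake
    conda activate snakemake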

How to run

  1. Edit the config.yml file to point to the input, output and database directories/files. The input file should contain tab-separated columns with the paths to the forward and reverse reads of each metagenome to map (ending in _1.fastq.gz and _2.fastq.gz, respectively). The database path should point to a multi-FASTA file containing the reference genomes, indexed with BWA (bwa index). If mapping to a reference database where each sequence represents its own genome (e.g., when mapping against a viral sequence catalog), change the type option in config.yml from "contigs" to "complete".

Note: If using the "contigs" option with a combined set of reference genomes, make sure the contig names in the final FASTA file are structured as [genome_name]_1, [genome_name]_2, etc. You can use the rename_multifasta_prefix.py script provided in the tools/ folder to rename each genome FASTA before concatenating the files into one, as sketched below.
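Database preparation could look roughly like the following. The directory layout (genomes/, renamed/, db/) and the combined FASTA name are hypothetical, and the rename script's flags are an assumption; check python tools/rename_multifasta_prefix.py -h for its actual interface.

    # Hypothetical layout: one FASTA per genome in genomes/, combined database written to db/
    mkdir -p renamed db
    for fa in genomes/*.fa; do
        name=$(basename "$fa" .fa)
        # Flags below are an assumption; consult the script's help for the real options
        python tools/rename_multifasta_prefix.py -f "$fa" -p "$name" > renamed/"$name".fa
    done
    cat renamed/*.fa > db/ref_genomes.fa
    bwa index db/ref_genomes.fa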

  2. (option 1) Run the pipeline locally (adjust -j based on the number of available cores)
    snakemake --use-conda -k -j 4
  3. (option 2) Run the pipeline on a cluster (e.g., SLURM)
    snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A {cluster.project} -p {cluster.queue} --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output} --time={cluster.time}'
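With either option, a Snakemake dry run is a quick way to confirm that the config and input files are picked up before committing compute time; this is standard Snakemake behaviour rather than anything specific to this workflow.

    # Dry run: print the jobs that would be executed without running them
    snakemake --use-conda -n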

Output

The main output is located in the summary/ directory, which contains four files: