
Project Status: Active - The project has reached a stable, usable state and is being actively developed.

MAGinator

MAGinator combines the strengths of contig-based and gene-based methods to provide accurate SNV calling and profiling of MAGs.

Installation

All you need to run MAGinator is snakemake and mamba; the remaining dependencies are installed automatically by snakemake.

conda create -n maginator -c bioconda -c conda-forge snakemake mamba
conda activate maginator
pip install maginator
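
You can check that the installation works by printing the help message, which also lists all parameters:

maginator --help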

Furthermore, MAGinator needs the GTDB-Tk database downloaded. Here we download release 214; if you don't already have it, you can run the following:

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz
tar xvzf *.tar.gz

Usage

MAGinator needs 3 input files:

  1. The clustering file from VAMB (-v), with one line per contig pairing it to its bin
  2. A comma-separated file (-r) with sample names and paths to the read files
  3. A fasta file (-c) with all the assembled contigs
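
A minimal reads file could look like this (the sample names and paths are placeholders; with paired-end data each sample lists its forward and reverse reads):

sampleA,/path/to/sampleA_R1.fastq.gz,/path/to/sampleA_R2.fastq.gz
sampleB,/path/to/sampleB_R1.fastq.gz,/path/to/sampleB_R2.fastq.gz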

Run MAGinator:

maginator -v vamb_clusters.tsv -r reads.csv -c contigs.fasta -o my_output -g "/path/to/GTDB-Tk/database/release214/"

Run on a compute cluster

MAGinator can run on compute clusters using qsub (torque), sbatch (slurm), or drmaa. The --cluster argument selects the type of compute cluster infrastructure, and the --cluster_info argument defines the information given to the submission command; it has to contain the keywords {cores}, {memory}, and {runtime}, which are used to forward resource requirements to the cluster.

For example, MAGinator can be run on a qsub cluster with the following command (... indicates the required arguments, see above):

maginator ... --cluster qsub --cluster_info "-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}"

Test data

A test set can be found in the maginator/test_data directory.

  1. Download the 3 samples used for the test from SRA: https://www.ncbi.nlm.nih.gov/sra?LinkName=bioproject_sra_all&from_uid=715601 with the IDs dfc99c_A, f9d84e_A, and 221641_A
  2. Clone the repo: git clone https://github.com/Russel88/MAGinator.git
  3. Change the paths to the read-files in reads.csv
  4. Unzip the contigs.fasta.gz
  5. Run MAGinator

MAGinator can be run on the test data on a slurm server with the following command:

maginator --vamb_clusters clusters.tsv --reads reads.csv --contigs contigs.fasta --gtdb_db data/release214/ --output test_out --cluster slurm --cluster_info "-n {cores} --mem {memory}gb -t {runtime}" --max_mem 180

The expected output can be found as a zipped file on Zenodo: https://doi.org/10.5281/zenodo.8279036. It was produced by running MAGinator on the test data (using GTDB-Tk db release207_v2) on a slurm server.

On the compute cluster each job had access to 180 GB of RAM, with the following time consumption: real 72m27.379s, user 0m18.830s, sys 1m0.454s.

If you run on a smaller server, you can limit resource usage with the --max_cores and --max_mem parameters.
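
For example, to cap a run at 10 cores and 60 GB of memory (the values here are only illustrative; pick limits that match your server, and ... indicates the required arguments, see above):

maginator ... --max_cores 10 --max_mem 60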

Recommended workflow

To generate the input files for MAGinator we have created a recommended workflow covering preprocessing, assembly, and binning of your metagenomic reads (the rules for binning have been copied from VAMB: https://github.com/RasmussenLab/vamb/blob/master/workflow/). It is set up as a snakefile in recommended_workflow/reads_to_bins.Snakefile.

The input to the workflow is the reads.csv file. The workflow can be run using snakemake:

snakemake --use-conda -s reads_to_bins.Snakefile --resources mem_gb=180 --config reads=reads.csv --cores 10 --printshellcmds 

Once the binning is done, we recommend using a tool like dRep (https://github.com/MrOlm/drep) to create the species-level clusters. The advantage of dRep is that its clustering parameters can be adjusted to create clusters at different taxonomic levels. The R script recommended_workflow/MAGinator_setup.R creates the MAGinator input files from the dRep output.
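
As a sketch, a species-level dRep run could look like this (the bin directory, output directory, and the 95% secondary ANI threshold are assumptions; adjust them to your data):

dRep dereplicate drep_out -g bins/*.fa -sa 0.95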

Preparing data for MAGinator run

Replace any '@' characters in the contig names with '_' in both the assembly and the cluster file:

sed 's/@/_/g' assembly/all_assemblies.fasta > all_assemblies.fasta
sed 's/@/_/g' vamb/clusters.tsv > clusters.tsv

Now you are ready to run MAGinator.
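
For example (the output name and database path are placeholders):

maginator -v clusters.tsv -r reads.csv -c all_assemblies.fasta -o maginator_out -g "/path/to/GTDB-Tk/database/release214/"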

Functional Annotation

To generate the functional annotation of the genes we recommend using EggNOG mapper (https://github.com/eggnogdb/eggnog-mapper).
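
As a sketch, EggNOG-mapper can be installed via pip and its databases fetched with its bundled download script (see the eggnog-mapper documentation for database location options):

pip install eggnog-mapper
download_eggnog_data.py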

You can then try running it on the test data:

mkdir test_out/functional_annotation
emapper.py -i test_out/genes/all_genes_rep_seq.fasta --output test_out/functional_annotation -m diamond --cpu 38

The eggNOG output can be merged with clusters.tsv and further processed to obtain functional annotations at the MAG, cluster, or sample level with the following command:

(echo -e '#sample\tMAG_cluster\tMAG\tfunction'; \
 join -1 1 -2 1 \
   <(awk '{print $2 "\t" $1}' clusters.tsv | sort) \
   <(tail -n +6 annotations.tsv | head -n -3 | cut -f1,15 | grep -v '\-$' \
     | sed 's/_[[:digit:]]\+\t/\t/' | sed 's/,/\n/g' \
     | perl -lane '{$q = $F[0] if $#F > 0; unshift(@F, $q) if $#F == 0}; print "$F[0]\t$F[1]"' \
     | sed 's/\tko:/\t/' | sort) \
   | awk '{print $2 "\t" $2 "\t" $3}' | sed 's/_/\t/' | sort -k1,1 -k2,2n) > MAGfunctions.tsv

In this case the KEGG ortholog column (column 15) was picked from the eggNOG-mapper output. Cutting column 13 instead would yield GO terms. Refer to the header of the eggNOG-mapper output for the other available functional annotations, e.g. KEGG pathways, Pfam, CAZy, and COGs.
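
For example, to extract GO terms, change the cut step in the command above (the sed 's/\tko:/\t/' step only strips KEGG prefixes and can then be dropped):

cut -f1,13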

MAGinator workflow

This is what MAGinator does with your input (to see all parameters, run maginator --help):

Output