czbiohub-sf / MIDAS

Metagenomic Intra-Species Diversity Analysis (MIDAS)
MIT License

Metagenomic Intra-Species Diversity Analysis

MIDAS 2


Metagenomic Intra-Species Diversity Analysis (MIDAS) is an integrated pipeline for profiling strain-level genomic variation in shotgun metagenomic data. The standard MIDAS workflow harnesses a reference database of 5,926 species extracted from 30,000 genomes (MIDAS DB v1.2). MIDAS2 uses the same analysis workflow as the original MIDAS tool, but is engineered to work with more comprehensive MIDAS Reference Databases (MIDASDBs) and to run on collections of thousands of samples in a fast and scalable manner.

For MIDAS2, we have already built two MIDASDBs from large, public, microbial genome databases: UHGG 1.0 and GTDB r202.

The publication is available in Bioinformatics. The user manual is available at ReadTheDocs.

The performance of a read-mapping-based metagenotyping pipeline depends on, among other factors, (1) how closely related the DB reference genomes are to the strains in the samples being genotyped and (2) the post-alignment filter options. Pitfalls of genotyping microbial communities with rapidly growing genome collections can be found here.

Quick Installation:

conda create -n midas2 -c zhaoc1 -c conda-forge -c bioconda -c anaconda -c defaults midas

MIDAS version 3


MIDAS version 3, previously known as MIDAS2, features major updates to its pangenome database. These updates include a refined curation process and a comprehensive functional annotation pipeline. MIDASDB can construct species-level pangenome databases from external reference genome collections, e.g. UHGG or GTDB, by clustering predicted genes into operational gene families (OGFs) at various average nucleotide identity (ANI) thresholds, with the representative gene sequence of each OGF assigned as the centroid by vsearch.

  1. MIDAS v3 made significant changes to the curation pipeline, aiming to minimize the impact of fragmented gene sequences, spurious gene calls, chimeric assemblies, and redundant OGFs resulting from cross-species contamination and highly fragmented MAGs.
  2. Functional annotation includes a voting mechanism that assesses the ratio of genes in each OGF related to phages, plasmids, mobile elements, and antimicrobial resistance, an improvement over common methods that rely on a single centroid gene.
  3. For pangenome profiling, MIDAS v3 compiles representative gene sequences from the 99% ANI level OGFs into a Bowtie2 index for alignment and quantification. It can also prune potentially spurious singletons at the 75% level and/or short OGFs. Vertical gene family coverage is calculated as the number of aligned reads over the gene length.

     The first step is to generate the pruned centroid sequences for the species of interest:

midas prune_centroids --midasdb_name localdb --midasdb_dir /path/to/midasdb-uhgg-v2 -t 1 --remove_singleton --species 100001 --force --debug

     The second step is to pass the pruning arguments (--prune_centroids, --remove_singleton) to run_genes:

midas run_genes --midasdb_name localdb --midasdb_dir /path/to/midasdb-uhgg-v2 --num_cores 8 --select_threshold=-1 --species_list 100723,104323,100041 --prune_centroids --remove_singleton midas_output
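To make the coverage definition in item 3 concrete, here is a minimal Python sketch of "number of aligned reads over the gene length". It is an illustration only, not MIDAS code: the function name `vertical_coverage`, the gene family names, and the read counts are all hypothetical.

```python
def vertical_coverage(aligned_reads: int, gene_length: int) -> float:
    """Coverage as worded in the text: aligned reads divided by gene length.

    Hypothetical helper for illustration; not part of the MIDAS API.
    """
    if gene_length <= 0:
        raise ValueError("gene_length must be a positive number of bases")
    if aligned_reads < 0:
        raise ValueError("aligned_reads cannot be negative")
    return aligned_reads / gene_length


# Made-up example data: (aligned read count, centroid gene length in bp).
gene_families = {
    "ogf_A": (150, 1000),
    "ogf_B": (0, 500),
}

coverage = {
    name: vertical_coverage(reads, length)
    for name, (reads, length) in gene_families.items()
}
```

A gene family with no aligned reads gets a coverage of zero, and longer genes need proportionally more reads to reach the same coverage.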

Details of these updates can be found at the provided link.
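The voting idea in item 2 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the MIDAS implementation: the category labels, the function name `annotation_ratios`, and the input layout (one set of labels per member gene) are hypothetical.

```python
from collections import Counter

# Hypothetical label names for the four annotation categories in the text.
CATEGORIES = ("phage", "plasmid", "mobile_element", "amr")


def annotation_ratios(member_labels):
    """Return, per category, the fraction of member genes in one OGF carrying it.

    member_labels: one set of category labels per gene in the OGF
    (an empty set means the gene has none of the annotations).
    """
    n_genes = len(member_labels)
    if n_genes == 0:
        raise ValueError("an OGF must contain at least one gene")
    # Count each gene at most once per category.
    counts = Counter(label for labels in member_labels for label in set(labels))
    return {category: counts[category] / n_genes for category in CATEGORIES}


# Made-up OGF with four member genes.
ogf_members = [{"phage"}, {"phage", "amr"}, set(), {"phage"}]
ratios = annotation_ratios(ogf_members)
```

Because every member gene contributes to the ratio, a single unannotated or mis-annotated centroid does not decide the annotation of the whole OGF. How the ratios are turned into a final call is not stated in the text above, so no cutoff is assumed here.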

Quick Installation:

conda config --set channel_priority flexible
conda create -n midasv3 -c zhaoc1 -c conda-forge -c bioconda -c anaconda -c defaults midasv3=1.0.0
bash tests/test_analysis.sh 8