The Earth Microbiome Project (EMP) is a systematic attempt to characterize global microbial taxonomic and functional diversity for the benefit of the planet and humankind.
This GitHub repository describes the EMP catalogue: how it is generated and how to use it. The EMP dataset is generated from samples that individual researchers have compiled and contributed to the EMP. The samples from each group of researchers constitute an individual EMP study. In addition to the analyses that contributing researchers perform on their individual studies, we perform cross-study meta-analyses. EMP 16S Release 1, a meta-analysis of the first 97 16S rRNA amplicon studies, has been published (article, preprint), and the code and methods used for that manuscript are provided here. EMP 16S Release 2, currently unpublished, includes additional 16S rRNA amplicon data. We are currently finalizing the EMP500, a multi-omics meta-analysis of 50 studies including >500 samples, each processed for 16S, 18S, and ITS amplicon sequencing, shotgun metagenomic sequencing, and metabolic profiling (preprint). Methods and standard operating procedures (SOPs) for additional amplicon sequencing, shotgun sequencing, and metabolomics related to EMP 16S Release 2 and the EMP500 are also provided here.
This repository contains the directories listed below. Each directory contains material related to EMP 16S Release 1 and EMP Multi-omics (EMP500).
methods
Methods used in EMP analyses. Includes sample processing for extraction and sequencing, and computational methods for performing analyses and generating figures for meta-analyses of the EMP dataset.
protocols
Laboratory protocols and SOPs for sample and metadata collection, sample tracking, amplicon sequencing, shotgun sequencing, and metabolomics.
code
IPython notebooks and scripts (Python, Java, R, Bash) developed for meta-analysis of EMP data; this code is used in methods.
data
Data files resulting from or used in processing and analysis.
papers
Preprints of major meta-analyses of the EMP dataset and links to papers about individual studies.
presentations
Links to slide decks from presentations on the EMP.
legacy
Early code, results, and website documents from the initial phase of the EMP (2010-2013).
There are several ways to get involved with the EMP:
The EMP catalogue is a diverse and standardized set of thousands of microbiomes for use by the public. Here are some of the ways you can use this resource:
Query the EMP catalogue using Redbiom. Redbiom is a command-line tool that allows users to query the Qiita database, including EMP studies. It allows you to find samples based on the sequences or taxa they contain or on sample metadata, and to export selected sample data and metadata. Once you have Redbiom installed, you can carry out queries such as those described here:
# First, summarize the contexts available. A context represents a partition by
# processing parameters (e.g., closed-reference OTU picking) and preparation
# (e.g., 16S V4).
redbiom summarize contexts | cut -f 1,2,3
# Create a variable for the context. For this example, we will use the closed-
# reference 16S V4 context by setting a local bash variable "ctx".
ctx=Pick_closed-reference_OTUs-illumina-16S-v4-66f541
# Query 1: "Show me all the genera that were observed at pH > 8."
# First we search for samples with pH > 8, then select the features from those
# samples, then summarize the taxonomy of those features, then grep for just
# the genera and count them.
redbiom search metadata "where ph > 8" | redbiom select features-from-samples \
--context $ctx | redbiom summarize taxonomy --context $ctx | grep g__ | wc -l
# Answer: There are 1423 genera found in samples with pH > 8.
# Query 2: "Show me all sites where Pyrobaculum are found."
# First we search for features that are genus Pyrobaculum, then search for
# samples containing those features, then fetch sample metadata for those
# samples and output the metadata file, then grab the columns for latitude and
# longitude (note: these are not guaranteed to reside in columns 10 and 11).
redbiom search taxon --context $ctx g__Pyrobaculum | redbiom search features \
--context $ctx | redbiom fetch sample-metadata --context $ctx \
--output g__Pyrobaculum_metadata.txt; cut -f 10,11 g__Pyrobaculum_metadata.txt
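Because the latitude and longitude columns are not guaranteed to be columns 10 and 11, one option is to select columns by header name rather than by position. The following is a minimal sketch, not part of Redbiom: the helper name cut_by_name is hypothetical, and it assumes the metadata file is tab-separated with a single header row. The column names latitude_deg and longitude_deg used in the usage note are also an assumption; check the header of your own file first.

```shell
# cut_by_name FILE NAME1,NAME2,...
# Print the named columns of a tab-separated file that has a header row.
# Hypothetical helper for illustration; it is not a Redbiom command.
cut_by_name() {
    awk -F '\t' -v OFS='\t' -v names="$2" '
        # Split the comma-separated list of requested column names.
        BEGIN { n = split(names, want, ",") }
        # On the header row, record the position of every column name and
        # fail early if a requested name is absent.
        NR == 1 {
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(want[j] in pos)) {
                    print "missing column: " want[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        # For every row (header included), print the requested columns
        # in the order they were requested.
        {
            line = $(pos[want[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[want[j]])
            print line
        }
    ' "$1"
}
```

Assuming those column names are present, the last step of Query 2 could then be written as: cut_by_name g__Pyrobaculum_metadata.txt latitude_deg,longitude_deg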
If you use the EMP 16S Release 1 data in your research, please cite Thompson et al., "A communal catalogue reveals Earth's multiscale microbial diversity", Nature, 2017 (article).
If you use the EMP500 data in your research, please cite Shaffer-Nothias-Thompson et al., "Multi-omics profiling of Earth’s biomes reveals that microbial and metabolite composition are shaped by the environment", bioRxiv, 2022 (preprint).
If you use EMP protocols in your research, please cite earthmicrobiome.org and the relevant papers referenced therein.
Some abbreviations used in this repository:
demux
is shorthand for "demultiplexed", which describes the fastq data after it is split into per-sample fastq files using barcodes.
deblur
refers to the exact-sequence de novo OTU picking method Deblur.
cr
refers to closed-reference OTU picking.
or
refers to open-reference OTU picking.
refseqs
refers to reference sequence collections that could be used in reference-based OTU picking.
mc2
refers to a minimum count of 2 sequences for an OTU to be included.
If you're looking for data generated and used for the ISME 14 EMP presentations, look here.