biobakery / humann

HUMAnN is the next generation of HUMAnN 1.0 (HMP Unified Metabolic Analysis Network).
http://huttenhower.sph.harvard.edu/humann

ATTENTION

Before opening a new issue here, please check the appropriate help channel on the bioBakery Support Forum (https://forum.biobakery.org) and consider opening or commenting on a thread there.


HUMAnN User Manual

HUMAnN is the next generation of HUMAnN (HMP Unified Metabolic Analysis Network).


If you use HUMAnN in your work, please cite the HUMAnN 3.0 paper:

Francesco Beghini1, Lauren J McIver2, Aitor Blanco-Míguez1, Leonard Dubois1, Francesco Asnicar1, Sagun Maharjan2,3, Ana Mailyan2,3, Andrew Maltez Thomas1, Paolo Manghi1, Mireia Valles-Colomer1, George Weingart2,3, Yancong Zhang2,3, Moreno Zolfo1, Curtis Huttenhower2,3, Eric A Franzosa2,3, Nicola Segata1,4

Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3

eLife 2021;10:e65088

1 Department CIBIO, University of Trento, Italy
2 Harvard T. H. Chan School of Public Health, Boston, MA, USA
3 The Broad Institute of MIT and Harvard, Cambridge, MA, USA
4 IEO, European Institute of Oncology IRCCS, Milan, Italy

And feel free to link to HUMAnN in your Methods:

http://huttenhower.sph.harvard.edu/humann


For additional information, read the HUMAnN Tutorial

HUMAnN is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. More generally, functional profiling answers the question "What are the microbes in my community-of-interest doing (or capable of doing)?"


Contents


Features

  1. Community functional profiles stratified by known and unclassified organisms

    • MetaPhlAn and the ChocoPhlAn pangenome database are used to facilitate fast, accurate, and organism-specific functional profiling
    • Organisms included are Archaea, Bacteria, Eukaryotes, and Viruses
  2. Considerably expanded databases of genomes, genes, and pathways

    • UniRef database provides gene family definitions
    • MetaCyc provides pathway definitions by gene family
    • MinPath is run to identify the set of minimum pathways
  3. A simple user interface (single command driven flow)

    • The user only needs to provide a quality-controlled metagenome or metatranscriptome
  4. Accelerated mapping of reads to reference databases (including run-time generated databases tailored to the input)

    • Bowtie2 is run for accelerated nucleotide-level searches
    • Diamond is run for accelerated translated searches

Workflows

Main workflow

Workflow by input file type

There are four different types of files that can be provided as input to HUMAnN. By default HUMAnN will determine the type of the file. As shown in the figure below, the type of input file will determine where HUMAnN will start the workflow. Files of type 2, 3, and 4 will begin the workflow after the alignment steps.

File Types:


Workflow by bypass mode

There are multiple bypass options that will allow you to adjust the standard workflow.

Bypass options:


Workflow of the resume option

HUMAnN includes a "--resume" option which will allow you to bypass alignment steps which have already been completed. For example, if you originally ran with a bypass option you can run just the step you bypassed with "--resume". This will only run the alignment step you bypassed and then recompute the gene families and pathways.

When using the "--resume" option, the following steps will be bypassed if they have already been completed:

  1. Taxonomic profiling step
  2. Nucleotide alignment step
  3. Custom ChocoPhlAn database creation (merge and index)
  4. Translated alignment step

Requirements

Software

  1. MetaPhlAn
  2. Bowtie2 (version >= 2.2.9) (automatically installed)
  3. Diamond (version >= 0.9.24) (automatically installed)
  4. Python (version >= 3.7)
  5. MinPath (automatically installed)
  6. Xipe (optional / included)
  7. SAMtools (only required if bam input files are provided)
  8. Biom-format (only required if input or output files are in biom format)

Please install the required software in a location in your $PATH or provide the location with an optional argument to HUMAnN 3.0. For example, the location of the Bowtie2 install ($BOWTIE2_DIR) can be provided with "--bowtie2 $BOWTIE2_DIR".

Some of the required dependencies are installed automatically when installing HUMAnN with pip. If these dependencies do not appear to be installed after installing HUMAnN with pip, it might be that your environment is set up to use wheels instead of installing from source. HUMAnN must be installed from source for it to also install its dependencies. To force pip to install HUMAnN from source, add one of the following options to your install command: "--no-use-wheel" or "--no-binary :all:".

If you always run with input files of type 2, 3, and 4 (for information on input file types, see section Workflow by input file type), MetaPhlAn, Bowtie2, and Diamond are not required. Also, if you always run with one or more bypass options (for information on bypass options, see section Workflow by bypass mode), the software required for the steps you bypass does not need to be installed.

Other

  1. Memory (>= 16 GB)
  2. Disk space (>= 15 GB [to accommodate comprehensive sequence databases])
  3. Operating system (Linux or Mac)

If always running with files of type 2, 3, and 4 (for information on file types, see section Workflow by input file type), less disk space is required.


Initial Installation

1. Download HUMAnN

You can download the latest HUMAnN release or the development version. The source contains example files. If installing with pip, it is optional to first download the HUMAnN source.

Option 1: Latest Release (Recommended)

Option 2: Development Version


2. Install HUMAnN

Installing with pip

  1. Install HUMAnN
    • $ pip install humann
    • This command will automatically install MinPath (and a new version of glpk) along with Bowtie2 and Diamond (if they are not already installed).
    • To bypass the install of Bowtie2 and Diamond, add the option "--install-option='--bypass-dependencies-install'" to the install command.
    • To build Diamond from source during the install, add the option "--install-option='--build-diamond'" to the install command.
    • To overwrite existing installs of Bowtie2 and Diamond, add the option "--install-option='--replace-dependencies-install'" to the install command.
    • If you do not have write permissions to '/usr/lib/', then add the option "--user" to the HUMAnN install command. This will install the python package into subdirectories of '~/.local' on Linux. Please note, when using the "--user" install option on some platforms, you might need to add '~/.local/bin/' to your $PATH as it might not be included by default. You will know it needs to be added if you see the message "humann: command not found" when trying to run HUMAnN after installing with the "--user" option.

Installing from source

  1. Move to the HUMAnN directory

    • $ cd $HUMAnN_PATH
  2. Install HUMAnN

    • $ python setup.py install
    • This command will automatically install MinPath (and a new version of glpk) along with Bowtie2 and Diamond (if they are not already installed).
    • To bypass the install of Bowtie2 and Diamond, add the option "--bypass-dependencies-install" to the install command.
    • If you do not have write permissions to '/usr/lib/', then add the option "--user" to the install command. This will install the python package into subdirectories of '~/.local'. Please note, when using the "--user" install option on some platforms, you might need to add '~/.local/bin/' to your $PATH as it might not be included by default. You will know it needs to be added if you see the message "humann: command not found" when trying to run HUMAnN after installing with the "--user" option.

3. Test the install

  1. Test out the install with unit and functional tests
    • $ humann_test
    • To also run tool tests, add the option "--run-functional-tests-tools".
    • To also run end-to-end tests, add the option "--run-functional-tests-end-to-end". Please note these tests take about 20 minutes to run. They also require that all dependencies of HUMAnN are installed in your $PATH.

4. Try out a demo run

With HUMAnN installed you can try out a demo run using reduced versions of the databases. If installing from pip, please also download the source as it contains the demo examples.

$ humann --input examples/demo.fastq --output $OUTPUT_DIR

Output from this demo run will be written to the folder $OUTPUT_DIR.

Please continue with the install directions to download the full databases before running with your sequencing data.


5. Download the databases

Downloading the databases is a required step if your input is a filtered shotgun sequencing metagenome file (fastq, fastq.gz, fasta, or fasta.gz format). If your input files will always be mapping results files (sam, bam or blastm8 format) or gene tables (tsv or biom format), you do not need to download the ChocoPhlAn and translated search databases.


Download the ChocoPhlAn database

Download the ChocoPhlAn database providing $INSTALL_LOCATION as the location to install the database (approximate size = 16.4 GB).

$ humann_databases --download chocophlan full $INSTALL_LOCATION

NOTE: The humann config file will be updated to point to this location for the default chocophlan database. If you move this database, please use the "humann_config" command to update the default location of this database. Alternatively you can always provide the location of the chocophlan database you would like to use with the "--nucleotide-database " option to humann.


Download a translated search database

While there is only one ChocoPhlAn database, users have a choice of translated search databases which offer trade-offs between resolution, coverage, and performance. For help selecting a translated search database, see the following guides:

Download a translated search database providing $INSTALL_LOCATION as the location to install the database:

NOTE: The humann config file will be updated to point to this location for the default uniref database. If you move this database, please use the humann_config command to update the default location of this database. Alternatively you can always provide the location of the uniref database you would like to use with the --protein-database <uniref> option to humann.

Note: Please make sure that all of the databases in a directory are of the same version. Databases of different versions must be kept in separate directories.


Installation Update

If you have already installed HUMAnN, using the Initial Installation steps, and would like to upgrade your installed version to the latest version, please follow these steps.

  1. Download HUMAnN
  2. Install HUMAnN

Since you have already downloaded the databases during the initial installation, you do not need to download them again unless new versions are available. However, you will want to update your new HUMAnN install to point to the databases you have already downloaded, as by default the new install's configuration will point to the demo databases.

To update your HUMAnN configuration file to include the locations of your downloaded databases, please use the following steps.

  1. Update the location of the ChocoPhlAn database (replacing $DIR with the full path to the directory containing the ChocoPhlAn database)

    $ humann_config --update database_folders nucleotide $DIR

  2. Update the location of the UniRef database (replacing $DIR with the full path to the directory containing the UniRef database)

    $ humann_config --update database_folders protein $DIR

Please note, after a new installation, all of the settings in the configuration file, like the database folders, will be reset to the defaults. If you have any additional settings that differ from the defaults, please update them at this time.

For more information on the HUMAnN configuration file, please see the Configuration section.


How to run

Basic usage

$ humann --input $SAMPLE --output $OUTPUT_DIR

$SAMPLE = a single file that is one of the following types:

  1. filtered shotgun sequencing metagenome file (fastq, fastq.gz, fasta, or fasta.gz format)
  2. alignment file (sam, bam or blastm8 format)
  3. gene table file (tsv or biom format)

$OUTPUT_DIR = the output directory

Five main output files will be created:

  1. $OUTPUT_DIR/$SAMPLENAME_0.log
  2. $OUTPUT_DIR/$SAMPLENAME_1_metaphlan_profile.tsv
  3. $OUTPUT_DIR/$SAMPLENAME_2_genefamilies.tsv
  4. $OUTPUT_DIR/$SAMPLENAME_3_reactions.tsv
  5. $OUTPUT_DIR/$SAMPLENAME_4_pathabundance.tsv

where $SAMPLENAME is the basename of $SAMPLE.

In HUMAnN v3, the log file and the taxonomic profile are included in the intermediate temp folder. Also, in HUMAnN v3, the output files do not include numbers in their names.

*The gene families file will not be created if the input file type is a gene table.

Intermediate temp files will also be created:

  1. $DIR/$SAMPLENAME_bowtie2_aligned.sam
    • the full alignment output from bowtie2
  2. $DIR/$SAMPLENAME_bowtie2_aligned.tsv
    • only the reduced aligned data from the bowtie2 output
  3. $DIR/$SAMPLENAME_bowtie2_index
    • bowtie2 index files created from the custom chochophlan database
  4. $DIR/$SAMPLENAME_bowtie2_unaligned.fa
    • a fasta file of unaligned reads after the bowtie2 step
  5. $DIR/$SAMPLENAME_custom_chocophlan_database.ffn
    • a custom chocophlan database of fasta sequences
  6. $DIR/$SAMPLENAME_metaphlan_bowtie2.txt
    • the bowtie2 output from metaphlan
  7. $DIR/$SAMPLENAME_$TRANSLATEDALIGN_aligned.tsv
    • the alignment results from the translated alignment step
  8. $DIR/$SAMPLENAME_$TRANSLATEDALIGN_unaligned.fa
    • a fasta file of unaligned reads after the translated alignment step

NOTE: $SAMPLENAME can be set by the user with the option --output-basename <$NEWNAME>.


Demo runs

The examples folder contains four demo example input files. These files are of fasta, fastq, sam, and blastm8 format. Blastm8 format is created by the following software: diamond.

To run the fasta demo:

$ humann --input examples/demo.fasta --output $OUTPUT_DIR

To run the fastq demo:

$ humann --input examples/demo.fastq --output $OUTPUT_DIR

To run the sam demo:

$ humann --input examples/demo.sam --output $OUTPUT_DIR

To run the blastm8 demo:

$ humann --input examples/demo.m8 --output $OUTPUT_DIR

$OUTPUT_DIR is the output directory

Since sam and blastm8 are mapping results, using these files as input to HUMAnN will bypass both the nucleotide and translated mapping portions of the flow.


Standard workflow

The standard workflow involves running HUMAnN on each filtered shotgun sequencing metagenome file, normalizing, and then merging the output files.

Prior to running the workflow, filter your shotgun sequencing metagenome file. We recommend using KneadData for metagenome quality control including removal of host reads.

To run the standard workflow, follow these steps:

  1. Run HUMAnN on each of your filtered fastq files in $INPUT_DIR placing the results in $OUTPUT_DIR

    • for $SAMPLE.fastq in $INPUT_DIR
      • $ humann --input $SAMPLE.fastq --output $OUTPUT_DIR
        • Replace $SAMPLE.fastq with the name of the fastq input file
        • Replace $OUTPUT_DIR with the full path to the folder to write output
      • If you have paired-end reads, please see the guide Paired-end reads
    • The results will be three main output files for each input file, named $SAMPLE_2_genefamilies.tsv, $SAMPLE_3_reactions.tsv, and $SAMPLE_4_pathabundance.tsv.
  2. Normalize the abundance output files

    • We recommend normalizing the abundance data based on the statistical tests you will perform.
    • Prior to normalization, select the scheme to use (copies per million or relative abundance). For example, if using MaAsLin, select relative abundance.
    • Use the HUMAnN tool humann_renorm_table to compute the normalized abundances (relative abundance is selected in the example command below)
      • for $SAMPLE_2_genefamilies.tsv in $OUTPUT_DIR
        • $ humann_renorm_table --input $SAMPLE_2_genefamilies.tsv --output $SAMPLE_2_genefamilies_relab.tsv --units relab
    • Please note, gene family abundance is reported in RPK (reads per kilobase). This is computed as the sum of the scores for all alignments for a gene family. An alignment score is based on the number of matches to the reference gene for a specific sequence. It is divided by the length of the reference gene in kilobases to normalize for gene length. Each alignment score is also normalized to account for alignments for a single sequence to multiple reference genes. Alignments are not considered if they do not pass the e-value, identity, and coverage thresholds.
    • If you would like to normalize using the number of reads aligned per input file, this count along with the total number of reads and the percent unaligned reads after each alignment step is included in the log file. For more information on what is included in the log file, see the Intermediate temp output file section Log.
    • Alternatively, gene families can be regrouped to different functional categories prior to normalization. See the guide to humann_regroup_table for detailed information.
  3. Join the output files (gene families and pathway abundances) from the HUMAnN runs for all samples into multi-sample tables

    • $ humann_join_tables --input $OUTPUT_DIR --output humann_2_genefamilies.tsv --file_name genefamilies_relab
    • $ humann_join_tables --input $OUTPUT_DIR --output humann_4_pathabundance.tsv --file_name pathabundance_relab
    • For each command, replace $OUTPUT_DIR with the full path to the folder containing the HUMAnN output files.
    • The resulting files from these commands are named humann_2_genefamilies.tsv and humann_4_pathabundance.tsv.
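
The RPK accounting described in step 2 can be sketched in a few lines. This is an illustrative simplification rather than HUMAnN's actual implementation; the function name, read IDs, gene IDs, and lengths below are hypothetical:

```python
from collections import defaultdict

def gene_rpk(alignments, gene_lengths_bp):
    """alignments: {read_id: [(gene, score), ...]}; returns {gene: RPK}.

    Each read's alignment scores are normalized across the genes it hits,
    and each gene's total is divided by its length in kilobases.
    """
    totals = defaultdict(float)
    for hits in alignments.values():
        score_sum = sum(score for _, score in hits)
        for gene, score in hits:
            # weight a multi-mapping read by its share of alignment score
            totals[gene] += score / score_sum
    # normalize for gene length (reads per kilobase of reference gene)
    return {g: v / (gene_lengths_bp[g] / 1000.0) for g, v in totals.items()}

alignments = {
    "r1": [("geneA", 100.0)],                  # read aligned to one gene
    "r2": [("geneA", 90.0), ("geneB", 30.0)],  # multi-mapping read
}
gene_lengths = {"geneA": 500, "geneB": 1000}
rpk = gene_rpk(alignments, gene_lengths)
```

In this toy example, geneA receives 1 + 90/120 = 1.75 weighted reads over 0.5 kb, i.e. 3.5 RPK.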

Output files

When HUMAnN is run, the following main output files will be created (where $SAMPLENAME = the basename of $SAMPLE):

1. Gene families file

# Gene Family   $SAMPLENAME_Abundance-RPKs
READS_UNMAPPED        187.0
UniRef50_unknown        150.0
UniRef50_unknown|g__Bacteroides.s__Bacteroides_fragilis 150.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon    67.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_fragilis 57.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_finegoldii   5.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_stercoris    4.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|unclassified   1.0
UniRef50_O83668: Fructose-bisphosphate aldolase 60.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_vulgatus  31.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_thetaiotaomicron  22.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_stercoris 7.0
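
The stratified format above (community totals on unpiped lines, per-organism breakdowns on "|" lines) is straightforward to parse. A minimal sketch, assuming tab-separated input; parse_stratified is a hypothetical helper, not a HUMAnN utility:

```python
def parse_stratified(lines):
    """Split a HUMAnN table into community totals and per-taxon values."""
    community, stratified = {}, {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        feature, value = line.rsplit("\t", 1)
        if "|" in feature:
            base, taxon = feature.split("|", 1)
            stratified.setdefault(base, {})[taxon] = float(value)
        else:
            community[feature] = float(value)
    return community, stratified

demo = [
    "# Gene Family\t$SAMPLENAME_Abundance-RPKs",
    "UniRef50_O83668: Fructose-bisphosphate aldolase\t60.0",
    "UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_vulgatus\t31.0",
    "UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_thetaiotaomicron\t22.0",
    "UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_stercoris\t7.0",
]
community, stratified = parse_stratified(demo)
```

Here the three per-species values (31.0 + 22.0 + 7.0) sum to the community total of 60.0.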

2. Reactions file

# Reaction      $SAMPLENAME_Abundance
READS_UNMAPPED      7573721.0000000000
UNGROUPED    22978726.4009404778
UNGROUPED|g__Bacteroides.s__Bacteroides_vulgatus     1961355.5951347812
UNGROUPED|g__Bacteroides.s__Bacteroides_uniformis    1579305.9455720731
UNGROUPED|unclassified       5100980.1244789278
2.7.13.3-RXN    75608.6164459985
2.7.13.3-RXN|g__Bacteroides.s__Bacteroides_uniformis    7962.2303221576
2.7.13.3-RXN|g__Faecalibacterium.s__Faecalibacterium_prausnitzii        5371.8462336209
2.7.13.3-RXN|unclassified       12529.4324849206

3. Pathway abundance file

# Pathway   $SAMPLENAME_Abundance
UNMAPPED    140.0
UNINTEGRATED    87.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_caccae   23.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_finegoldii   20.0
UNINTEGRATED|unclassified   12.0
PWY0-1301: melibiose degradation    57.5
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_caccae   32.5
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_finegoldii   4.5
PWY0-1301: melibiose degradation|unclassified   3.0
PWY-5484: glycolysis II (from fructose-6P)  54.7
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_caccae 16.7
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_finegoldii 8.0
PWY-5484: glycolysis II (from fructose-6P)|unclassified 6.0

4. MetaPhlAn profile

#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       100.0
k__Bacteria|p__Bacteroidetes    2|11    73.86802
k__Bacteria|p__Firmicutes       2|22    26.13198
k__Bacteria|p__Bacteroidetes|c__Bacteroidia     2|11|111        73.86802
k__Bacteria|p__Firmicutes|c__Clostridia         2|22|222        15.60912
k__Bacteria|p__Firmicutes|c__Negativicutes      2|22|333        10.52286
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales    2|11|111|1111   73.86802
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales        2|22|222|4444   15.60912
k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales   2|22|333|5555   10.52286
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae  2|11|111|1111|77777     51.32768
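
One useful sanity check on a MetaPhlAn profile like the one above is that relative abundances sum to approximately 100% at each taxonomic rank, where rank is the number of "|"-separated clade levels. An illustrative sketch; sums_by_rank is a hypothetical helper, not part of MetaPhlAn or HUMAnN:

```python
from collections import defaultdict

def sums_by_rank(profile_lines):
    """Sum relative abundances per taxonomic depth (0 = kingdom)."""
    totals = defaultdict(float)
    for line in profile_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.split("\t")
        clade, abundance = fields[0], float(fields[2])
        totals[clade.count("|")] += abundance
    return dict(totals)

demo = [
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria\t2\t100.0",
    "k__Bacteria|p__Bacteroidetes\t2|11\t73.86802",
    "k__Bacteria|p__Firmicutes\t2|22\t26.13198",
]
rank_sums = sums_by_rank(demo)  # each rank should total ~100.0
```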

5. Intermediate temp output files

Ten intermediate temp output files will be created where:

$DIR = $OUTPUT_DIR/$SAMPLENAME_humann_temp/
$SAMPLENAME = basename of the fastq/fasta input file named $SAMPLE
$TRANSLATEDALIGN = translated alignment software selected (diamond)

NOTE: $SAMPLENAME can be set by the user with the option --output-basename <$NEWNAME>

1. Bowtie2 alignment results

@HD VN:1.0  SO:unsorted
@SQ SN:g__Ruminococcus.s__Ruminococcus_bromii|UniRef90_D4L6K4|UniRef50_R6U703   LN:540
r99491  0   g__Bacteroides.s__Bacteroides_stercoris|UniRef90_R6B629|UniRef50_R5RCC8 1015    42  151M    *   0   0   CTGACCGATTTATATGAGGA    zzzzzzzzzzzzzzzzzzzz
r99526  0   g__Parabacteroides.s__Parabacteroides_merdae|UniRef90_unknown|UniRef50_D9RX34   155 42  151M    *   0   0   AATTTTCTTCAAAAAATATA    zzzzzzzzzzzzzzzzzzzz
r99581  16  g__Bacteroides.s__Bacteroides_stercoris|UniRef90_unknown|UniRef50_R6SXR7    2503    42  151M    *   0   0   TCTTTTATGCAGGGGATATG    zzzzzzzzzzzzzzzzzzzz

2. Bowtie2 reduced alignment results

r93 gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468    99.0    100.0                                                   0
r113    gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468    99.0    100.0                                                   0   
r704    gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468    99.0    100.0                                                   0   
r663    gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468    99.0    100.0                                                   0   
r940    gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468    99.0    100.0                                                   0    

3. Bowtie2 index files


4. Unaligned reads after Bowtie2

>r4370
GGCGGACGATCTTGTCGCCCAGCCTGTAGCCTTTCTGGTACACCGTGATGACGGTGCCGCTCTCCTGCCCGTCCGTGGCGGGGATCTGCTGG
>r4398
TGCCCGGACAGGATCTTCTCTTTCGTACCGGGCATCATCTGCTCCATGATCTCCACGCCTCGCATGAACTTTTCAGAACGGGCAACGTAGGA

5. Custom ChocoPhlAn database

>gi|479150083|ref|NC_021013.1|:976220-976759|40518|g__Ruminococcus.s__Ruminococcus_bromii|UniRef90_D4L6K4|UniRef50_R6U703
ATGTTCTATGTATTTCTTGCAGAAGGCTTTGAAGAAACAGAGGCGCTTGCCCCCGTTGATGTAATGCGCAGGGCAAAGCT
TGATGTTAAAACAGTCGGTGTAACAGGCGAATGTGTTACAAGCTCACACGGTGTGCCTGTAAAAGCCGATATCACAATTG
ACAATATTGACCTTGACGATGTTCAGGGTGTTGTACTCCCCGGTGGTATGCCCGGAACTCTCAATCTTGAGGCAAACAAA
AAGGTTCTTGAGGCTGTTAAGTATAGCTGTGAAAACGGCAAAATCGTTGCCGCAATCTGTGCCGCTCCGTCAATTCTCGG

6. MetaPhlAn Bowtie2 output

r113    gi|224485636|ref|NZ_EQ973490.1|:c728571-728107
r559    gi|479185170|ref|NC_021030.1|:c1678719-1677127
r663    gi|512436175|ref|NZ_KE159463.1|:c142391-139122
r704    gi|423310881|ref|NZ_JH724270.1|:c220428-218656
r1086   gi|238922432|ref|NC_012781.1|:c1988048-1987140 

7. Translated alignment results

r2805   UniRef50_E2ZJD8|627 37.50   48  30  0   147 4   152 199 5e-06   40.0
r2805   UniRef50_E2ZHU1|963 37.50   48  30  0   147 4   152 199 7e-06   39.7
r2805   UniRef50_K1UMQ9|612 35.42   48  31  0   147 4   35  82  2e-05   38.5
r3036   UniRef50_K1TCN4|540 100.00  22  0   0   150 85  148 169 8e-10   52.8
r3036   UniRef50_UPI00046A4B12|696  35.42   48  30  1   149 6   88  134 1e-05   38.9
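
The translated alignment results above are blast-style tabular (m8) rows: query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, and bit score. A minimal, illustrative filter against the identity and e-value thresholds shown in the Configuration section (HUMAnN additionally applies a coverage filter, omitted here); passing_hits is a hypothetical helper:

```python
def passing_hits(m8_lines, min_identity=50.0, max_evalue=1.0):
    """Keep blast-m8 hits meeting identity and e-value thresholds."""
    kept = []
    for line in m8_lines:
        fields = line.split()
        identity, evalue = float(fields[2]), float(fields[10])
        if identity >= min_identity and evalue <= max_evalue:
            kept.append((fields[0], fields[1], identity))
    return kept

demo = [
    "r2805 UniRef50_E2ZJD8|627 37.50 48 30 0 147 4 152 199 5e-06 40.0",
    "r3036 UniRef50_K1TCN4|540 100.00 22 0 0 150 85 148 169 8e-10 52.8",
]
hits = passing_hits(demo)  # only the 100.00%-identity r3036 hit passes
```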

8. Translated alignment unaligned reads

>r4370
GGCGGACGATCTTGTCGCCCAGCCTGTAGCCTTTCTGGTACACCGTGATGACGGTGCCGCTCTCCTGCCCGTCCGTGGCGGGGATCTGCTGG
>r4398
TGCCCGGACAGGATCTTCTCTTTCGTACCGGGCATCATCTGCTCCATGATCTCCACGCCTCGCATGAACTTTTCAGAACGGGCAACGTAGGA

9. Log

03/16/2015 01:09:52 PM - humann.utilities - INFO: File ( demo.fastq ) is of format:  fastq
03/16/2015 01:09:52 PM - humann.config - INFO: Run config settings:
DATABASE SETTINGS
nucleotide database folder = data/chocophlan_DEMO
protein database folder = data/uniref_DEMO

Databases

HUMAnN uses two databases for alignment, ChocoPhlAn and UniRef. There are different formats of these databases. The demo formats of both are included in the HUMAnN install.

To see which databases are currently available, run (with example output included below):

$ humann_databases --available
HUMAnN Databases ( database : build = location )
chocophlan : DEMO = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/DEMO_chocophlan.tar.gz
chocophlan : full = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/full_chocophlan.tar.gz
uniref : DEMO_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref50_GO_filtered/uniref50_DEMO_diamond.tar.gz
uniref : diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref50_GO_filtered/uniref50_GO_filtered_diamond.tar.gz

To download a database, run:

$ humann_databases --download $DATABASE $BUILD $INSTALL_LOCATION

This will automatically update the HUMAnN configuration. If you do not have permissions to edit the config file, run with the option "--update-config no" and then include the locations of the databases when running HUMAnN.


Configuration

HUMAnN uses a configuration file to store user configuration settings. This configuration file is automatically updated when a database is installed.

To view the current settings of the configuration file, run (with example output included below):

$ humann_config --print
HUMAnN Configuration ( Section : Name = Value )
output_format : remove_stratified_output = False
output_format : output_max_decimals = 10
alignment_settings : prescreen_threshold = 0.5
alignment_settings : evalue_threshold = 1.0
alignment_settings : identity_threshold = 50.0
database_folders : nucleotide = data/chocophlan_DEMO
database_folders : protein = data/uniref_DEMO
run_modes : bypass_nucleotide_search = False
run_modes : verbose = False
run_modes : bypass_nucleotide_index = False
run_modes : bypass_translated_search = False
run_modes : resume = False
run_modes : threads = 1
run_modes : bypass_prescreen = False

To change a value in the configuration file, run:

$ humann_config --update $SECTION $NAME $VALUE 

The settings included in the configuration file plus others selected at run-time are printed to the beginning of the log file for each run.

Example log file configuration settings section:

03/16/2015 01:09:52 PM - humann.config - INFO: 
Run config settings: 

DATABASE SETTINGS
nucleotide database folder = data/chocophlan_DEMO
protein database folder = data/uniref_DEMO
pathways database file 1 = data/pathways/metacyc_reactions.uniref
pathways database file 2 = data/pathways/metacyc_pathways

RUN MODES
resume = False
verbose = True
bypass prescreen = False
bypass nucleotide index = False
bypass nucleotide search = False
bypass translated search = False
translated search = diamond
pick frames = off
threads = 4

ALIGNMENT SETTINGS
evalue threshold = 1.0
prescreen threshold = 0.5
identity threshold = 50.0

PATHWAYS SETTINGS
minpath = on
xipe = off

INPUT AND OUTPUT FORMATS
input file format = fastq
output file format = tsv
output max decimals = 10
remove stratified output = False
log level = DEBUG

Guides to HUMAnN utility scripts

humann_barplot

humann_barplot produces plots of stratified HUMAnN features. Note: unlike many other HUMAnN utilities, humann_barplot requires the Python scientific stack (notably matplotlib) to operate. humann_barplot includes many options for sorting and scaling data, as detailed in the --help menu. humann_barplot is also explored in greater detail in the HUMAnN Tutorial.

Here is an example of a HUMAnN barplot for a pathway (denitrification) that was preferentially enriched in Human Microbiome Project oral samples relative to other body sites. This figure uses many options from humann_barplot, including regrouping by genus, pseudolog scaling, and sorting samples by similarity and metadata:

page_DENITRIFICATION-PWY.png


humann_config

Used for viewing and setting HUMAnN defaults. See the Configuration section above for details.


humann_databases

Used for downloading ChocoPhlAn, translated search databases, and HUMAnN utility mapping files. See the Databases section above for details.


humann_gene_families_genus_level

See the Genus level gene families and pathways guide below.


humann_join_tables

This will join (merge) multiple single-sample output files into a single table with multiple samples. It can also be used to join multiple multi-sample output files into a single table.


humann_reduce_table

Used for collapsing joined MetaPhlAn taxonomic profiles to a single joint profile. See the Joint taxonomic profile guide below.


humann_regroup_table

HUMAnN genefamilies.tsv output can contain a very large number of features depending on the complexity of your underlying sample. One way to explore this information in a simplified manner is via HUMAnN's own pathway abundance output, which summarizes the values of member genes. However, this approach does not apply to gene families that are not associated with metabolic pathways.

To further simplify the exploration of gene family abundance data, users can regroup gene families into other functional categories using humann_regroup_table. This script takes as arguments a gene family abundance table and a mapping (groups) file that indicates which gene families belong to which groups. Out of the box, HUMAnN can regroup gene families to MetaCyc reactions (a step which is also used internally as part of MetaCyc pathway quantification). Users can download additional mapping files using the HUMAnN databases script:

$ humann_databases --download utility_mapping full $DIR

Where $DIR is the location where you'd like to store the mapping files. In addition to downloading the mapping files, this command also registers the mapping files with humann_regroup_table (running the script with the -h flag will reveal an expanded list of regrouping options).

Mappings are available for both UniRef90 and UniRef50 gene families to the following systems:

In most cases, mappings are directly inferred from the annotation of the corresponding UniRef centroid sequence in UniProt.

One exception to this rule is the "informative GO" (infogo1000) maps: these are informative subsets of GO computed from UniProt's annotations and the structure of the GO hierarchy specifically for HUMAnN (each informative GO term has >1,000 UniRef centroids annotated to it, but none of its progeny terms have >1,000 centroids so annotated).

Users are free to create and use additional mapping files and pass them to humann_regroup_table via the --custom flag. The format of a mapping file is:

group1  uniref1  uniref2  uniref3  ...
group2  uniref1  uniref5  ...

Where spaces between items above denote TABS. Mapping files can be gzip- or bzip2-compressed and will still be readable by humann_regroup_table. By default, feature abundances (such as gene families) are summed to produce group abundances.
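As a rough sketch of the regrouping logic (a simplified illustration, not humann_regroup_table's implementation; gene and group names below are made up), member gene-family abundances are summed per group, ungrouped genes are pooled as UNGROUPED, and UNMAPPED is carried forward:

```python
# Illustrative sketch of regrouping gene families into groups by summing
# member abundances (simplified; not the humann_regroup_table code).

def regroup(gene_abundances, mapping):
    groups = {}
    grouped_genes = set()
    for group, members in mapping.items():
        total = sum(gene_abundances.get(g, 0.0) for g in members)
        if total > 0:
            groups[group] = total
        grouped_genes.update(members)
    # Genes that belong to no group are pooled as UNGROUPED
    groups["UNGROUPED"] = sum(v for g, v in gene_abundances.items()
                              if g not in grouped_genes and g != "UNMAPPED")
    # UNMAPPED, if present, is carried forward unchanged
    if "UNMAPPED" in gene_abundances:
        groups["UNMAPPED"] = gene_abundances["UNMAPPED"]
    return groups

abund = {"uniref1": 2.0, "uniref2": 1.0, "uniref5": 4.0,
         "uniref9": 3.0, "UNMAPPED": 10.0}
mapping = {"group1": ["uniref1", "uniref2", "uniref3"],
           "group2": ["uniref1", "uniref5"]}
print(regroup(abund, mapping))
# group1 = 3.0, group2 = 6.0, UNGROUPED = 3.0, UNMAPPED = 10.0
```

Note that a gene family can contribute to more than one group (uniref1 above is counted in both group1 and group2), which is expected when grouping schemes overlap.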

If the "UNMAPPED" gene abundance feature is included in a user's input to humann_regroup_table, it will automatically be carried forward to the final output. In addition, genes that do not group with a non-trivial feature are combined as an "UNGROUPED" group. By default, UNGROUPED reflects the total abundance of genes that did not belong to another group (similar in spirit to the "UNINTEGRATED" value reported in the pathway abundance file).

Some groups are not associated by default with human-readable names. To attach names to a regrouped table, use the humann_rename_table script. (The "GO" name map can be used for both raw GO and informative GO.)


humann_rename_table

By default, larger HUMAnN outputs do not attach names (glosses) to individual feature IDs to keep file sizes down. This script is used to attach those names. The most common use for this script is attaching gene names to UniRef90/UniRef50 features in the default genefamilies.tsv output files.

Features without names (common for many UniRef90/50 families) are assigned the default name: NO_NAME.


humann_renorm_table

HUMAnN quantifies genes and pathways in units of RPKs (reads per kilobase). These account for gene length but not sample sequencing depth. While there are some applications, e.g. strain profiling, where RPK units are superior to depth-normalized units, most of the time a user will renormalize their samples prior to downstream analysis.

This script provides the choice to normalize to relative abundance or copies per million (CPM) units. Both of these represent "total sum scaling (TSS)"-style normalization: in the former case, each sample is constrained to sum to 1, whereas in the latter case (CPMs) samples are constrained to sum to 1 million. Units out of 1 million are often more convenient for tables with many, many features (such as genefamilies.tsv tables).

Note: CPM as used here does not refer to unnormalized COUNTS per million, but rather copies per million. CPMs as used here are a generic analog of the TPM (transcript per million) unit in RNA-seq. You may wish to use the abbreviation CoPM for added clarity.

By default, this script normalizes all stratification levels to the sum of all community feature totals, but other options (such as level-wise normalization) are supported. "Special" features (such as UNMAPPED) can be included or excluded in the normalization process.
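The two normalization units can be sketched as follows (a minimal illustration of total-sum scaling, not the humann_renorm_table implementation, which additionally handles stratified rows and special features; feature names and values are made up):

```python
# Sketch of total-sum scaling (TSS): renormalize RPK values to relative
# abundance (sums to 1) or copies per million (sums to 1 million).

def renormalize(rpks, units="cpm"):
    total = sum(rpks.values())
    scale = 1e6 if units == "cpm" else 1.0
    return {feat: scale * v / total for feat, v in rpks.items()}

rpks = {"PWY-1": 30.0, "PWY-2": 50.0, "PWY-3": 20.0}
relab = renormalize(rpks, "relab")  # sample constrained to sum to 1
cpm = renormalize(rpks, "cpm")      # sample constrained to sum to 1e6
print(relab)
print(cpm)
```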


humann_split_stratified_table

This utility will split a table into two files (one stratified and one unstratified).
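The split can be sketched as a one-pass filter: stratified rows carry a "|" separating the community feature from its contributing taxon, unstratified rows do not. (A toy illustration, not the humann_split_stratified_table implementation.)

```python
# Sketch: partition table rows into stratified and unstratified subsets
# based on the presence of "|" in the feature ID.

def split_stratified(rows):
    stratified = [r for r in rows if "|" in r[0]]
    unstratified = [r for r in rows if "|" not in r[0]]
    return stratified, unstratified

rows = [
    ("UniRef90_A", 12.0),
    ("UniRef90_A|g__Bacteroides.s__Bacteroides_caccae", 7.0),
    ("UniRef90_A|unclassified", 5.0),
]
strat, unstrat = split_stratified(rows)
```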


humann_split_table

This utility will split a merged feature table (multiple samples) into one file per sample. Some analyses can only accept one sample at a time.


humann_test

Utility for executing HUMAnN tests. See Test the install above for further details.


humann_unpack_pathways

This utility will unpack the pathways to show the genes for each. It adds another level of stratification to the pathway abundance table by including the gene families (or EC) abundances.


Other HUMAnN guides

Selecting a level of gene family resolution

HUMAnN uses UniRef protein clusters as a gene families system. UniRef clusters are constructed by clustering proteins from UniProt to remove redundancy (UniRef100), further clustering non-redundant proteins at 90% identity and selecting representative sequences (UniRef90), and further clustering UniRef90 representative sequences at 50% identity to produce broader clusters (UniRef50). The representative of a given UniRef cluster is generally the best-annotated member of the cluster (which may or may not be the true centroid of the cluster). Additional information about UniRef can be found at the UniRef website and in the original UniRef publication.


Selecting a scope for translated search

HUMAnN falls back to translated search for reads that fail to align to a known pangenome. The scope of this translated search can be tuned, trading off the fraction of unclassified reads mapped against performance. There are three possible modes:

See above for instructions on how to acquire a translated search database.

Which translated search scope should I pick?

For many users, filtered translated search will serve as a good default option. This will still provide 1,000s of gene-level features and an uncompromised metabolic network for the unclassified stratum. For extremely well-characterized samples, bypassing translated search is a reasonable option. For example, in the healthy human gut, the majority of sample reads (~80%) that can be mapped to a reference are mapped prior to translated search.


HUMAnN and paired-end sequencing data

End-pairing relationships are currently not taken into account during HUMAnN's alignment steps because HUMAnN aligns reads strictly to isolated coding sequences: either at the nucleotide level or through translated search. As such, it will frequently be the case that one read (READ1) maps inside a given coding sequence while its mate-pair (READ2) does not.

Example:

GENEGENEGENE
     READ1-------READ2

Penalizing such cases would be overly strict: in the absence of the gene's genomic context, this looks like a perfectly reasonable alignment (READ2 may fall in a non-coding region and not align, or it may align to another [isolated] coding sequence). As a result, the best way to use paired-end sequencing data with HUMAnN is simply to concatenate all reads into a single FASTA or FASTQ file.

If you have paired-end Illumina sequencing data processed by CASAVA v1.8+, the sequence identifiers for the pairs are identical after truncating at the space. To keep track of the individual reads throughout the HUMAnN workflow, the software will initially remove the spaces from the identifiers. This prevents read pairs from having the same id after being processed by bowtie2 and diamond. If this is the case for your read set, you will see an informative message printed to the screen and the log file indicating the removal of spaces from the sequence identifiers.
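The concatenation and identifier handling described above can be sketched as follows. This is only a toy illustration (HUMAnN performs the space removal itself, and concat_pairs is a hypothetical helper operating on lists of FASTQ lines for simplicity):

```python
# Sketch: concatenate R1 and R2 reads into one FASTQ stream while
# removing spaces from CASAVA v1.8+ headers, so mates keep distinct IDs.

def concat_pairs(fastq1_lines, fastq2_lines):
    out = []
    for lines in (fastq1_lines, fastq2_lines):
        for i, line in enumerate(lines):
            if i % 4 == 0:  # header line: drop spaces to keep IDs unique
                line = line.replace(" ", "")
            out.append(line)
    return out

r1 = ["@M01:1:000:1 1:N:0:1", "ACGT", "+", "IIII"]
r2 = ["@M01:1:000:1 2:N:0:1", "TTGC", "+", "IIII"]
merged = concat_pairs(r1, r2)
```

After space removal the two mates have distinct identifiers, so neither is lost downstream.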


PICRUSt output

You can run HUMAnN with PICRUSt output from predict_metagenomes.py or metagenome_contributions.py as input. Output from metagenome_contributions.py can include taxonomy information which will be used by HUMAnN. The steps that follow are the same for output files from either PICRUSt script.

For information on the PICRUSt software, please see the project website. For questions about PICRUSt, please contact the PICRUSt user group. The instructions that follow are for PICRUSt version 1.0.0-dev.

If you are running HUMAnN with PICRUSt output as input, please follow these steps:

  1. Download the legacy KEGG databases included with HUMAnN 1.0

    • The databases will be referred to in steps that follow with the path "humann1/data/*".
  2. Split the picrust output file (picrust.biom or picrust.tsv) into a single file per sample (written to $OUTPUT_DIR)

    • $ humann_split_table --input picrust.biom --output $OUTPUT_DIR
    • The option --taxonomy_index -1 can be added if taxonomy information is included in the biom input file with column -1 associated with KOs.
    • If using biom input files, biom version 2.1+ must be installed.
  3. Run HUMAnN on each of the new files in $OUTPUT_DIR placing the results in $OUTPUT_DIR2

    • for $SAMPLE.biom in $OUTPUT_DIR
      • $ humann --input $SAMPLE.biom --output $OUTPUT_DIR2 --pathways-database humann1/data/keggc
    • The option --remove-stratified-output can be added if you do not want the data stratified by bug.
    • The option --output-format biom can be added if you want the output to be in biom format.
    • To run with the kegg modules instead of kegg pathways provide the modules file --pathways-database humann1/data/modulec.
    • The input file can be in biom or tsv format.
  4. Join the pathways data files from the HUMAnN runs from all samples into two files

    • $ humann_join_tables --input $OUTPUT_DIR2 --output humann_4_pathabundance.tsv --file_name pathabundance
    • If the files being joined in this step are in biom format, the output file will also be in biom format.

Please note the flag --verbose can be added to all commands.


Joint taxonomic profile

A joint taxonomic profile can be created from all of the samples in your set. To create this file and use it for your HUMAnN runs, please use the steps that follow.

  1. Create taxonomic profiles for each of the samples in your set with MetaPhlAn

  2. Join all of the taxonomic profiles, located in directory $DIR, into a table of taxonomic profiles for all samples (joined_taxonomic_profile.tsv)

    • $ humann_join_tables --input $DIR --output joined_taxonomic_profile.tsv
  3. Reduce this file into a taxonomic profile that represents the maximum abundances from all of the samples in your set

    • $ humann_reduce_table --input joined_taxonomic_profile.tsv --output max_taxonomic_profile.tsv --function max --sort-by level
  4. Run HUMAnN on all of the samples in your set, providing the max taxonomic profile

    • for $SAMPLE.fastq in samples
      • $ humann --input $SAMPLE.fastq --output $OUTPUT_DIR --taxonomic-profile max_taxonomic_profile.tsv
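The reduce step (step 3) can be sketched as taking the per-taxon maximum across samples. This is a simplified illustration of what humann_reduce_table --function max computes, not its implementation; the taxa and values below are made up:

```python
# Sketch: collapse a joined taxonomic table to the per-taxon maximum
# abundance observed across all samples.

def reduce_max(joined):
    """joined: dict of taxon -> list of per-sample abundances."""
    return {taxon: max(values) for taxon, values in joined.items()}

joined = {
    "g__Bacteroides|s__Bacteroides_caccae": [11.4, 0.0, 3.2],
    "g__Dialister|s__Dialister_invisus": [0.0, 10.5, 1.1],
}
reduced = reduce_max(joined)
print(reduced)
```

Using the maximum ensures that any species detected in any sample is included when building the custom ChocoPhlAn database.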

An alternative to step 4, which saves compute time, is to first run a single sample with the taxonomic profile. The HUMAnN temp output folder for this sample will contain the bowtie2-indexed custom ChocoPhlAn database, which can then be provided when running your remaining samples, so the database is only built once. Please see the steps below for the alternative to step 4.

  1. Run HUMAnN on one of your samples ($SAMPLE_1.fastq) providing the max taxonomic profile to create the custom indexed ChocoPhlAn database

    • $ humann --input $SAMPLE_1.fastq --output $OUTPUT_DIR --taxonomic-profile max_taxonomic_profile.tsv
    • The folder $OUTPUT_DIR/$SAMPLE_1_humann_temp/ will contain the custom indexed ChocoPhlAn database files
  2. Run HUMAnN on the rest of your samples providing the custom indexed ChocoPhlAn database ($OUTPUT_DIR/$SAMPLE_1_humann_temp/)

    • for $SAMPLE.fastq in samples
      • $ humann --input $SAMPLE.fastq --output $OUTPUT_DIR --nucleotide-database $OUTPUT_DIR/$SAMPLE_1_humann_temp/ --bypass-nucleotide-index

Custom taxonomic profile

A custom taxonomic profile can be created to specify the taxa included in your samples. This file is used by HUMAnN to create the custom ChocoPhlAn database for your samples.

The custom taxonomic profile must be in a tab-delimited format and contain two columns (taxon, with both genus and species, and percent abundance). An example follows:

g__Bacteroides|s__Bacteroides_thetaiotaomicron  12.16326
g__Bacteroides|s__Bacteroides_cellulosilyticus  12.02768
g__Bacteroides|s__Bacteroides_caccae    11.43394
g__Dialister|s__Dialister_invisus   10.52286
g__Bacteroides|s__Bacteroides_stercoris 10.42227

HUMAnN uses the taxonomic profile to select pangenomes for the custom ChocoPhlAn database from the full ChocoPhlAn database. For example, the first line in the example above will add the centroids from Bacteroides thetaiotaomicron to the custom ChocoPhlAn database. Please note the taxa in the custom taxonomic profile must match the naming convention used by the full ChocoPhlAn database (ignoring case).

To run HUMAnN with the custom taxonomic profile ($FILE), use the option "--taxonomic-profile $FILE". This will bypass running MetaPhlAn, which creates a taxonomic profile, and instead will use the custom taxonomic profile provided. From the custom taxonomic profile, only those taxa that have a percent abundance greater than the default prescreen threshold will be considered. To change the default setting for the prescreen threshold to $THRESHOLD, use the option "--prescreen-threshold $THRESHOLD".
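The profile parsing and prescreen filtering described above can be sketched as follows (a toy illustration of the idea only; parse_profile is a hypothetical helper, not part of HUMAnN):

```python
# Sketch: parse a two-column custom taxonomic profile and keep only the
# taxa whose percent abundance exceeds the prescreen threshold.

def parse_profile(text, threshold=0.5):
    kept = {}
    for line in text.strip().splitlines():
        taxon, abundance = line.split("\t")
        if float(abundance) > threshold:
            kept[taxon] = float(abundance)
    return kept

profile = (
    "g__Bacteroides|s__Bacteroides_thetaiotaomicron\t12.16326\n"
    "g__Dialister|s__Dialister_invisus\t0.40000\n"
)
kept = parse_profile(profile, threshold=0.5)
print(kept)  # only the taxon above the threshold survives
```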


Custom nucleotide reference database

A custom nucleotide reference database can be provided to HUMAnN.

This custom database must be formatted as a bowtie2 index.

Please see the Custom reference database annotations section for information on database annotations. Also please note, only alignments to genes included in the pathways databases will be considered in the pathways computations. The pathways databases included with HUMAnN are for alignments to UniRef gene families. If you would like to create custom pathways databases for a different set of gene families, please see the Custom pathways database section for more information.

To run HUMAnN with your custom nucleotide reference database (located in $DIR), use the option "--bypass-nucleotide-index" and provide the custom database as the ChocoPhlAn option with "--nucleotide-database $DIR". If you would like to bypass the translated alignment portion of HUMAnN, add the option "--bypass-translated-search".


Custom protein reference database

A custom protein reference database can be provided to HUMAnN.

This custom database must be formatted to be used by the translated alignment software selected (the default is Diamond).

Please see the Custom reference database annotations section for information on database annotations. Also please note, only alignments to genes included in the pathways databases will be considered in the pathways computations. The pathways databases included with HUMAnN are for alignments to UniRef gene families. If you would like to create custom pathways databases for a different set of gene families, please see the Custom pathways database section for more information.

To run HUMAnN with your custom protein reference database (located in $DIR), provide the custom database as the UniRef option with "--protein-database $DIR". Please note, HUMAnN will run on all of the databases in this folder ($DIR) which have been formatted to be used by the translated alignment software selected. Also if you would like to bypass the nucleotide alignment portion of HUMAnN, add the option "--bypass-nucleotide-search".

Note: HUMAnN can work with a protein database that's been split into chunks for computational efficiency (mapping the reads to them serially). Therefore HUMAnN models a protein database as a folder with one or more database chunks in it (derived from the same input set of protein sequences). If you mix unrelated chunks in the same folder HUMAnN will (correctly) complain at you. Hence if you want to have two separate DBs, even if each is only represented by a single chunk, they should be in separate folders.


Custom reference database annotations

The annotations for sequences in a custom (nucleotide or protein) reference database can be as follows (delimited by "|"):

  1. gene_family
  2. gene_family|gene_length
  3. gene_family|gene_length|taxonomy
  4. identifier

For options 1 and 2, HUMAnN defaults will be used. The default gene length is 1,000 bases and the default taxonomy is "unclassified".

Option 4 should be used along with a custom reference database annotation file. The custom reference database annotation file maps the identifiers to annotations. This file must be in a tab-delimited format and contain at least two columns (identifier and gene family). At most four columns of information can be included to describe each reference sequence. These columns should be organized as identifier, gene family, gene length, and taxonomy. An example follows:

256402719   UniRef50_C9LQU5 147 g__Dialister.s__Dialister_invisus
479150083   UniRef50_R6U703 540 g__Ruminococcus.s__Ruminococcus_bromii
423220654   UniRef50_I8UUJ6 1218    g__Bacteroides.s__Bacteroides_caccae

The first line of the example will use the gene family UniRef50_C9LQU5, gene length 147, and taxon g__Dialister.s__Dialister_invisus for any sequences in your reference databases with the identifier 256402719.

To run HUMAnN with the custom reference database annotations ($FILE), use the option "--id-mapping $FILE".
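The four annotation options can be sketched as follows. This is an illustration of the fallback logic only (resolve_annotation is a hypothetical helper, not a HUMAnN function); the defaults shown (gene length 1,000 and taxonomy "unclassified") are the ones stated above:

```python
# Sketch: resolve a reference sequence annotation by splitting on "|",
# falling back to defaults when fields are absent; bare identifiers are
# resolved through an id-mapping table.

DEFAULT_LENGTH = 1000
DEFAULT_TAXONOMY = "unclassified"

def resolve_annotation(header, id_mapping=None):
    fields = header.split("|")
    if len(fields) == 1 and id_mapping and fields[0] in id_mapping:
        return id_mapping[fields[0]]  # option 4: identifier lookup
    family = fields[0]
    length = int(fields[1]) if len(fields) > 1 else DEFAULT_LENGTH
    taxonomy = fields[2] if len(fields) > 2 else DEFAULT_TAXONOMY
    return (family, length, taxonomy)

id_mapping = {"256402719": ("UniRef50_C9LQU5", 147,
                            "g__Dialister.s__Dialister_invisus")}
print(resolve_annotation("UniRef50_R6U703|540"))
print(resolve_annotation("256402719", id_mapping))
```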


Custom pathways database

The pathways databases included with HUMAnN (from MetaCyc and UniProt) have been created to be used with alignments to UniRef gene families. A custom pathways database can be provided to HUMAnN, specifically made to work with your custom reference database(s).

One or two pathways database files (in a comma-delimited list) can be provided to HUMAnN with the option "--pathways-database $FILE". If two files are provided, the first file provides a tab-delimited mapping while the second file provides the pathways mapping. For example, the first file could provide a mapping of gene families to reactions while the second file maps reactions to pathways.

The first file, which is optional, should include at least two columns per line. The first column is the item being mapped to (i.e., reactions in the example), while the remaining columns in the row are the gene families that map to that item. An example follows:

RXN-123 UniRef50_A0B6Z6 UniRef50_A3CRP6
RXN-456 UniRef50_A2RVM0 UniRef50_A4IGM4 UniRef50_A6NKP2 UniRef50_B8H806

The pathways file is a tab-delimited file with the pathway name in the first column; it can be in a structured or unstructured format. In a structured format, the second column lists the items (space-delimited) contained in the pathway, following the same structure definition as KEGG modules: a comma indicates alternatives, a plus indicates a complex, and a minus sign at the beginning of an item indicates that it is not a key item in the pathway. These items are gene families if only one pathways file is provided; if two files are provided, as in the example, they are reactions. In an unstructured format, the second column and any remaining columns in the row are the items that map to the pathway, one item per column. For example, a pathway containing four genes would occupy a line of five columns: the pathway name followed by one gene per column.

An example of a structured pathways file follows:

PWY-1   A B ( C , D )
PWY-2   ( ( A + B ) , ( C + D ) ) E

An example of an unstructured pathways file follows:

PWY-3   A   B   C   D
PWY-4   A   B   C   D   E
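As a rough illustration of the structured grammar above, the following sketch evaluates a structured pathway definition for presence/absence given a set of detected items. HUMAnN itself quantifies abundances (with MinPath and gap filling); this is only a toy parser, and it assumes tokens are space-delimited exactly as in the examples:

```python
# Toy evaluator for structured pathway definitions: space = AND,
# comma = OR (alternatives), plus = complex (also AND here), and a
# leading minus marks a non-key item, which this sketch treats as
# always satisfied.

def evaluate(structure, detected):
    tokens = structure.split()
    expr, _ = _parse_list(tokens, 0)
    return _eval(expr, detected)

def _parse_list(tokens, i):
    items = []
    while i < len(tokens) and tokens[i] != ")":
        if tokens[i] == "(":
            sub, i = _parse_list(tokens, i + 1)
            i += 1  # skip the closing ")"
            items.append(sub)
        else:
            items.append(tokens[i])
            i += 1
    return items, i

def _eval(expr, detected):
    # Split on "," into alternatives (OR); within an alternative,
    # every item (including "+"-joined complex members) is required.
    alternatives, current = [], []
    for item in expr:
        if item == ",":
            alternatives.append(current)
            current = []
        elif item == "+":
            continue  # complex members are ANDed like spaced items
        else:
            current.append(item)
    alternatives.append(current)

    def item_ok(it):
        if isinstance(it, list):
            return _eval(it, detected)
        if it.startswith("-"):
            return True  # non-key item: not required
        return it in detected

    return any(all(item_ok(it) for it in alt) for alt in alternatives)

print(evaluate("A B ( C , D )", {"A", "B", "D"}))                # True
print(evaluate("( ( A + B ) , ( C + D ) ) E", {"C", "D", "E"}))  # True
print(evaluate("( ( A + B ) , ( C + D ) ) E", {"A", "E"}))       # False
```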

Core diversity analysis with QIIME

HUMAnN output files can be provided to QIIME as input to run core diversity analysis. For information on the QIIME software, please see the project website. For questions about QIIME, please contact the QIIME user group. The instructions that follow are for QIIME version 1.9.1.

To run this analysis, run the following steps:

  1. Run HUMAnN on each of the samples (replacing $OUTPUT_DIR with the full path to the folder to write the output)

    • For each $SAMPLE.fastq in the set of all samples
      • $ humann --input $SAMPLE.fastq --output $OUTPUT_DIR --output-format biom --remove-stratified-output --output-max-decimals 0
      • Each sample will have two main output files ($SAMPLE_2_genefamilies.biom, $SAMPLE_4_pathabundance.biom) in biom format.
  2. Merge the two output files for each sample into two output files for all samples using QIIME's merge_otu_tables.py

    • For this example, assume there are 3 samples named $SAMPLE1.fastq, $SAMPLE2.fastq, and $SAMPLE3.fastq
      • $ merge_otu_tables.py -i $SAMPLE1_2_genefamilies.biom,$SAMPLE2_2_genefamilies.biom,$SAMPLE3_2_genefamilies.biom -o genefamilies_all.biom
      • $ merge_otu_tables.py -i $SAMPLE1_4_pathabundance.biom,$SAMPLE2_4_pathabundance.biom,$SAMPLE3_4_pathabundance.biom -o pathabundance_all.biom
  3. For each of the merged biom files, run QIIME's biom summarize-table to obtain the sampling depth (e-value) required as input for the next step.

    • $ biom summarize-table -i genefamilies_all.biom -o genefamilies_summary.txt
    • $ biom summarize-table -i pathabundance_all.biom -o pathabundance_summary.txt
  4. Next run each merged biom file through QIIME's core_diversity_analyses.py, providing the e-value from the prior step (replacing $EVALUE for each input file) and the QIIME mapping file you created (replacing $MAPPING_FILE with the full path to the mapping file).

    • $ core_diversity_analyses.py -i genefamilies_all.biom -o core_diversity_genefamilies -m $MAPPING_FILE --nonphylogenetic_diversity -e $EVALUE --suppress_taxa_summary
    • $ core_diversity_analyses.py -i pathabundance_all.biom -o core_diversity_pathabundance -m $MAPPING_FILE --nonphylogenetic_diversity -e $EVALUE --suppress_taxa_summary

The output of QIIME's core_diversity_analyses.py will include PCoA plots of beta diversity analyses which can be viewed in the beta diversity folder, in the bray_curtis_emperor_pcoa_plot folder. Click on index.html, which will open a web browser with the plot. Using the tools on the right side, it is possible to change the colors by which the samples are labeled.

Under the color tab, there are two variables that can be changed:

  1. the actual colors (Classic QIIME Colors is the default)
  2. the mapping file category by which the samples are labeled.

The more categories in your mapping file, the more options you have to see if your samples are separating based on that feature.


Metatranscriptome analysis with HUMAnN

The recommended HUMAnN metatranscriptome (RNA) analysis protocol differs depending on whether or not you have metagenomic (DNA) reads available from the same biological sample.

Analyzing a metatranscriptome with a paired metagenome. In this case, we recommend analyzing the metagenome first. Then, rather than constructing a taxonomic profile directly from the metatranscriptome, use the taxonomic profile of the corresponding metagenome as an additional input to HUMAnN via the --taxonomic-profile flag. This will guarantee that RNA reads are mapped to any species' pangenomes detected in the metagenome. RNA reads are otherwise provided as input to HUMAnN just as DNA reads are. HUMAnN RNA-level outputs (e.g. transcript family abundance) can then be normalized by corresponding DNA-level outputs to quantify microbial expression independent of gene copy number. CAVEAT: For low-abundance species, random sampling may lead to detection of transcripts for undetected genes. In these cases, we recommend smoothing DNA-level features to avoid divide-by-zero errors during normalization.
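The DNA normalization and smoothing described above can be sketched as follows. This is an illustrative pseudocount approach only (rna_over_dna is a hypothetical helper, and the smoothing constant is an arbitrary choice, not a HUMAnN default):

```python
# Sketch: normalize RNA-level features by paired DNA-level features,
# adding a pseudocount so genes with transcripts but no detected DNA
# copies do not cause divide-by-zero errors.

def rna_over_dna(rna, dna, pseudocount=1.0):
    features = set(rna) | set(dna)
    return {f: (rna.get(f, 0.0) + pseudocount) /
               (dna.get(f, 0.0) + pseudocount)
            for f in features}

rna = {"UniRef90_A": 9.0, "UniRef90_B": 0.0, "UniRef90_C": 3.0}
dna = {"UniRef90_A": 4.0, "UniRef90_B": 1.0}  # no DNA copies of C seen
ratios = rna_over_dna(rna, dna)
print(ratios["UniRef90_A"])  # 2.0
print(ratios["UniRef90_C"])  # 4.0: defined despite zero DNA, via smoothing
```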

Analyzing an isolated metatranscriptome (without a paired metagenome). In this case, analyze RNA read data just as you would DNA data (provided as a fasta/fastq file). CAVEAT 1: Note that species are quantified in HUMAnN based on recruitment of reads to species-specific marker genes. While each genome copy is assumed to donate ~1 copy of each marker to metagenome (DNA) data, the same assumption cannot be made for RNA data (markers may be more or less transcribed within a species compared to the species average). As long as a non-trivial fraction of a species' markers are expressed, HUMAnN will still detect that species in the transcript pool. However, species relative abundance estimates from the taxonomic profile must be interpreted carefully: these values reflect species' relative contributions to the pool of species-specific transcripts, and not the overall transcript pool. CAVEAT 2: Transcript abundance inferred from a lone metatranscriptome is confounded with underlying gene copy number. For example, transcript X may be more abundant in sample A relative to sample B because (i) the same number of underlying X genes are more highly expressed in sample A relative to sample B or (ii) there are more copies of gene X in sample A relative to sample B (all of which are equally expressed). This is a general challenge in analyzing isolated metatranscriptomes (not specific to HUMAnN).


Genus level gene families and pathways

By default, the gene families and pathways output files from HUMAnN are species level. To obtain genus level gene families and pathways, follow these steps.

  1. Create a genus level gene families file

    • $ humann_gene_families_genus_level --input $SAMPLE_2_genefamilies.tsv --output $SAMPLE_genefamilies_genus_level.tsv
    • In this command, replace $SAMPLE_2_genefamilies.tsv with the species level gene families file created by default by HUMAnN and $SAMPLE_genefamilies_genus_level.tsv with the name of the gene families genus level file that will be created.
  2. Run HUMAnN, with the genus level gene families file as input, to get genus level pathways output files

    • $ humann --input $SAMPLE_genefamilies_genus_level.tsv --output humann_genus_level_output
    • This run will be much faster and require less memory than the original run as HUMAnN is provided gene family abundances so it only needs to compute the pathways.

FAQs

HUMAnN frequently asked questions:

  1. Is there a way to print more information to stdout during the run?
    • Yes, add the --verbose flag
  2. How do I make use of multiple cores on the same machine?
    • Add the --threads $CORES option
  3. How do I remove the intermediate temp output files?
    • Add the --remove-temp-output flag
  4. Can I provide an alternative location for the ChocoPhlAn database?
    • Yes, use the --nucleotide-database $DIR option
  5. Can I provide an alternative location for the UniRef database?
    • Yes, use the --protein-database $DIR option
  6. I already have MetaPhlAn output. Can I start HUMAnN with the MetaPhlAn output?
    • Yes, use the --taxonomic-profile file.tsv option
  7. Is there a way to change $SAMPLENAME in the output file names?
    • Yes, use the --output-basename $NAME option
  8. How do I remove the stratification by bug from the output files?
    • Add the --remove-stratified-output flag
  9. How do I run with the UniPathway pathways database?
    • Add the --pathways unipathway option
  10. Is there a way to output files in biom format?
    • Yes, use the --output-format biom option
  11. Can I change the identity threshold for alignments?
    • Yes, use the --identity-threshold <50.0> option
  12. Can I change the options provided to MetaPhlAn?
    • Yes, use the --metaphlan-options="-t rel_ab" option
    • Please note the special formatting required for this option

Complete option list

usage: humann [-h] -i <input.fastq> -o <output> [--threads <1>] [--version]
              [-r] [--bypass-nucleotide-index] [--bypass-nucleotide-search]
              [--bypass-prescreen] [--bypass-translated-search]
              [--taxonomic-profile <taxonomic_profile.tsv>]
              [--memory-use {minimum,maximum}]
              [--input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}]
              [--metaphlan <metaphlan>]
              [--metaphlan-options <metaphlan_options>]
              [--prescreen-threshold <0.5>] [--bowtie2 <bowtie2>]
              [--bowtie-options <bowtie_options>]
              [--nucleotide-database <nucleotide_database>]
              [--nucleotide-identity-threshold <0.0>]
              [--nucleotide-query-coverage-threshold <90.0>]
              [--nucleotide-subject-coverage-threshold <50.0>]
              [--diamond <diamond>] [--diamond-options <diamond_options>]
              [--evalue <1.0>] [--pick-frames {on,off}]
              [--protein-database <protein_database>]
              [--translated-alignment {diamond}]
              [--translated-identity-threshold <50.0>]
              [--translated-query-coverage-threshold <90.0>]
              [--translated-subject-coverage-threshold <50.0>]
              [--gap-fill {on,off}] [--minpath {on,off}]
              [--pathways {metacyc,unipathway}]
              [--pathways-database <pathways_database.tsv>] [--xipe {on,off}]
              [--annotation-gene-index <3>] [--id-mapping <id_mapping.tsv>]
              [--remove-temp-output]
              [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
              [--o-log <sample_0.log>] [--output-basename <sample_name>]
              [--output-format {tsv,biom}] [--output-max-decimals <10>]
              [--remove-column-description-output]
              [--remove-stratified-output]

HUMAnN : HMP Unified Metabolic Analysis Network 

optional arguments:
  -h, --help            show this help message and exit

[0] Common settings:
  -i <input.fastq>, --input <input.fastq>
                        input file of type {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom} 
                        [REQUIRED]
  -o <output>, --output <output>
                        directory to write output files
                        [REQUIRED]
  --threads <1>         number of threads/processes
                        [DEFAULT: 1]
  --version             show program's version number and exit

[1] Workflow refinement:
  -r, --resume          bypass commands if the output files exist
  --bypass-nucleotide-index
                        bypass the nucleotide index step and run on the indexed ChocoPhlAn database
  --bypass-nucleotide-search
                        bypass the nucleotide search steps
  --bypass-prescreen    bypass the prescreen step and run on the full ChocoPhlAn database
  --bypass-translated-search
                        bypass the translated search step
  --taxonomic-profile <taxonomic_profile.tsv>
                        a taxonomic profile (the output file created by metaphlan)
                        [DEFAULT: file will be created]
  --memory-use {minimum,maximum}
                        the amount of memory to use
                        [DEFAULT: minimum]
  --input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
                        the format of the input file
                        [DEFAULT: format identified by software]
  -v, --verbose         additional output is printed

[2] Configure tier 1: prescreen:
  --metaphlan <metaphlan>
                        directory containing the MetaPhlAn software
                        [DEFAULT: $PATH]
  --metaphlan-options <metaphlan_options>
                        options to be provided to the MetaPhlAn software
                        [DEFAULT: "-t rel_ab"]
  --prescreen-threshold <0.5>
                        minimum estimated genome coverage for inclusion in pangenome search
                        [DEFAULT: 0.5]

[3] Configure tier 2: nucleotide search:
  --bowtie2 <bowtie2>   directory containing the bowtie2 executable
                        [DEFAULT: $PATH]
  --bowtie-options <bowtie_options>
                        options to be provided to the bowtie software
                        [DEFAULT: "--very-sensitive"]
  --nucleotide-database <nucleotide_database>
                        directory containing the nucleotide database
                        [DEFAULT: /home/ljmciver/miniconda3/envs/humann3/lib/python3.6/site-packages/humann/data/chocophlan_DEMO]
  --nucleotide-identity-threshold <0.0>
                        identity threshold for nucleotide alignments
                        [DEFAULT: 0.0]
  --nucleotide-query-coverage-threshold <90.0>
                        query coverage threshold for nucleotide alignments
                        [DEFAULT: 90.0]
  --nucleotide-subject-coverage-threshold <50.0>
                        subject coverage threshold for nucleotide alignments
                        [DEFAULT: 50.0]

[3] Configure tier 2: translated search:
  --diamond <diamond>   directory containing the diamond executable
                        [DEFAULT: $PATH]
  --diamond-options <diamond_options>
                        options to be provided to the diamond software
                        [DEFAULT: "--top 1 --outfmt 6"]
  --evalue <1.0>        the evalue threshold to use with the translated search
                        [DEFAULT: 1.0]
  --protein-database <protein_database>
                        directory containing the protein database
                        [DEFAULT: /home/ljmciver/miniconda3/envs/humann3/lib/python3.6/site-packages/humann/data/uniref_DEMO]
  --translated-alignment {diamond}
                        software to use for translated alignment
                        [DEFAULT: diamond]
  --translated-identity-threshold <50.0>
                        identity threshold for translated alignments
                        [DEFAULT: 50.0]
  --translated-query-coverage-threshold <90.0>
                        query coverage threshold for translated alignments
                        [DEFAULT: 90.0]
  --translated-subject-coverage-threshold <50.0>
                        subject coverage threshold for translated alignments
                        [DEFAULT: 50.0]
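
A translated-search configuration combining the options above might look like this (paths are placeholders; the stricter e-value and identity values are examples only):

```shell
# Tighten the tier-3 translated search: lower e-value cutoff and a
# higher identity threshold for DIAMOND alignments.
humann --input sample.fastq --output sample_results \
    --evalue 0.001 \
    --translated-identity-threshold 80.0 \
    --translated-query-coverage-threshold 90.0
```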

[5] Gene and pathway quantification:
  --gap-fill {on,off}   turn on/off the gap fill computation
                        [DEFAULT: on]
  --minpath {on,off}    turn on/off the minpath computation
                        [DEFAULT: on]
  --pathways {metacyc,unipathway}
                        the database to use for pathway computations
                        [DEFAULT: metacyc]
  --pathways-database <pathways_database.tsv>
                        mapping file (or files, at most two in a comma-delimited list) to use for pathway computations
                        [DEFAULT: metacyc database]
  --xipe {on,off}       turn on/off the xipe computation
                        [DEFAULT: off]
  --annotation-gene-index <3>
                        the index of the gene in the sequence annotation
                        [DEFAULT: 3]
  --id-mapping <id_mapping.tsv>
                        id mapping file for alignments
                        [DEFAULT: alignment reference used]
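
To show how the quantification options fit together, here is a sketch of a run that switches the pathway database and disables gap filling (paths are placeholders; this combination is illustrative, not a recommendation):

```shell
# Quantify pathways against UniPathway instead of MetaCyc,
# with gap filling off and MinPath left at its default (on).
humann --input sample.fastq --output sample_results \
    --pathways unipathway \
    --gap-fill off \
    --minpath on
```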

[6] More output configuration:
  --remove-temp-output  remove temp output files
                        [DEFAULT: temp files are not removed]
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        level of messages to display in log
                        [DEFAULT: DEBUG]
  --o-log <sample_0.log>  log file
                        [DEFAULT: sample_0.log]
  --output-basename <sample_name>
                        the basename for the output files
                        [DEFAULT: input file basename]
  --output-format {tsv,biom}
                        the format of the output files
                        [DEFAULT: tsv]
  --output-max-decimals <10>
                        the number of decimals to output
                        [DEFAULT: 10]
  --remove-column-description-output
                        remove the description in the output column
                        [DEFAULT: output column includes description]
  --remove-stratified-output
                        remove stratification from output
                        [DEFAULT: output is stratified]
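
Finally, the output options above can be combined as in this sketch (file names are placeholders):

```shell
# Write BIOM-format output without per-organism stratification,
# clean up temp files, and log at INFO level to a named log file.
humann --input sample.fastq --output sample_results \
    --output-format biom \
    --remove-stratified-output \
    --remove-temp-output \
    --log-level INFO \
    --o-log sample.log
```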

Contributions

Thanks go to all of HUMAnN's wonderful contributors (listed on the repository page).

Support

If you have questions or concerns, please post to the HUMAnN category of the bioBakery Support Forum.

Additionally, the humann-users Google group (humann-users@googlegroups.com) remains available as a read-only archive.