Before opening a new issue here, please check the appropriate help channel on the bioBakery Support Forum (https://forum.biobakery.org) and consider opening or commenting on a thread there.
HUMAnN 3.0 is the next generation of HUMAnN 2.0 (the HMP Unified Metabolic Analysis Network).
If you use HUMAnN in your work, please cite the HUMAnN 3.0 paper:
Francesco Beghini (1), Lauren J McIver (2), Aitor Blanco-Míguez (1), Leonard Dubois (1), Francesco Asnicar (1), Sagun Maharjan (2,3), Ana Mailyan (2,3), Andrew Maltez Thomas (1), Paolo Manghi (1), Mireia Valles-Colomer (1), George Weingart (2,3), Yancong Zhang (2,3), Moreno Zolfo (1), Curtis Huttenhower (2,3), Eric A Franzosa (2,3), Nicola Segata (1,4)
1 Department CIBIO, University of Trento, Italy
2 Harvard T. H. Chan School of Public Health, Boston, MA, USA
3 The Broad Institute of MIT and Harvard, Cambridge, MA, USA
4 IEO, European Institute of Oncology IRCCS, Milan, Italy
And feel free to link to HUMAnN in your Methods:
http://huttenhower.sph.harvard.edu/humann
For additional information, read the HUMAnN Tutorial
HUMAnN is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. More generally, functional profiling answers the question "What are the microbes in my community-of-interest doing (or capable of doing)?"
Community functional profiles stratified by known and unclassified organisms
Considerably expanded databases of genomes, genes, and pathways
A simple user interface (single command driven flow)
Accelerated mapping of reads to reference databases (including run-time generated databases tailored to the input)
There are four different types of files that can be provided as input to HUMAnN. By default HUMAnN will determine the type of the file. As shown in the figure below, the type of input file will determine where HUMAnN will start the workflow. Files of type 2, 3, and 4 will begin the workflow after the alignment steps.
File Types:
There are multiple bypass options that will allow you to adjust the standard workflow.
Bypass options:
HUMAnN includes a "--resume" option that allows you to skip alignment steps that have already been completed. For example, if you originally ran with a bypass option, you can rerun with "--resume" to execute only the step you previously bypassed and then recompute the gene families and pathways.
When using the "--resume" option, the following steps will be bypassed if they have already been completed:
Please install the required software in a location in your $PATH, or provide the location with an optional argument to HUMAnN 3.0.
For example, the location of the Bowtie2 install ($BOWTIE2_DIR) can be provided with "--bowtie2 $BOWTIE2_DIR".
Some of the required dependencies are installed automatically when installing HUMAnN with pip. If these dependencies do not appear to be installed after installing HUMAnN with pip, it might be that your environment is set up to use wheels instead of installing from source. HUMAnN must be installed from source for it to also be able to install dependencies. To force pip to install HUMAnN from source, add one of the following options to your install command: "--no-use-wheel" or "--no-binary :all:".
If you always run with input files of type 2, 3, and 4 (for information on input file types, see section Workflow by input file type), MetaPhlAn, Bowtie2, and DIAMOND are not required. Also, if you always run with one or more bypass options (for information on bypass options, see section Workflow by bypass mode), the software required for the steps you bypass does not need to be installed.
If always running with files of type 2, 3, and 4 (for information on file types, see section Workflow by input file type), less disk space is required.
You can download the latest HUMAnN release or the development version. The source contains example files. If installing with pip, it is optional to first download the HUMAnN source.
Option 1: Latest Release (Recommended)
Option 2: Development Version
Create a clone of the repository:
$ git clone https://github.com/biobakery/humann
Note: Creating a clone of the repository requires Git to be installed.
$ pip install humann
Move to the HUMAnN directory
$ cd $HUMAnN_PATH
Install HUMAnN
$ python setup.py install
If you do not have write permissions to the default install location, add the option "--user" to the install command. This will install the python package into subdirectories of '~/.local'. Please note, when using the "--user" install option on some platforms, you might need to add '~/.local/bin/' to your $PATH as it might not be included by default. You will know it needs to be added if you see the following message: humann: command not found
when trying to run HUMAnN after installing with the "--user" option.
Test the install:
$ humann_test
With HUMAnN installed you can try out a demo run using reduced versions of the databases. If installing from pip, please also download the source as it contains the demo examples.
$ humann --input examples/demo.fastq --output $OUTPUT_DIR
Output from this demo run will be written to the folder $OUTPUT_DIR.
Please continue with the install directions to download the full databases before running with your sequencing data.
Downloading the databases is a required step if your input is a filtered shotgun sequencing metagenome file (fastq, fastq.gz, fasta, or fasta.gz format). If your input files will always be mapping results files (sam, bam or blastm8 format) or gene tables (tsv or biom format), you do not need to download the ChocoPhlAn and translated search databases.
Download the ChocoPhlAn database providing $INSTALL_LOCATION as the location to install the database (approximate size = 16.4 GB).
$ humann_databases --download chocophlan full $INSTALL_LOCATION
NOTE: The humann config file will be updated to point to this location for the default chocophlan database. If you move this database, please use the "humann_config" command to update the default location of this database. Alternatively, you can always provide the location of the chocophlan database you would like to use with the "--nucleotide-database <chocophlan>" option to humann.
While there is only one ChocoPhlAn database, users have a choice of translated search databases which offer trade-offs between resolution, coverage, and performance. For help selecting a translated search database, see the following guides:
Download a translated search database providing $INSTALL_LOCATION as the location to install the database:
To download the full UniRef90 database (20.7GB, recommended):
$ humann_databases --download uniref uniref90_diamond $INSTALL_LOCATION
To download the EC-filtered UniRef90 database (0.9GB):
$ humann_databases --download uniref uniref90_ec_filtered_diamond $INSTALL_LOCATION
To download the full UniRef50 database (6.9GB):
$ humann_databases --download uniref uniref50_diamond $INSTALL_LOCATION
To download the EC-filtered UniRef50 database (0.3GB):
$ humann_databases --download uniref uniref50_ec_filtered_diamond $INSTALL_LOCATION
NOTE: The humann config file will be updated to point to this location for the default uniref database. If you move this database, please use the "humann_config" command to update the default location of this database. Alternatively, you can always provide the location of the uniref database you would like to use with the "--protein-database <uniref>" option to humann.
Note: Please make sure that all of the databases in a directory are the same version. Databases of different versions must be kept in separate directories.
If you have already installed HUMAnN, using the Initial Installation steps, and would like to upgrade your installed version to the latest version, please follow these steps.
Since you have already downloaded the databases in the initial installation, you do not need to download the databases again unless there are new versions available. However, you will want to update your latest HUMAnN install to point to the databases you have downloaded as by default the new install configuration will point to the demo databases.
To update your HUMAnN configuration file to include the locations of your downloaded databases, please use the following steps.
Update the location of the ChocoPhlAn database (replacing $DIR with the full path to the directory containing the ChocoPhlAn database)
$ humann_config --update database_folders nucleotide $DIR
Update the location of the UniRef database (replacing $DIR with the full path to the directory containing the UniRef database)
$ humann_config --update database_folders protein $DIR
Please note, after a new installation, all of the settings in the configuration file, like the database folders, will be reset to the defaults. If you have any additional settings that differ from the defaults, please update them at this time.
For more information on the HUMAnN configuration file, please see the Configuration section.
$ humann --input $SAMPLE --output $OUTPUT_DIR
$SAMPLE = a single file that is one of the following types:
$OUTPUT_DIR = the output directory
Five main output files will be created:
$OUTPUT_DIR/$SAMPLENAME_0.log
$OUTPUT_DIR/$SAMPLENAME_1_metaphlan_profile.tsv
$OUTPUT_DIR/$SAMPLENAME_2_genefamilies.tsv
$OUTPUT_DIR/$SAMPLENAME_3_reactions.tsv
$OUTPUT_DIR/$SAMPLENAME_4_pathabundance.tsv
where $SAMPLENAME is the basename of $SAMPLE.
In HUMAnN v3, the log file and the taxonomic profile are written to the intermediate temp folder, and the output files do not include numbers in their names.
*The gene families file will not be created if the input file type is a gene table.
Intermediate temp files will also be created:
$DIR/$SAMPLENAME_bowtie2_aligned.sam
$DIR/$SAMPLENAME_bowtie2_aligned.tsv
$DIR/$SAMPLENAME_bowtie2_index
$DIR/$SAMPLENAME_bowtie2_unaligned.fa
$DIR/$SAMPLENAME_custom_chocophlan_database.ffn
$DIR/$SAMPLENAME_metaphlan_bowtie2.txt
$DIR/$SAMPLENAME_$TRANSLATEDALIGN_aligned.tsv
$DIR/$SAMPLENAME_$TRANSLATEDALIGN_unaligned.fa
$DIR=$OUTPUT_DIR/$SAMPLENAME_humann_temp/
$SAMPLENAME is the basename of the fastq/fasta input file
$TRANSLATEDALIGN is the translated alignment software selected (default: DIAMOND)
NOTE: $SAMPLENAME can be set by the user with the option --output-basename <$NEWNAME>.
The examples folder contains four demo example input files. These files are of fasta, fastq, sam, and blastm8 format. The blastm8 format is produced by translated alignment software such as DIAMOND.
To run the fasta demo:
$ humann --input examples/demo.fasta --output $OUTPUT_DIR
To run the fastq demo:
$ humann --input examples/demo.fastq --output $OUTPUT_DIR
To run the sam demo:
$ humann --input examples/demo.sam --output $OUTPUT_DIR
To run the blastm8 demo:
$ humann --input examples/demo.m8 --output $OUTPUT_DIR
$OUTPUT_DIR is the output directory
Since sam and blastm8 are mapping results, using these files as input to HUMAnN will bypass both the nucleotide and translated mapping portions of the flow.
The standard workflow involves running HUMAnN on each filtered shotgun sequencing metagenome file, normalizing, and then merging the output files.
Prior to running the workflow, filter your shotgun sequencing metagenome file. We recommend using KneadData for metagenome quality control including removal of host reads.
To run the standard workflow, follow these steps:
Run HUMAnN on each of your filtered fastq files in $INPUT_DIR, placing the results in $OUTPUT_DIR. For each $SAMPLE.fastq in $INPUT_DIR:
$ humann --input $SAMPLE.fastq --output $OUTPUT_DIR
Replace $SAMPLE.fastq with the name of the fastq input file and $OUTPUT_DIR with the full path to the folder to write output. This will create the output files $SAMPLE_2_genefamilies.tsv and $SAMPLE_4_pathabundance.tsv.
Normalize the abundance output files. For each $SAMPLE_2_genefamilies.tsv in $OUTPUT_DIR:
$ humann_renorm_table --input $SAMPLE_2_genefamilies.tsv --output $SAMPLE_2_genefamilies_relab.tsv --units relab
Join the output files (gene families and pathway abundance) from the HUMAnN runs of all samples into merged multi-sample tables:
$ humann_join_tables --input $OUTPUT_DIR --output humann_2_genefamilies.tsv --file_name genefamilies_relab
$ humann_join_tables --input $OUTPUT_DIR --output humann_4_pathabundance.tsv --file_name pathabundance_relab
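The per-sample HUMAnN runs above are easy to script. A minimal Python sketch (the demo_input/demo_output paths are placeholders; it prints the commands rather than executing them — pass each list to subprocess.run to run for real):

```python
from pathlib import Path

# Hypothetical demo locations; substitute your real $INPUT_DIR / $OUTPUT_DIR.
input_dir = Path("demo_input")
output_dir = Path("demo_output")
input_dir.mkdir(exist_ok=True)
for name in ("sample1.fastq", "sample2.fastq"):
    (input_dir / name).touch()

# Build one humann command per filtered fastq file (dry run: print only).
commands = [
    ["humann", "--input", str(fastq), "--output", str(output_dir)]
    for fastq in sorted(input_dir.glob("*.fastq"))
]
for cmd in commands:
    print(" ".join(cmd))
```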
Replace $OUTPUT_DIR with the full path to the folder containing the HUMAnN output files. This will create the files humann_2_genefamilies.tsv and humann_4_pathabundance.tsv.
When HUMAnN is run, the main output files will be created as follows (where $SAMPLENAME = the basename of $SAMPLE):
# Gene Family $SAMPLENAME_Abundance-RPKs
READS_UNMAPPED 187.0
UniRef50_unknown 150.0
UniRef50_unknown|g__Bacteroides.s__Bacteroides_fragilis 150.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon 67.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_fragilis 57.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_finegoldii 5.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_stercoris 4.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|unclassified 1.0
UniRef50_O83668: Fructose-bisphosphate aldolase 60.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_vulgatus 31.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_thetaiotaomicron 22.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_stercoris 7.0
$OUTPUT_DIR/$SAMPLENAME_2_genefamilies.tsv ( $OUTPUT_DIR/$SAMPLENAME_genefamilies.tsv in HUMAnN v3 )
# Reaction $SAMPLENAME_Abundance
READS_UNMAPPED 7573721.0000000000
UNGROUPED 22978726.4009404778
UNGROUPED|g__Bacteroides.s__Bacteroides_vulgatus 1961355.5951347812
UNGROUPED|g__Bacteroides.s__Bacteroides_uniformis 1579305.9455720731
UNGROUPED|unclassified 5100980.1244789278
2.7.13.3-RXN 75608.6164459985
2.7.13.3-RXN|g__Bacteroides.s__Bacteroides_uniformis 7962.2303221576
2.7.13.3-RXN|g__Faecalibacterium.s__Faecalibacterium_prausnitzii 5371.8462336209
2.7.13.3-RXN|unclassified 12529.4324849206
$OUTPUT_DIR/$SAMPLENAME_3_reactions.tsv ( file not written in HUMAnN v3 )
# Pathway $SAMPLENAME_Abundance
UNMAPPED 140.0
UNINTEGRATED 87.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_caccae 23.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_finegoldii 20.0
UNINTEGRATED|unclassified 12.0
PWY0-1301: melibiose degradation 57.5
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_caccae 32.5
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_finegoldii 4.5
PWY0-1301: melibiose degradation|unclassified 3.0
PWY-5484: glycolysis II (from fructose-6P) 54.7
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_caccae 16.7
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_finegoldii 8.0
PWY-5484: glycolysis II (from fructose-6P)|unclassified 6.0
$OUTPUT_DIR/$SAMPLENAME_4_pathabundance.tsv ( $OUTPUT_DIR/$SAMPLENAME_pathabundance.tsv in HUMAnN v3 )
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 100.0
k__Bacteria|p__Bacteroidetes 2|11 73.86802
k__Bacteria|p__Firmicutes 2|22 26.13198
k__Bacteria|p__Bacteroidetes|c__Bacteroidia 2|11|111 73.86802
k__Bacteria|p__Firmicutes|c__Clostridia 2|22|222 15.60912
k__Bacteria|p__Firmicutes|c__Negativicutes 2|22|333 10.52286
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 2|11|111|1111 73.86802
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales 2|22|222|4444 15.60912
k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales 2|22|333|5555 10.52286
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 2|11|111|1111|77777 51.32768
$OUTPUT_DIR/$SAMPLENAME_1_metaphlan_profile.tsv ( $DIR/$SAMPLENAME_metaphlan_profile.tsv in HUMAnN v3 )
Ten intermediate temp output files will be created, where:
$DIR = $OUTPUT_DIR/$SAMPLENAME_humann_temp/
$SAMPLENAME = basename of the fastq/fasta input file named $SAMPLE
$TRANSLATEDALIGN = translated alignment software selected (diamond)
NOTE: $SAMPLENAME can be set by the user with the option --output-basename <$NEWNAME>.
@HD VN:1.0 SO:unsorted
@SQ SN:g__Ruminococcus.s__Ruminococcus_bromii|UniRef90_D4L6K4|UniRef50_R6U703 LN:540
r99491 0 g__Bacteroides.s__Bacteroides_stercoris|UniRef90_R6B629|UniRef50_R5RCC8 1015 42 151M * 0 0 CTGACCGATTTATATGAGGA zzzzzzzzzzzzzzzzzzzz
r99526 0 g__Parabacteroides.s__Parabacteroides_merdae|UniRef90_unknown|UniRef50_D9RX34 155 42 151M * 0 0 AATTTTCTTCAAAAAATATA zzzzzzzzzzzzzzzzzzzz
r99581 16 g__Bacteroides.s__Bacteroides_stercoris|UniRef90_unknown|UniRef50_R6SXR7 2503 42 151M * 0 0 TCTTTTATGCAGGGGATATG zzzzzzzzzzzzzzzzzzzz
$DIR/$SAMPLENAME_bowtie2_aligned.sam
Example lines from the reduced alignment file (excluding header lines, which begin with @) are as follows:
r93 gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468 99.0 100.0 0
r113 gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468 99.0 100.0 0
r704 gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468 99.0 100.0 0
r663 gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468 99.0 100.0 0
r940 gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468 99.0 100.0 0
$DIR/$SAMPLENAME_bowtie2_aligned.tsv
$DIR/$SAMPLENAME_bowtie2_index*
>r4370
GGCGGACGATCTTGTCGCCCAGCCTGTAGCCTTTCTGGTACACCGTGATGACGGTGCCGCTCTCCTGCCCGTCCGTGGCGGGGATCTGCTGG
>r4398
TGCCCGGACAGGATCTTCTCTTTCGTACCGGGCATCATCTGCTCCATGATCTCCACGCCTCGCATGAACTTTTCAGAACGGGCAACGTAGGA
$DIR/$SAMPLENAME_bowtie2_unaligned.fa
>gi|479150083|ref|NC_021013.1|:976220-976759|40518|g__Ruminococcus.s__Ruminococcus_bromii|UniRef90_D4L6K4|UniRef50_R6U703
ATGTTCTATGTATTTCTTGCAGAAGGCTTTGAAGAAACAGAGGCGCTTGCCCCCGTTGATGTAATGCGCAGGGCAAAGCT
TGATGTTAAAACAGTCGGTGTAACAGGCGAATGTGTTACAAGCTCACACGGTGTGCCTGTAAAAGCCGATATCACAATTG
ACAATATTGACCTTGACGATGTTCAGGGTGTTGTACTCCCCGGTGGTATGCCCGGAACTCTCAATCTTGAGGCAAACAAA
AAGGTTCTTGAGGCTGTTAAGTATAGCTGTGAAAACGGCAAAATCGTTGCCGCAATCTGTGCCGCTCCGTCAATTCTCGG
$DIR/$SAMPLENAME_custom_chocophlan_database.ffn
r113 gi|224485636|ref|NZ_EQ973490.1|:c728571-728107
r559 gi|479185170|ref|NC_021030.1|:c1678719-1677127
r663 gi|512436175|ref|NZ_KE159463.1|:c142391-139122
r704 gi|423310881|ref|NZ_JH724270.1|:c220428-218656
r1086 gi|238922432|ref|NC_012781.1|:c1988048-1987140
$DIR/$SAMPLENAME_metaphlan_bowtie2.txt
r2805 UniRef50_E2ZJD8|627 37.50 48 30 0 147 4 152 199 5e-06 40.0
r2805 UniRef50_E2ZHU1|963 37.50 48 30 0 147 4 152 199 7e-06 39.7
r2805 UniRef50_K1UMQ9|612 35.42 48 31 0 147 4 35 82 2e-05 38.5
r3036 UniRef50_K1TCN4|540 100.00 22 0 0 150 85 148 169 8e-10 52.8
r3036 UniRef50_UPI00046A4B12|696 35.42 48 30 1 149 6 88 134 1e-05 38.9
$DIR/$SAMPLENAME_$TRANSLATEDALIGN_aligned.tsv
>r4370
GGCGGACGATCTTGTCGCCCAGCCTGTAGCCTTTCTGGTACACCGTGATGACGGTGCCGCTCTCCTGCCCGTCCGTGGCGGGGATCTGCTGG
>r4398
TGCCCGGACAGGATCTTCTCTTTCGTACCGGGCATCATCTGCTCCATGATCTCCACGCCTCGCATGAACTTTTCAGAACGGGCAACGTAGGA
$DIR/$SAMPLENAME_$TRANSLATEDALIGN_unaligned.fa
03/16/2015 01:09:52 PM - humann.utilities - INFO: File ( demo.fastq ) is of format: fastq
03/16/2015 01:09:52 PM - humann.config - INFO: Run config settings:
DATABASE SETTINGS
nucleotide database folder = data/chocophlan_DEMO
protein database folder = data/uniref_DEMO
$SAMPLENAME_0.log ( $SAMPLENAME.log in HUMAnN v3 )
HUMAnN uses two databases for alignment, ChocoPhlAn and UniRef. There are different formats of these databases. The demo formats of both are included in the HUMAnN install.
To see which databases are currently available, run (with example output included below):
$ humann_databases --available
HUMAnN Databases ( database : build = location )
chocophlan : DEMO = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/DEMO_chocophlan.tar.gz
chocophlan : full = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/full_chocophlan.tar.gz
uniref : DEMO_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref50_GO_filtered/uniref50_DEMO_diamond.tar.gz
uniref : diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref50_GO_filtered/uniref50_GO_filtered_diamond.tar.gz
To download a database, run:
$ humann_databases --download $DATABASE $BUILD $INSTALL_LOCATION
This will automatically update the HUMAnN configuration. If you do not have permissions to edit the config file, run with the option --update-config no and then include the locations of the databases when running HUMAnN.
HUMAnN uses a configuration file to store user configuration settings. This configuration file is automatically updated when a database is installed.
To view the current settings of the configuration file, run (with example output included below):
$ humann_config --print
HUMAnN Configuration ( Section : Name = Value )
output_format : remove_stratified_output = False
output_format : output_max_decimals = 10
alignment_settings : prescreen_threshold = 0.5
alignment_settings : evalue_threshold = 1.0
alignment_settings : identity_threshold = 50.0
database_folders : nucleotide = data/chocophlan_DEMO
database_folders : protein = data/uniref_DEMO
run_modes : bypass_nucleotide_search = False
run_modes : verbose = False
run_modes : bypass_nucleotide_index = False
run_modes : bypass_translated_search = False
run_modes : resume = False
run_modes : threads = 1
run_modes : bypass_prescreen = False
To change a value in the configuration file, run:
$ humann_config --update $SECTION $NAME $VALUE
The settings included in the configuration file plus others selected at run-time are printed to the beginning of the log file for each run.
Example log file configuration settings section:
03/16/2015 01:09:52 PM - humann.config - INFO:
Run config settings:
DATABASE SETTINGS
nucleotide database folder = data/chocophlan_DEMO
protein database folder = data/uniref_DEMO
pathways database file 1 = data/pathways/metacyc_reactions.uniref
pathways database file 2 = data/pathways/metacyc_pathways
RUN MODES
resume = False
verbose = True
bypass prescreen = False
bypass nucleotide index = False
bypass nucleotide search = False
bypass translated search = False
translated search = diamond
pick frames = off
threads = 4
ALIGNMENT SETTINGS
evalue threshold = 1.0
prescreen threshold = 0.5
identity threshold = 50.0
PATHWAYS SETTINGS
minpath = on
xipe = off
INPUT AND OUTPUT FORMATS
input file format = fastq
output file format = tsv
output max decimals = 10
remove stratified output = False
log level = DEBUG
$ humann_barplot --input $TABLE.tsv --focal-feature $FEATURE --outfile $FIGURE
$TABLE.tsv = a stratified HUMAnN output file
$FEATURE = feature from the table to plot (defaults to first feature)
$FIGURE = where to save the figure
Run with -h to see additional command line options.
humann_barplot produces plots of stratified HUMAnN features. Note: unlike many other HUMAnN utilities, humann_barplot requires the Python scientific stack (notably matplotlib) to operate. humann_barplot includes many options for sorting and scaling data, as detailed in the --help menu. humann_barplot is also explored in greater detail in the HUMAnN Tutorial.
Here is an example of a HUMAnN barplot for a pathway (denitrification) that was preferentially enriched in Human Microbiome Project oral samples relative to other body sites. This figure uses many options from humann_barplot, including regrouping by genus, pseudolog scaling, and sorting samples by similarity and metadata:
Used for viewing and setting HUMAnN defaults. See the Configuration section above for details.
Used for downloading ChocoPhlAn, translated search databases, and HUMAnN utility mapping files. See the Databases section above for details.
See the Genus level gene families and pathways guide below.
$ humann_join_tables --input $INPUT_DIR --output $TABLE
$INPUT_DIR = a directory containing gene/pathway tables (tsv or biom format)
$TABLE = the file to write the new single gene table (biom format if input is biom format)
--file_name $STR will only join gene tables with $STR in the file name
Run with -h to see additional command line options.
This will join (merge) multiple single-sample output files into a single table with multiple samples. It can also be used to join multiple multi-sample output files into a single table.
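The join can be pictured as an outer join on the feature column, filling missing features with zeros. A minimal Python sketch of that behavior on two toy single-sample tables (illustrative only, not the humann_join_tables implementation):

```python
import csv
import io

# Two single-sample gene family tables (tsv: feature, abundance), inlined
# as strings so the example is self-contained.
sample_tables = {
    "sample1": "# Gene Family\tsample1_Abundance-RPKs\nUniRef50_A\t10.0\nUniRef50_B\t5.0\n",
    "sample2": "# Gene Family\tsample2_Abundance-RPKs\nUniRef50_B\t2.0\nUniRef50_C\t7.0\n",
}

# Outer-join on the feature column; absent features default to 0.0.
names = sorted(sample_tables)
joined = {}
for i, name in enumerate(names):
    rows = list(csv.reader(io.StringIO(sample_tables[name]), delimiter="\t"))
    for feature, value in rows[1:]:  # skip the header row
        joined.setdefault(feature, [0.0] * len(names))[i] = float(value)

print("# Gene Family", *names, sep="\t")
for feature in sorted(joined):
    print(feature, *joined[feature], sep="\t")
```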
Used for collapsing joined MetaPhlAn taxonomic profiles to a single joint profile. See the Joint taxonomic profile guide below.
$ humann_regroup_table --input $TABLE --groups $GROUPS --output $TABLE2
$TABLE = gene/pathway table (tsv format)
$GROUPS = options for regrouping table features (ex. uniref50 to metacyc reaction)
$TABLE2 = regrouped gene/pathway table
Run with -h to see additional command line options.
HUMAnN genefamilies.tsv output can contain a very large number of features depending on the complexity of your underlying sample. One way to explore this information in a simplified manner is via HUMAnN's own pathway abundances, which summarize the values of their member genes. However, this approach does not apply to gene families that are not associated with metabolic pathways.
To further simplify the exploration of gene family abundance data, users can regroup gene families into other functional categories using humann_regroup_table
. This script takes as arguments a gene family abundance table and a mapping (groups) file that indicates which gene families belong to which groups. Out of the box, HUMAnN can regroup gene families to MetaCyc reactions (a step which is also used internally as part of MetaCyc pathway quantification). Users can download additional mapping files using the HUMAnN databases script:
$ humann_databases --download utility_mapping full $DIR
Where $DIR is the location where you'd like to store the mapping files. In addition to downloading the mapping files, this command also registers the mapping files with humann_regroup_table (running the script with the -h flag will reveal an expanded list of regrouping options).
Mappings are available for both UniRef90 and UniRef50 gene families to the following systems:
In most cases, mappings are directly inferred from the annotation of the corresponding UniRef centroid sequence in UniProt.
One exception to this is the "informative GO" (infogo1000) maps: these are informative subsets of GO computed from UniProt's annotations and the structure of the GO hierarchy specifically for HUMAnN (each informative GO term has >1,000 UniRef centroids annotated to it, but none of its progeny terms have >1,000 centroids so annotated).
Users are free to create and use additional mapping files and pass them to humann_regroup_table via the --custom mapping flag. The format of a mapping file is:
group1 uniref1 uniref2 uniref3 ...
group2 uniref1 uniref5 ...
Where spaces between items above denote TABs. Mapping files can be gzip- or bzip2-compressed and will still be readable by humann_regroup_table. By default, feature abundances (such as gene families) are summed to produce group abundances.
If the "UNMAPPED" gene abundance feature is included in a user's input to humann_regroup_table, it will automatically be carried forward to the final output. In addition, genes that do not group with a non-trivial feature are combined as an "UNGROUPED" group. By default, UNGROUPED reflects the total abundance of genes that did not belong to another group (similar in spirit to the "UNINTEGRATED" value reported in the pathway abundance file).
Some groups are not associated by default with human-readable names. To attach names to a regrouped table, use the humann_rename_table script. (The "GO" name map can be used for both raw GO and informative GO.)
$ humann_rename_table --input $TABLE --names $NAMES --output $TABLE2
$TABLE = gene/pathway table (tsv format)
$NAMES = feature type to be renamed (run with "-h" to see included options)
$TABLE2 = gene/pathway table with new names attached
Run with -h to see additional command line options.
By default, larger HUMAnN outputs do not attach names (glosses) to individual feature IDs to keep file sizes down. This script is used to attach those names. The most common use for this script is attaching gene names to UniRef90/UniRef50 features in the default genefamilies.tsv output files.
Features without names (common for many UniRef90/50 families) are assigned the default name: NO_NAME.
$ humann_renorm_table --input $TABLE --units $CHOICE --output $TABLE2
$TABLE = gene/pathway table (tsv format)
$CHOICE = "relab" (relative abundance) or "cpm" (copies per million)
$TABLE2 = normalized gene/pathway table
Run with -h to see additional command line options.
HUMAnN quantifies genes and pathways in units of RPKs (reads per kilobase). These account for gene length but not sample sequencing depth. While there are some applications, e.g. strain profiling, where RPK units are superior to depth-normalized units, most of the time a user will renormalize their samples prior to downstream analysis.
This script provides the choice to normalize to relative abundance or copies per million (CPM) units. Both of these represent "total sum scaling (TSS)"-style normalization: in the former case, each sample is constrained to sum to 1, whereas in the latter case (CPMs) samples are constrained to sum to 1 million. Units out of 1 million are often more convenient for tables with many features (such as genefamilies.tsv tables).
Note: CPM as used here does not refer to unnormalized COUNTS per million, but rather copies per million. CPMs as used here are a generic analog of the TPM (transcript per million) unit in RNA-seq. You may wish to use the abbreviation CoPM for added clarity.
By default, this script normalizes all stratification levels to the sum of all community feature totals, but other options (such as level-wise normalization) are supported. "Special" features (such as UNMAPPED) can be included or excluded in the normalization process.
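As a sketch of the default behavior described above, the toy Python function below TSS-normalizes one sample column, using the community totals (rows without "|") as the denominator and optionally excluding "special" features from the sum. The function and data are illustrative, not HUMAnN internals:

```python
def renorm(column, units="relab", include_special=False):
    """TSS-normalize one sample column ({feature ID: RPK value}).

    Community-level rows (no '|' in the ID) define the denominator;
    stratified rows are scaled by the same factor. Special features
    such as UNMAPPED can be excluded from the sum.
    """
    special = {"UNMAPPED", "UNINTEGRATED", "UNGROUPED"}
    total = sum(v for k, v in column.items()
                if "|" not in k and (include_special or k not in special))
    scale = (1.0 if units == "relab" else 1e6) / total
    return {k: v * scale for k, v in column.items()}

sample = {"PWY-1": 60.0, "PWY-1|g__X.s__Y": 40.0, "PWY-2": 40.0}
print(renorm(sample, "relab"))
print(renorm(sample, "cpm"))
```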
$ humann_split_stratified_table --input $TABLE --output $OUTPUT_DIR
$TABLE = gene/pathway table (tsv or biom format)
$OUTPUT_DIR = the directory to write two new gene/pathway tables, one stratified and one unstratified (in biom format if input is biom format)
Run with -h to see additional command line options.
This utility will split a table into two files (one stratified and one unstratified).
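The stratified/unstratified distinction is simply whether a feature ID contains "|". A minimal Python sketch on toy rows:

```python
# Toy (feature, abundance) rows from a pathway abundance table.
rows = [
    ("UNMAPPED", 140.0),
    ("PWY0-1301: melibiose degradation", 57.5),
    ("PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_caccae", 32.5),
    ("PWY0-1301: melibiose degradation|unclassified", 3.0),
]

# A row is stratified exactly when its feature ID contains '|'.
stratified = [row for row in rows if "|" in row[0]]
unstratified = [row for row in rows if "|" not in row[0]]

print(len(stratified), "stratified rows")
print(len(unstratified), "unstratified rows")
```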
$ humann_split_table --input $TABLE --output $OUTPUT_DIR
$TABLE = gene/pathway table (tsv or biom format)
$OUTPUT_DIR = the directory to write new gene/pathway tables (one per sample, in biom format if input is biom format)
Run with -h to see additional command line options.
This utility will split a merged feature table (multiple samples) into one file per sample. Some analyses can only accept one sample at a time.
Utility for executing HUMAnN tests. See Test the install above for further details.
$ humann_unpack_pathways --input-genes $INPUT_GENES.tsv --input-pathways $INPUT_PATHWAYS.tsv --output $OUTPUT.tsv
$INPUT_GENES.tsv = a file containing the gene families (or EC) abundance table (tsv format)
$INPUT_PATHWAYS.tsv = a file containing the pathways abundance table (tsv format)
$OUTPUT.tsv = the file to write the new abundance table (tsv format)
--remove-taxonomy removes the taxonomy from the output file
Run with -h to see additional command line options.
This utility will unpack the pathways to show the genes for each. It adds another level of stratification to the pathway abundance table by including the gene family (or EC) abundances.
HUMAnN uses UniRef protein clusters as a gene families system. UniRef clusters are constructed by clustering proteins from UniProt to remove redundancy (UniRef100), further clustering non-redundant proteins at 90% identity and selecting representative sequences (UniRef90), and further clustering UniRef90 representative sequences at 50% identity to produce broader clusters (UniRef50). The representative of a given UniRef cluster is generally the best-annotated member of the cluster (which may or may not be the true centroid of the cluster). Additional information about UniRef can be found at the UniRef website and in the original UniRef publication.
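The clustering idea can be illustrated with a toy greedy scheme: each sequence joins the first representative it matches at or above the identity threshold, otherwise it founds a new cluster. This is only a sketch; real UniRef construction uses alignment-based identity computed by dedicated clustering tools, and the identity function here is a naive per-position comparison:

```python
def identity(a, b):
    """Toy per-position identity for equal-length sequences; real pipelines
    compute identity from sequence alignments."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering in the spirit of UniRef90/UniRef50:
    a sequence joins the first representative it matches at >= threshold,
    otherwise it becomes the representative of a new cluster."""
    reps, clusters = [], {}
    for s in seqs:
        for r in reps:
            if identity(s, r) >= threshold:
                clusters[r].append(s)
                break
        else:
            reps.append(s)
            clusters[s] = [s]
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "GGGGGGGGGG"]
clusters = greedy_cluster(seqs, 0.9)  # 90%-identity clusters, as in UniRef90
print(clusters)
```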
HUMAnN falls back to translated search for reads that failed to align to a known pangenome. The scope of this translated search can be tuned, resulting in a balance between the fraction of unclassified reads mapped and performance. There are three possible modes:
Bypass translated search is selected via the --bypass-translated-search
flag. In this mode, reads that failed to map to a pangenome are saved, but not otherwise searched at the protein level. There will be no unclassified
strata in your output.
Filtered translated search is selected by using an EC-filtered protein database. These databases (which come in UniRef90 and UniRef50 flavors) have been filtered to only include proteins associated with a level-4 enzyme commission (EC) category. These are the only proteins that would be able to contribute to downstream pathway reconstruction. Thus, filtered translated search will provide a comprehensive metabolic reconstruction for the unclassified
stratum, but the number of unclassified
protein families in the output will be limited to those with EC annotation.
Comprehensive translated search is selected by using a comprehensive protein database. These databases (which come in UniRef90 and UniRef50 flavors) have not been filtered, and are quite large (millions of sequences). Comprehensive search aims to explain as many sample reads as possible using reference-based approaches. Comprehensive search takes ~5x longer than filtered search [a difference of 5 versus 25 CPU-hours for a 10M read sample (using DIAMOND and UniRef50)]. Memory requirements are also larger for the larger database, though the increase is sublinear.
See above for instructions on how to acquire a translated search database.
For many users, filtered translated search will serve as a good default option. This will still provide 1,000s of gene-level features and an uncompromised metabolic network for the unclassified
stratum. For extremely well-characterized samples, bypassing translated search is a reasonable option. For example, in the healthy human gut, the majority of sample reads (~80%) that can be mapped to a reference are mapped prior to translated search.
End-pairing relationships are currently not taken into account during HUMAnN's alignment steps. This is due to the fact that HUMAnN strictly aligns reads to isolated coding sequences: either at the nucleotide level or through translated search. As such, it will frequently be the case that one read (READ1
) will map inside a given coding sequence while its mate-pair (READ2
) will not.
Example:
GENEGENEGENE
READ1-------READ2
Penalizing such cases would be overly strict: in the absence of the gene's genomic context, this looks like a perfectly reasonable alignment (READ2
may fall in a non-coding region and not align, or it may align to another [isolated] coding sequence). As a result, the best way to use paired-end sequencing data with HUMAnN is simply to concatenate all reads into a single FASTA or FASTQ file.
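Concatenating paired reads into a single input file can be done with standard tools; a minimal Python sketch follows (the file names and read contents are placeholders for illustration).

```python
# Sketch: concatenate R1 and R2 read files into a single input file,
# as recommended for paired-end data. File names are placeholders.
import os
import shutil
import tempfile

def concatenate(read_files, output_path):
    with open(output_path, "wb") as out:
        for path in read_files:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)

# Build a tiny example pair of fastq files and combine them.
tmp = tempfile.mkdtemp()
r1 = os.path.join(tmp, "sample_R1.fastq")
r2 = os.path.join(tmp, "sample_R2.fastq")
combined = os.path.join(tmp, "sample.fastq")
with open(r1, "w") as f:
    f.write("@read1/1\nACGT\n+\nIIII\n")
with open(r2, "w") as f:
    f.write("@read1/2\nTGCA\n+\nIIII\n")
concatenate([r1, r2], combined)
```

The combined file is then provided to humann as a single fastq input.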
If you have paired-end Illumina sequencing data processed by CASAVA v1.8+, the sequence identifiers for the pairs are identical after truncating at the space. To keep track of the individual reads throughout the HUMAnN workflow, the software will initially remove the spaces from the identifiers. This prevents read pairs from having the same id after being processed by bowtie2 and diamond. If this is the case for your read set, you will see an informative message printed to the screen and in the log file indicating the removal of spaces from the sequence identifiers.
You can run HUMAnN with PICRUSt output from predict_metagenomes.py or metagenome_contributions.py as input. Output from metagenome_contributions.py can include taxonomy information which will be used by HUMAnN. The steps that follow are the same for output files from either PICRUSt script.
For information on the PICRUSt software, please see the project website. For questions about PICRUSt, please contact the PICRUSt user group. The instructions that follow are for PICRUSt version 1.0.0-dev.
If you are running HUMAnN with PICRUSt output as input, please follow these steps:
Download the legacy KEGG databases included in HUMAnN 1.0
Split the picrust output file (picrust.biom or picrust.tsv) into a single file per sample (written to $OUTPUT_DIR)
$ humann_split_table --input picrust.biom --output $OUTPUT_DIR
--taxonomy_index -1
can be added if taxonomy information is included in the biom input file, with column -1 associated with KOs.
Run HUMAnN on each of the new files in $OUTPUT_DIR, placing the results in $OUTPUT_DIR2
$ humann --input $SAMPLE.biom --output $OUTPUT_DIR2 --pathways-database humann1/data/keggc
--remove-stratified-output
can be added if you do not want the data stratified by bug.
--output-format biom
can be added if you want the output to be in biom format.
--pathways-database humann1/data/modulec
can be substituted to compute legacy KEGG modules instead of pathways.
Join the pathways data files from the HUMAnN runs from all samples into two files
$ humann_join_tables --input $OUTPUT_DIR2 --output humann_4_pathabundance.tsv --file_name pathabundance
Please note the flag --verbose
can be added to all commands.
A joint taxonomic profile can be created from all of the samples in your set. To create this file and use it for your HUMAnN runs, please use the steps that follow.
Create taxonomic profiles for each of the samples in your set with MetaPhlAn
Join all of the taxonomic profiles, located in directory $DIR, into a table of taxonomic profiles for all samples (joined_taxonomic_profile.tsv)
$ humann_join_tables --input $DIR --output joined_taxonomic_profile.tsv
Reduce this file into a taxonomic profile that represents the maximum abundances from all of the samples in your set
$ humann_reduce_table --input joined_taxonomic_profile.tsv --output max_taxonomic_profile.tsv --function max --sort-by level
Run HUMAnN on all of the samples in your set, providing the max taxonomic profile
$ humann --input $SAMPLE.fastq --output $OUTPUT_DIR --taxonomic-profile max_taxonomic_profile.tsv
An alternative to step 4, which will save computing time, is to first run a single sample with the taxonomic profile. The HUMAnN temp output folder for this sample will contain the bowtie2-indexed custom ChocoPhlAn database, which can be provided when running your remaining samples. This saves compute time because the database will only be created once. Please see the steps below for the alternative to step 4.
Run HUMAnN on one of your samples ($SAMPLE_1.fastq) providing the max taxonomic profile to create the custom indexed ChocoPhlAn database
$ humann --input $SAMPLE_1.fastq --output $OUTPUT_DIR --taxonomic-profile max_taxonomic_profile.tsv
Run HUMAnN on the rest of your samples providing the custom indexed ChocoPhlAn database ($OUTPUT_DIR/$SAMPLE_1_humann_temp/)
$ humann --input $SAMPLE.fastq --output $OUTPUT_DIR --nucleotide-database $OUTPUT_DIR/$SAMPLE_1_humann_temp/ --bypass-nucleotide-index
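The "max" reduction applied in step 3 can be sketched in Python. This is a simplified illustration of the idea only, not the humann_reduce_table implementation (which also sorts output by clade level); the taxa and values are made-up examples.

```python
# Sketch: for each taxon in a joined taxonomic profile, keep the maximum
# abundance observed across all samples. Illustration only.

def reduce_max(joined):
    """joined: {taxon: [abundance per sample]} -> {taxon: max abundance}"""
    return {taxon: max(values) for taxon, values in joined.items()}

max_profile = reduce_max({
    "g__Bacteroides|s__Bacteroides_caccae": [11.4, 3.2, 0.0],
    "g__Dialister|s__Dialister_invisus": [0.0, 10.5, 2.1],
})
```

The resulting maximum profile guarantees that any species detected in any sample contributes its pangenome to the shared custom ChocoPhlAn database.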
A custom taxonomic profile can be created to specify the taxa included in your samples. This file is used by HUMAnN to create the custom ChocoPhlAn database for your samples.
The custom taxonomic profile must be in a tab-delimited format and contain two columns (taxon, with both genus and species, and percent abundance). An example follows:
g__Bacteroides|s__Bacteroides_thetaiotaomicron 12.16326
g__Bacteroides|s__Bacteroides_cellulosilyticus 12.02768
g__Bacteroides|s__Bacteroides_caccae 11.43394
g__Dialister|s__Dialister_invisus 10.52286
g__Bacteroides|s__Bacteroides_stercoris 10.42227
HUMAnN uses the taxonomic profile to select pangenomes for the custom ChocoPhlAn database from the full ChocoPhlAn database. For example, the first line in the example above will add the centroids from Bacteroides thetaiotaomicron to the custom ChocoPhlAn database. Please note the taxa in the custom taxonomic profile must match the naming convention used by the full ChocoPhlAn database (ignoring case).
To run HUMAnN with the custom taxonomic profile ($FILE), use the option "--taxonomic-profile $FILE". This will bypass running MetaPhlAn, which creates a taxonomic profile, and instead will use the custom taxonomic profile provided. From the custom taxonomic profile, only those taxa that have a percent abundance greater than the default prescreen threshold will be considered. To change the default setting for the prescreen threshold to $THRESHOLD, use the option "--prescreen-threshold $THRESHOLD".
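The prescreen filtering described above can be sketched as follows. This is an illustration of the threshold behavior only, not HUMAnN's internal code; the default of 0.5 is taken from the humann help text, and the example abundances are made up.

```python
# Sketch: only taxa whose percent abundance exceeds the prescreen
# threshold contribute pangenomes to the custom ChocoPhlAn database.
# Illustration only.

def passing_taxa(profile_lines, threshold=0.5):
    kept = []
    for line in profile_lines:
        taxon, abundance = line.split("\t")
        if float(abundance) > threshold:
            kept.append(taxon)
    return kept

taxa = passing_taxa([
    "g__Bacteroides|s__Bacteroides_thetaiotaomicron\t12.16326",
    "g__Dialister|s__Dialister_invisus\t0.40000",
])
```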
A custom nucleotide reference database can be provided to HUMAnN.
This custom database must be formatted as a bowtie2 index.
Please see the Custom reference database annotations section for information on database annotations. Also please note, only alignments to genes included in the pathways databases will be considered in the pathways computations. The pathways databases included with HUMAnN are for alignments to UniRef gene families. If you would like to create custom pathways databases for a different set of gene families, please see the Custom pathways database section for more information.
To run HUMAnN with your custom nucleotide reference database (located in $DIR), use the option "--bypass-nucleotide-index" and provide the custom database as the ChocoPhlAn option with "--nucleotide-database $DIR". If you would like to bypass the translated alignment portion of HUMAnN, add the option "--bypass-translated-search".
A custom protein reference database can be provided to HUMAnN.
This custom database must be formatted to be used by the translated alignment software selected (the default is DIAMOND).
Please see the Custom reference database annotations section for information on database annotations. Also please note, only alignments to genes included in the pathways databases will be considered in the pathways computations. The pathways databases included with HUMAnN are for alignments to UniRef gene families. If you would like to create custom pathways databases for a different set of gene families, please see the Custom pathways database section for more information.
To run HUMAnN with your custom protein reference database (located in $DIR), provide the custom database as the UniRef option with "--protein-database $DIR". Please note, HUMAnN will run on all of the databases in this folder ($DIR) which have been formatted to be used by the translated alignment software selected. Also if you would like to bypass the nucleotide alignment portion of HUMAnN, add the option "--bypass-nucleotide-search".
Note: HUMAnN can work with a protein database that's been split into chunks for computational efficiency (mapping the reads to them serially). Therefore HUMAnN models a protein database as a folder with one or more database chunks in it (derived from the same input set of protein sequences). If you mix unrelated chunks in the same folder HUMAnN will (correctly) complain at you. Hence if you want to have two separate DBs, even if each is only represented by a single chunk, they should be in separate folders.
The annotations for sequences in a custom (nucleotide or protein) reference database can be as follows (delimited by "|"):
For options 1 and 2, HUMAnN defaults will be used. The default gene length is 1,000 bases and the default taxonomy is "unclassified".
Option 4 should be used along with a custom reference database annotation file. The custom reference database annotation file maps the identifiers to annotations. This file must be in a tab-delimited format and contain at least two columns (identifier and gene family). At most four columns of information can be included to describe each reference sequence. These columns should be organized as identifier, gene family, gene length, and taxonomy. An example follows:
256402719 UniRef50_C9LQU5 147 g__Dialister.s__Dialister_invisus
479150083 UniRef50_R6U703 540 g__Ruminococcus.s__Ruminococcus_bromii
423220654 UniRef50_I8UUJ6 1218 g__Bacteroides.s__Bacteroides_caccae
The first line of the example will use the gene family UniRef50_C9LQU5, gene length 147, and taxon g__Dialister.s__Dialister_invisus
for any sequences in your reference databases with the identifier 256402719.
To run HUMAnN with the custom reference database annotations ($FILE), use the option "--id-mapping $FILE".
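Parsing such an annotation file can be sketched as follows. This is an illustration of the file format only, not HUMAnN's parser; the defaults (gene length 1,000 and taxonomy "unclassified") are the ones stated above, and the example rows come from the table above.

```python
# Sketch: parse a tab-delimited id-mapping file with identifier and gene
# family required, gene length and taxonomy optional. Illustration only.

def parse_id_mapping(lines):
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        identifier, gene_family = fields[0], fields[1]
        # fall back to the documented defaults for missing columns
        length = int(fields[2]) if len(fields) > 2 else 1000
        taxonomy = fields[3] if len(fields) > 3 else "unclassified"
        mapping[identifier] = (gene_family, length, taxonomy)
    return mapping

mapping = parse_id_mapping([
    "256402719\tUniRef50_C9LQU5\t147\tg__Dialister.s__Dialister_invisus",
    "479150083\tUniRef50_R6U703",
])
```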
The pathways databases included with HUMAnN (from MetaCyc and UniProt) have been created to be used with alignments to UniRef gene families. A custom pathways database can be provided to HUMAnN, specifically made to work with your custom reference database(s).
One or two pathways database files (in a comma-delimited list) can be provided to HUMAnN with the option "--pathways-database $FILE". If two files are provided, the first file provides a tab-delimited mapping while the second file provides the pathways mapping. For example, the first file could provide a mapping of gene families to reactions while the second file maps reactions to pathways.
The first file, which is optional, should be organized to include at least two columns per line. The first column is the item to be mapped to (i.e., reactions in the example), while the remaining columns in the row are the gene families which can be mapped to the item in the first column. An example follows:
RXN-123 UniRef50_A0B6Z6 UniRef50_A3CRP6
RXN-456 UniRef50_A2RVM0 UniRef50_A4IGM4 UniRef50_A6NKP2 UniRef50_B8H806
The pathways file is a tab-delimited file with the pathway name in the first column. It can be in a structured or unstructured format. In a structured format, the second column lists the items (space-delimited) contained in the pathway. The structure follows the same definition as KEGG modules: each structure is a space-delimited list of items, with a comma indicating alternatives, a plus indicating a complex, and a minus sign at the beginning of an item indicating it is not a key item in the pathway. These items are gene families if only one pathways file is provided; if two files are provided, as in the example, they are reactions. In an unstructured format, the second and any remaining columns in the row are the items that map to the pathway. For example, in an unstructured format, the line for a pathway containing four genes would contain five columns: the pathway name followed by one gene per column.
An example of a structured pathways file follows:
PWY-1 A B ( C , D )
PWY-2 ( ( A + B ) , ( C + D ) ) E
An example of an unstructured pathways file follows:
PWY-3 A B C D
PWY-4 A B C D E
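Reading an unstructured pathways file can be sketched as follows. This illustrates the file format only, and is not HUMAnN's pathway quantification (which additionally uses MinPath and gap filling).

```python
# Sketch: parse an unstructured pathways file, where each tab-delimited
# line maps a pathway name to the items (gene families or reactions) it
# contains. Illustration of the format only.

def parse_unstructured(lines):
    pathways = {}
    for line in lines:
        fields = line.split("\t")
        pathways[fields[0]] = fields[1:]
    return pathways

pathways = parse_unstructured([
    "PWY-3\tA\tB\tC\tD",
    "PWY-4\tA\tB\tC\tD\tE",
])
```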
HUMAnN output files can be provided to QIIME as input to run core diversity analysis. For information on the QIIME software, please see the project website. For questions about QIIME, please contact the QIIME user group. The instructions that follow are for QIIME version 1.9.1.
To run this analysis, run the following steps:
Run HUMAnN on each of the samples (replacing $OUTPUT_DIR with the full path to the folder to write the output)
$ humann --input $SAMPLE.fastq --output $OUTPUT_DIR --output-format biom --remove-stratified-output --output-max-decimals 0
Merge the three output files for each sample into three output files for all samples using QIIME's merge_otu_tables.py
$ merge_otu_tables.py -i $SAMPLE1_2_genefamilies.biom,$SAMPLE2_2_genefamilies.biom,$SAMPLE3_2_genefamilies.biom -o genefamilies_all.biom
$ merge_otu_tables.py -i $SAMPLE1_4_pathabundance.biom,$SAMPLE2_4_pathabundance.biom,$SAMPLE3_4_pathabundance.biom -o pathabundance_all.biom
For each of the merged biom files, run QIIME's biom summarize-table to obtain the sampling depth (e-value) required as input for the next step.
$ biom summarize-table -i genefamilies_all.biom -o genefamilies_summary.txt
$ biom summarize-table -i pathabundance_all.biom -o pathabundance_summary.txt
Next run each merged biom file through QIIME's core_diversity_analyses.py, providing the e-value from the prior step (replacing $EVALUE for each input file) and the QIIME mapping file you created (replacing $MAPPING_FILE with the full path to the mapping file).
$ core_diversity_analyses.py -i genefamilies_all.biom -o core_diversity_genefamilies -m $MAPPING_FILE --nonphylogenetic_diversity -e $EVALUE --suppress_taxa_summary
$ core_diversity_analyses.py -i pathabundance_all.biom -o core_diversity_pathabundance -m $MAPPING_FILE --nonphylogenetic_diversity -e $EVALUE --suppress_taxa_summary
The output of QIIME's core_diversity_analyses.py will include PCoA plots of beta diversity analyses which can be viewed in the beta diversity folder, in the bray_curtis_emperor_pcoa_plot folder. Click on index.html, which will open a web browser with the plot. Using the tools on the right side, it is possible to change the colors by which the samples are labeled.
Under the color tab, there are two variables that can be changed:
The more categories in your mapping file, the more options you have to see if your samples are separating based on that feature.
The recommended HUMAnN metatranscriptome (RNA) analysis protocol differs depending on whether or not you have metagenomic (DNA) reads available from the same biological sample.
Analyzing a metatranscriptome with a paired metagenome. In this case, we recommend analyzing the metagenome first. Then, rather than constructing a taxonomic profile directly from the metatranscriptome, use the taxonomic profile of the corresponding metagenome as an additional input to HUMAnN via the --taxonomic-profile
flag. This will guarantee that RNA reads are mapped to any species' pangenomes detected in the metagenome. RNA reads are otherwise provided as input to HUMAnN just as DNA reads are. HUMAnN RNA-level outputs (e.g. transcript family abundance) can then be normalized by corresponding DNA-level outputs to quantify microbial expression independent of gene copy number. CAVEAT: For low-abundance species, random sampling may lead to detection of transcripts for undetected genes. In these cases, we recommend smoothing DNA-level features to avoid divide-by-zero errors during normalization.
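The DNA-level normalization and smoothing mentioned above can be sketched as follows. This is an illustration only, not a HUMAnN feature: the pseudocount value and the feature names are assumptions for the example, chosen simply to show how smoothing avoids divide-by-zero.

```python
# Sketch: normalize RNA-level (transcript) abundances by DNA-level (gene
# copy) abundances, adding a smoothing pseudocount to the DNA values so
# that transcripts detected for undetected genes do not cause a
# divide-by-zero. The pseudocount of 1.0 is an arbitrary example value.

def normalize_rna_by_dna(rna, dna, pseudocount=1.0):
    return {feature: abundance / (dna.get(feature, 0.0) + pseudocount)
            for feature, abundance in rna.items()}

ratios = normalize_rna_by_dna(
    {"UniRef90_A": 20.0, "UniRef90_B": 5.0},
    {"UniRef90_A": 9.0},  # UniRef90_B has no detected gene copies
)
```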
Analyzing an isolated metatranscriptome (without a paired metagenome). In this case, analyze RNA read data just as you would DNA data (provided as a fasta/fastq file). CAVEAT 1: Note that species are quantified in HUMAnN based on recruitment of reads to species-specific marker genes. While each genome copy is assumed to donate ~1 copy of each marker to metagenome (DNA) data, the same assumption cannot be made for RNA data (markers may be more or less transcribed within a species compared to the species average). As long as a non-trivial fraction of a species' markers are expressed, HUMAnN will still detect that species in the transcript pool. However, species relative abundance estimates from the taxonomic profile must be interpreted carefully: these values reflect species' relative contributions to the pool of species-specific transcripts, and not the overall transcript pool. CAVEAT 2: Transcript abundance inferred from a lone metatranscriptome is confounded with underlying gene copy number. For example, transcript X may be more abundant in sample A relative to sample B because (i) the same number of underlying X genes are more highly expressed in sample A than in sample B, or (ii) there are more copies of gene X in sample A than in sample B (all of which are equally expressed). This is a general challenge in analyzing isolated metatranscriptomes (not specific to HUMAnN).
By default, the gene families and pathways output files from HUMAnN are species level. To obtain genus level gene families and pathways, follow these steps.
Create a genus level gene families file
$ humann_gene_families_genus_level --input $SAMPLE_2_genefamilies.tsv --output $SAMPLE_genefamilies_genus_level.tsv
Replace $SAMPLE_2_genefamilies.tsv
with the species level gene families file created by default by HUMAnN and $SAMPLE_genefamilies_genus_level.tsv
with the name of the genus level gene families file that will be created.
Run HUMAnN, with the genus level gene families file as input, to get genus level pathways output files
$ humann --input $SAMPLE_genefamilies_genus_level.tsv --output humann_genus_level_output
HUMAnN frequently asked questions reference the following command line options:
--verbose flag
--threads $CORES option
--remove-temp-output flag
--nucleotide-database $DIR option
--protein-database $DIR option
--taxonomic-profile file.tsv option
--output-basename $NAME option
--remove-stratified-output flag
--pathways unipathway option
--output-format biom option
--identity-threshold <50.0> option
--metaphlan-options="-t rel_ab" option
usage: humann [-h] -i <input.fastq> -o <output> [--threads <1>] [--version]
[-r] [--bypass-nucleotide-index] [--bypass-nucleotide-search]
[--bypass-prescreen] [--bypass-translated-search]
[--taxonomic-profile <taxonomic_profile.tsv>]
[--memory-use {minimum,maximum}]
[--input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}]
[--metaphlan <metaphlan>]
[--metaphlan-options <metaphlan_options>]
[--prescreen-threshold <0.5>] [--bowtie2 <bowtie2>]
[--bowtie-options <bowtie_options>]
[--nucleotide-database <nucleotide_database>]
[--nucleotide-identity-threshold <0.0>]
[--nucleotide-query-coverage-threshold <90.0>]
[--nucleotide-subject-coverage-threshold <50.0>]
[--diamond <diamond>] [--diamond-options <diamond_options>]
[--evalue <1.0>] [--pick-frames {on,off}]
[--protein-database <protein_database>]
[--translated-alignment {diamond}]
[--translated-identity-threshold <50.0>]
[--translated-query-coverage-threshold <90.0>]
[--translated-subject-coverage-threshold <50.0>]
[--gap-fill {on,off}] [--minpath {on,off}]
[--pathways {metacyc,unipathway}]
[--pathways-database <pathways_database.tsv>] [--xipe {on,off}]
[--annotation-gene-index <3>] [--id-mapping <id_mapping.tsv>]
[--remove-temp-output]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--o-log <sample_0.log>] [--output-basename <sample_name>]
[--output-format {tsv,biom}] [--output-max-decimals <10>]
[--remove-column-description-output]
[--remove-stratified-output]
HUMAnN : HMP Unified Metabolic Analysis Network
optional arguments:
-h, --help show this help message and exit
[0] Common settings:
-i <input.fastq>, --input <input.fastq>
input file of type {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
[REQUIRED]
-o <output>, --output <output>
directory to write output files
[REQUIRED]
--threads <1> number of threads/processes
[DEFAULT: 1]
--version show program's version number and exit
[1] Workflow refinement:
-r, --resume bypass commands if the output files exist
--bypass-nucleotide-index
bypass the nucleotide index step and run on the indexed ChocoPhlAn database
--bypass-nucleotide-search
bypass the nucleotide search steps
--bypass-prescreen bypass the prescreen step and run on the full ChocoPhlAn database
--bypass-translated-search
bypass the translated search step
--taxonomic-profile <taxonomic_profile.tsv>
a taxonomic profile (the output file created by metaphlan)
[DEFAULT: file will be created]
--memory-use {minimum,maximum}
the amount of memory to use
[DEFAULT: minimum]
--input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
the format of the input file
[DEFAULT: format identified by software]
-v, --verbose additional output is printed
[2] Configure tier 1: prescreen:
--metaphlan <metaphlan>
directory containing the MetaPhlAn software
[DEFAULT: $PATH]
--metaphlan-options <metaphlan_options>
options to be provided to the MetaPhlAn software
[DEFAULT: "-t rel_ab"]
--prescreen-threshold <0.5>
minimum estimated genome coverage for inclusion in pangenome search
[DEFAULT: 0.5]
[3] Configure tier 2: nucleotide search:
--bowtie2 <bowtie2> directory containing the bowtie2 executable
[DEFAULT: $PATH]
--bowtie-options <bowtie_options>
options to be provided to the bowtie software
[DEFAULT: "--very-sensitive"]
--nucleotide-database <nucleotide_database>
directory containing the nucleotide database
[DEFAULT: /home/ljmciver/miniconda3/envs/humann3/lib/python3.6/site-packages/humann/data/chocophlan_DEMO]
--nucleotide-identity-threshold <0.0>
identity threshold for nucleotide alignments
[DEFAULT: 0.0]
--nucleotide-query-coverage-threshold <90.0>
query coverage threshold for nucleotide alignments
[DEFAULT: 90.0]
--nucleotide-subject-coverage-threshold <50.0>
subject coverage threshold for nucleotide alignments
[DEFAULT: 50.0]
[3] Configure tier 2: translated search:
--diamond <diamond> directory containing the diamond executable
[DEFAULT: $PATH]
--diamond-options <diamond_options>
options to be provided to the diamond software
[DEFAULT: "--top 1 --outfmt 6"]
--evalue <1.0> the evalue threshold to use with the translated search
[DEFAULT: 1.0]
--protein-database <protein_database>
directory containing the protein database
[DEFAULT: /home/ljmciver/miniconda3/envs/humann3/lib/python3.6/site-packages/humann/data/uniref_DEMO]
--translated-alignment {diamond}
software to use for translated alignment
[DEFAULT: diamond]
--translated-identity-threshold <50.0>
identity threshold for translated alignments
[DEFAULT: 50.0]
--translated-query-coverage-threshold <90.0>
query coverage threshold for translated alignments
[DEFAULT: 90.0]
--translated-subject-coverage-threshold <50.0>
subject coverage threshold for translated alignments
[DEFAULT: 50.0]
[5] Gene and pathway quantification:
--gap-fill {on,off} turn on/off the gap fill computation
[DEFAULT: on]
--minpath {on,off} turn on/off the minpath computation
[DEFAULT: on]
--pathways {metacyc,unipathway}
the database to use for pathway computations
[DEFAULT: metacyc]
--pathways-database <pathways_database.tsv>
mapping file (or files, at most two in a comma-delimited list) to use for pathway computations
[DEFAULT: metacyc database ]
--xipe {on,off} turn on/off the xipe computation
[DEFAULT: off]
--annotation-gene-index <3>
the index of the gene in the sequence annotation
[DEFAULT: 3]
--id-mapping <id_mapping.tsv>
id mapping file for alignments
[DEFAULT: alignment reference used]
[6] More output configuration:
--remove-temp-output remove temp output files
[DEFAULT: temp files are not removed]
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
level of messages to display in log
[DEFAULT: DEBUG]
--o-log <sample_0.log> log file
[DEFAULT: sample_0.log]
--output-basename <sample_name>
the basename for the output files
[DEFAULT: input file basename]
--output-format {tsv,biom}
the format of the output files
[DEFAULT: tsv]
--output-max-decimals <10>
the number of decimals to output
[DEFAULT: 10]
--remove-column-description-output
remove the description in the output column
[DEFAULT: output column includes description]
--remove-stratified-output
remove stratification from output
[DEFAULT: output is stratified]
Please sign up for the HUMAnN category on the bioBakery Support Forum if you have any questions or concerns.
Additionally, the Google user group humann-users@googlegroups.com (read only) is available.