HiTaxon is an automated framework for creating custom short-read taxonomic classifiers for various environments. Given a list of genera, HiTaxon downloads and processes assemblies from RefSeq to produce a set of non-redundant sequences for the species within those genera. HiTaxon then builds a custom database for a primary reference-dependent classifier and uses that classifier to generate taxonomic predictions on input FASTA files. The genus-level predictions of the reference-dependent classifier determine which HiTaxon-built specialized classifiers, each covering one genus and its set of species, are used for species-level predictions.
Please refer to our paper for more information:
Verma, B. and Parkinson, J. HiTaxon: A hierarchical ensemble framework for taxonomic classification of short reads. Bioinformatics Advances. 2024. DOI: https://doi.org/10.1093/bioadv/vbae016
Data used to evaluate classifiers in our study can be accessed at:
https://doi.org/10.5281/zenodo.8335901
Clone the repository
git clone https://github.com/ParkinsonLab/HiTaxon
Navigate to the repository
cd HiTaxon
Create conda environment (can take ~30 minutes)
conda env create --name HiTaxon --file environment.yml
Activate conda environment
conda activate HiTaxon
Build HiTaxon
pip install .
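Before starting the installation, it can help to confirm that the tools the steps above rely on are available. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of HiTaxon.

```shell
# Hypothetical helper (not part of HiTaxon): print each named tool
# that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# The installation steps above call git and conda directly.
missing="$(missing_tools git conda)"
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "install these before continuing:" $missing
else
    echo "git and conda found"
fi
```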
Note: HiTaxon's database and classifier construction functionality was tested on SciNet's Niagara Linux systems. Evaluation of pre-made classifiers was tested on both the Niagara cluster and an Intel-based MacBook Pro.
To use HiTaxon, create a configuration file with the structure below and save it as config.file in the HiTaxon directory:
ASSEMBLY_SUMMARY=default
GENUS_NAMES=/path/to/list_of_genera
KRAKEN_NAME=desired_database_name
KRAKEN_PATH=/path/to/store_database
MODEL_PATH=/path/to/store_ML_models
NUM_OF_THREADS=number_of_threads
OUTPUT_PATH=/path/to/download_and_process_RefSeq_data
REPORT_PATH=/path/to/store_taxonomic_predictions
BWA_PATH=/path/to/store_bwa_indices
Note 1: If ASSEMBLY_SUMMARY is left as default, HiTaxon will download and use the latest assembly summary file from NCBI. To use a specific version, provide the path to that assembly_summary text file.
Note 2: For directory paths (i.e., variables with PATH in the name), HiTaxon will create the directory if it does not exist.
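As an illustration of the template above, the sketch below writes a config file with placeholder values and checks that all nine keys are present and non-empty. The function name `check_config`, the scratch directory, and the example values (`example_db`, `4`, and the paths) are assumptions for demonstration only; a real config.file belongs in the HiTaxon directory.

```shell
# Hypothetical helper (not part of HiTaxon): report any expected key that is
# missing or has an empty value. Returns 0 only when every key is defined.
check_config() {
    local config="$1" key missing=0
    for key in ASSEMBLY_SUMMARY GENUS_NAMES KRAKEN_NAME KRAKEN_PATH \
               MODEL_PATH NUM_OF_THREADS OUTPUT_PATH REPORT_PATH BWA_PATH; do
        # A key counts as defined if a line starts with KEY= plus at least one character.
        if ! grep -q "^${key}=..*" "$config"; then
            echo "missing or empty: ${key}"
            missing=1
        fi
    done
    return "$missing"
}

# Write an example config into a scratch directory (placeholder values only).
workdir="$(mktemp -d)"
cat > "${workdir}/config.file" <<EOF
ASSEMBLY_SUMMARY=default
GENUS_NAMES=${workdir}/taxon.txt
KRAKEN_NAME=example_db
KRAKEN_PATH=${workdir}/kraken
MODEL_PATH=${workdir}/models
NUM_OF_THREADS=4
OUTPUT_PATH=${workdir}/refseq
REPORT_PATH=${workdir}/reports
BWA_PATH=${workdir}/bwa
EOF

check_config "${workdir}/config.file" && echo "config.file defines all nine keys"
```

The check only looks at key names; it does not verify that the paths are valid or that the values are ones HiTaxon accepts.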
The text file referenced by GENUS_NAMES must be structured as follows, with one genus per line:
Bacillus
Enterococcus
Escherichia
Lactobacillus
Listeria
Staphylococcus
Salmonella
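The genus list above can be written and sanity-checked from the shell. This is a sketch under assumptions: the helper `check_genus_list` is hypothetical, and the rules it enforces (no blank lines, no whitespace, no duplicates) are a conservative reading of the format shown here, not documented HiTaxon requirements.

```shell
# Write the example genus list, one name per line, into a scratch directory.
workdir="$(mktemp -d)"
printf '%s\n' Bacillus Enterococcus Escherichia Lactobacillus \
              Listeria Staphylococcus Salmonella > "${workdir}/taxon.txt"

# Hypothetical helper (not part of HiTaxon): flag blank lines, lines that
# contain whitespace (including Windows carriage returns), and duplicates.
check_genus_list() {
    awk '
        /^[[:space:]]*$/ { printf "line %d: blank\n", NR; bad = 1; next }
        /[[:space:]]/    { printf "line %d: contains whitespace\n", NR; bad = 1; next }
        seen[$0]++       { printf "line %d: duplicate %s\n", NR, $0; bad = 1 }
        END              { exit bad }
    ' "$1"
}

check_genus_list "${workdir}/taxon.txt" && echo "genus list looks clean"
```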
To create the best taxonomic classifier for a particular dataset when there are moderate time constraints, we recommend employing Kraken2-HiTaxon-Align, a hierarchical ensemble consisting of a Kraken2 classifier paired with a HiTaxon-constructed database and specialized genus to set of species BWA indices. In our publication, we show that Kraken2-HiTaxon-Align was the best-performing taxonomic classifier among all those tested.
In order to build this ensemble, execute the following steps in order:
1. Create a directory to store sequence data from RefSeq.
2. Within this directory, create a file called taxon.txt which lists all genera of interest, using the same format highlighted earlier in the documentation.
3. In the HiTaxon directory, create a config.file which defines a set of important parameters for HiTaxon, using the same format highlighted earlier in the documentation.
4. Download sequences from RefSeq that pertain to the set of genera listed in taxon.txt:
./HiTaxon.sh --collect
5. Cluster similar assemblies and coding sequences:
./HiTaxon.sh --process
6. Create a custom Kraken2 database composed of the non-redundant sequences acquired in Step 5:
./HiTaxon.sh --build
7. Create specialized BWA indices for each genus to set of species pair using the non-redundant sequences acquired in Step 5:
./HiTaxon.sh --align
Note: If execution is terminated while the index for a specific genus to set of species pair is being built, delete the corresponding .ann file and rerun the command.
8. Use the hierarchical ensemble to generate taxonomic predictions for short reads in input.fasta:
./HiTaxon.sh --evaluate -f path/to/input.fasta -o name_of_output_report -m Kraken2_BWA
This generates an output file named {name_of_output_report}_ensemble_bwa.csv
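The note on interrupted index construction can be turned into a quick check. The sketch below lists .ann files that have no matching .sa file, on the assumptions that a complete BWA index includes a .sa file next to its .ann file and that the index files sit directly inside the directory being inspected. The helper name `find_incomplete_indices` is hypothetical, and how HiTaxon names or nests its index files is not specified here, so treat the result as a hint and review it before deleting anything.

```shell
# Hypothetical helper (not part of HiTaxon): print every .ann file in a
# directory whose index prefix has no corresponding .sa file.
find_incomplete_indices() {
    local index_dir="$1" ann prefix
    for ann in "$index_dir"/*.ann; do
        [ -e "$ann" ] || continue          # directory contains no .ann files
        prefix="${ann%.ann}"               # strip the .ann suffix
        [ -e "${prefix}.sa" ] || echo "$ann"
    done
}

# Example on a scratch directory: one complete-looking and one incomplete index.
demo="$(mktemp -d)"
touch "${demo}/genusA.ann" "${demo}/genusA.sa" "${demo}/genusB.ann"
find_incomplete_indices "$demo"
```

In practice the argument would be the directory set as BWA_PATH in config.file.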
To create the best taxonomic classifier for a particular dataset when there are significant time constraints, we recommend employing Kraken2-HiTaxon-DB, which consists solely of Kraken2 with a HiTaxon-curated database. This approach has a slightly lower Matthews correlation coefficient (MCC) than Kraken2-HiTaxon-Align but requires much less time to generate predictions.
In order to build this classifier, execute the following steps in order:
1. Create a directory to store sequence data from RefSeq.
2. Within this directory, create a file called taxon.txt which lists all genera of interest, using the same format highlighted earlier in the documentation.
3. In the HiTaxon directory, create a config.file which defines a set of important parameters for HiTaxon, using the same format highlighted earlier in the documentation.
4. Download sequences from RefSeq pertaining to the set of genera listed in taxon.txt:
./HiTaxon.sh --collect
5. Cluster similar assemblies and coding sequences:
./HiTaxon.sh --process
6. Create a custom Kraken2 database composed of the non-redundant sequences acquired in Step 5:
./HiTaxon.sh --build
7. Use the classifier to generate taxonomic predictions for short reads in input.fasta:
./HiTaxon.sh --evaluate -f path/to/input.fasta -o name_of_output_report -m Kraken2
This generates an output file named {name_of_output_report}_lineage_kraken.csv
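Because each HiTaxon.sh stage depends on the one before it, the Kraken2-only workflow can be chained so that a failure stops the run. The following is a sketch: `run_step`, `kraken2_db_workflow`, and the `HITAXON_RUN` variable are hypothetical names introduced here. By default it only prints the commands; setting `HITAXON_RUN=1` executes them from the HiTaxon directory.

```shell
# Hypothetical wrapper (not part of HiTaxon): print a command, and execute it
# only when HITAXON_RUN=1. A failing command makes the wrapper return 1.
run_step() {
    echo "+ $*"
    if [ "${HITAXON_RUN:-0}" = "1" ]; then
        "$@" || { echo "step failed: $*" >&2; return 1; }
    fi
}

# Chain the four stages with && so a failed stage stops the workflow.
kraken2_db_workflow() {
    local input="$1" report="$2"
    run_step ./HiTaxon.sh --collect &&
    run_step ./HiTaxon.sh --process &&
    run_step ./HiTaxon.sh --build &&
    run_step ./HiTaxon.sh --evaluate -f "$input" -o "$report" -m Kraken2
}

# Dry run: prints the commands in order without executing them.
kraken2_db_workflow path/to/input.fasta name_of_output_report
```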
To create a hierarchical ensemble using an existing Kraken2 database with HiTaxon-curated specialized classifiers, execute the following steps in order:
1. Create a directory to store sequence data from RefSeq.
2. Within this directory, create a file called taxon.txt which lists all genera of interest, using the same format highlighted earlier in the documentation.
3. In the HiTaxon directory, create a config.file, using the same format highlighted earlier in the documentation. Make sure config.file references the path in which the pre-existing Kraken2 database is stored (KRAKEN_PATH), alongside the name of the database (KRAKEN_NAME).
4. Download sequences from RefSeq pertaining to the set of genera listed in taxon.txt:
./HiTaxon.sh --collect
5. Cluster similar assemblies and coding sequences:
./HiTaxon.sh --process
6. Option A: Create specialized BWA indices for each genus to set of species pair using the non-redundant sequences acquired in Step 5:
./HiTaxon.sh --align
6. Option B: Train specialized ML classifiers for each genus to set of species pair using the non-redundant sequences acquired in Step 5:
./HiTaxon.sh --train
Note: You will be prompted on the command line to choose whether HiTaxon runs in 1) data_creation or 2) model_creation mode; select option 2.
7. Option A: Use the hierarchical ensemble of Kraken2 and BWA to generate taxonomic predictions for short reads in input.fasta:
./HiTaxon.sh --evaluate -f path/to/input.fasta -o name_of_output_report -m Kraken2_BWA
7. Option B: Use the hierarchical ensemble of Kraken2 and ML classifiers to generate taxonomic predictions for short reads in input.fasta:
./HiTaxon.sh --evaluate -f path/to/input.fasta -o name_of_output_report -m Kraken2_ML
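Options A and B each pair a build flag with an evaluation mode: --align goes with Kraken2_BWA, and --train goes with Kraken2_ML. A small sketch makes the pairing explicit; the function name `ensemble_commands` and its `bwa`/`ml` keywords are hypothetical, and the function only prints the commands rather than running them.

```shell
# Hypothetical helper (not part of HiTaxon): print the build and evaluate
# commands that belong together for the chosen specialized classifier type.
ensemble_commands() {
    local kind="$1" input="$2" report="$3" build_flag mode
    case "$kind" in
        bwa) build_flag="--align"; mode="Kraken2_BWA" ;;
        ml)  build_flag="--train"; mode="Kraken2_ML" ;;
        *)   echo "unknown classifier type: $kind (expected bwa or ml)" >&2
             return 1 ;;
    esac
    echo "./HiTaxon.sh ${build_flag}"
    echo "./HiTaxon.sh --evaluate -f ${input} -o ${report} -m ${mode}"
}

ensemble_commands ml path/to/input.fasta name_of_output_report
```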
To train ML classifiers for HiTaxon, we have built a robust framework for training both multi-class classifiers (i.e., the genus encompasses multiple species) and binary classifiers (i.e., the genus encompasses a single species). Although we used FastText classifiers, we acknowledge that researchers might prefer to test different algorithms for species classification. Consequently, HiTaxon can create and process training data without requiring the user to also train ML models. To use this feature, execute the following steps in order:
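1. Create a directory to store sequence data from RefSeq.
2. Within this directory, create a file called taxon.txt which lists all genera of interest, using the same format highlighted earlier in the documentation.
3. In the HiTaxon directory, create a config.file, using the same format highlighted earlier in the documentation.
4. Download sequences from RefSeq pertaining to the set of genera listed in taxon.txt:
./HiTaxon.sh --collect
5. Cluster similar assemblies and coding sequences:
./HiTaxon.sh --process
6. Run the training step:
./HiTaxon.sh --train
You will be prompted to choose whether HiTaxon runs in 1) data_creation or 2) model_creation mode; select option 1. This generates a .txt file for every genus to set of species pair, for both multi-class and binary problems, which researchers can reformat for their ML algorithm of choice.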
Note: To change the k-mer size from the default of K = 13, edit the corresponding parameter in train.py, which can be found in scripts/train/.
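To make the role of K concrete, the sketch below splits a sequence into overlapping k-mers of length 13. It only illustrates what a k-mer is: the helper name `kmerize` is hypothetical, and the stride of 1 and the space-separated output are assumptions, not a description of what train.py does or of the format of the generated .txt files.

```shell
# Hypothetical helper (not part of HiTaxon): read one sequence per line from
# stdin and print its overlapping k-mers, space-separated (default k = 13).
# Sequences shorter than k produce no output.
kmerize() {
    awk -v k="${1:-13}" '{
        n = length($0) - k + 1                 # number of k-mers in this sequence
        line = ""
        for (i = 1; i <= n; i++)
            line = line (i > 1 ? " " : "") substr($0, i, k)
        if (n > 0) print line
    }'
}

# A 16-base sequence yields 16 - 13 + 1 = 4 k-mers.
echo "ACGTACGTACGTACGT" | kmerize 13
```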