Dennis-xyHuang / PhyloPlus

MIT License

PhyloPlus

Introduction

This program creates a phylogeny in which user-specified taxa are added to a publicly available, peer-reviewed molecular bacterial or archaeal phylogeny. The approach takes advantage of the comprehensive bacterial and archaeal molecular phylogenetic trees generated by the Genome Taxonomy Database (GTDB) and the taxonomic information curated by NCBI. Users can locate and insert their own customized set of bacterial or archaeal taxa within either of these reference phylogenies and output the resulting phylogenetic tree files. The output phylogeny can then be used in downstream microbial community analyses, for example, metagenomic taxonomic diversity analysis.

The manuscript is now published in mBio (open-access). This application can be run locally (described below) or run on our web server at https://phylo.jifsan.org.

If you find this application useful, please kindly cite:
Huang, X., Erickson, D. L., & Meng, J. (2023). PhyloPlus: a Universal Tool for Phylogenetic Interrogation of Metagenomic Communities. mBio, e0345522. Advance online publication. https://doi.org/10.1128/mbio.03455-22

Dependencies

Python 3.9.12

R 4.2.2

Usage

Download and Summarize NCBI Dump Files

This step will download NCBI dump files containing information needed for this application into the ./NCBI_dmp_files directory and process them into corresponding summary files. To perform this step, please run:

./phyloplus.sh -m download

The download and summarization of NCBI dump files MUST be done prior to any other processes.

Test Run

A test run is available after downloading and summarizing NCBI dump files by running:

./phyloplus.sh -m test -e <your@email.address>

The test run performs every step needed to generate a customized phylogeny from a sample bacterial input text file containing 100 bacterial NCBI taxonomy IDs (./sample_data/sample_taxIDs.txt). The reference directory for the test run is ./reference/bacterial/207, and all output files are written to ./sample_data/sample_output.

Generate Phylogeny

To generate a phylogeny from a user-provided list of taxonomy IDs, first create an input text file containing the NCBI taxonomy IDs of interest. The input should be a one-column plain text file with no header; see the sample input file ./sample_data/sample_taxIDs.txt for the format if needed.
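The input format can be prepared and checked with a few lines of Python. This is a minimal sketch, not part of PhyloPlus: the helper names `write_taxids` and `read_taxids` and the file name `my_taxIDs.txt` are hypothetical, and tolerating blank lines is an assumption about what the tool accepts.

```python
from pathlib import Path


def write_taxids(path, taxids):
    """Write taxonomy IDs as a one-column text file with no header."""
    Path(path).write_text("".join(f"{int(t)}\n" for t in taxids))


def read_taxids(path):
    """Read a one-column taxonomy ID file and reject non-numeric entries."""
    taxids = []
    lines = Path(path).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        value = line.strip()
        if not value:
            continue  # assumption: blank lines are harmless and can be skipped
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"line {line_no}: {value!r} is not a numeric taxonomy ID")
        taxids.append(int(value))
    return taxids


if __name__ == "__main__":
    # 562 = Escherichia coli, 1280 = Staphylococcus aureus, 1639 = Listeria monocytogenes
    write_taxids("my_taxIDs.txt", [562, 1280, 1639])
    print(read_taxids("my_taxIDs.txt"))
```

Running the same check on a hand-edited file before starting a build may catch a stray header row or species name early.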

With the input text file in place, generate the phylogeny by running:

./phyloplus.sh -m build -r <reference directory> -i <input file> -o <output directory> -e <email address> 
[-t <taxonomic rank> -t1 <threshold 1> -t2 <threshold 2> -t3 <threshold 3> -t4 <threshold 4>]

    <reference directory> =    Location of the reference directory. Choose one of the child directories in
                               ./reference.
    <input file>          =    Location of input taxonomy ID text file.
    <output directory>    =    Directory to write all generated outputs. Will create this directory if it
                               does not exist.
    <email address>       =    Set the email address per NCBI Entrez requirements.
    <taxonomic rank>      =    Taxonomic rank to display query taxa in the final output. Choose from
                               "species", "genus" or "family" (default: species).
    <threshold 1>         =    Threshold used to determine outlier tips if a query taxon is mapped at the
                               species level (t1 ≥ 0, default: 1).
    <threshold 2>         =    Threshold used to determine outlier tips if a query taxon is mapped at the
                               species group level (t2 ≥ 0, default: 2).
    <threshold 3>         =    Threshold used to determine outlier tips if a query taxon is mapped at the
                               genus level or above (t3 ≥ 0, default: 2).
    <threshold 4>         =    Threshold used to determine the outlier tip if only two reference tips
                               exist to locate a query taxon (0 < t4 < 1, default: 0.75).
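When builds are launched from a script, the option ranges listed above can be checked before phyloplus.sh is called. The following is a minimal sketch under stated assumptions: `build_command` is a hypothetical helper, not part of PhyloPlus, and it only assembles the documented flags with their documented defaults.

```python
import subprocess  # used only by the commented-out call at the bottom


def build_command(reference_dir, input_file, output_dir, email,
                  rank="species", t1=1, t2=2, t3=2, t4=0.75):
    """Assemble a phyloplus.sh build command after checking the documented ranges."""
    if rank not in ("species", "genus", "family"):
        raise ValueError("rank must be 'species', 'genus' or 'family'")
    if min(t1, t2, t3) < 0:
        raise ValueError("t1, t2 and t3 must be >= 0")
    if not 0 < t4 < 1:
        raise ValueError("t4 must satisfy 0 < t4 < 1")
    return [
        "./phyloplus.sh", "-m", "build",
        "-r", reference_dir, "-i", input_file,
        "-o", output_dir, "-e", email,
        "-t", rank,
        "-t1", str(t1), "-t2", str(t2), "-t3", str(t3), "-t4", str(t4),
    ]


if __name__ == "__main__":
    cmd = build_command("./reference/bacterial/207", "my_taxIDs.txt",
                        "my_output", "your@email.address", rank="genus")
    print(" ".join(cmd))
    # Uncomment to actually run the build from the repository root:
    # subprocess.run(cmd, check=True)
```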

The user-provided file can contain taxonomy IDs of mixed taxonomic ranks (i.e., some taxonomy IDs represent species, some represent genera, etc.). When a taxonomic rank is specified with the -t flag, lower-level taxonomy IDs in the input file are automatically converted and processed. For example, including taxonomy ID 562 (Escherichia coli) in the input and selecting genus as the taxonomic rank automatically converts the species-level ID into its genus-level ID (561). The reverse does not apply: a higher-level ID is never converted to a lower rank.
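The one-directional conversion can be illustrated with a toy lineage table. This is a sketch of the idea only, not the PhyloPlus implementation: `lift_to_rank` and `TOY_TAXONOMY` are hypothetical names, the table holds just three real NCBI entries, and real lineages contain additional intermediate ranks that this example ignores.

```python
# Ranks ordered from lowest to highest, matching the choices of the -t flag.
RANK_ORDER = ["species", "genus", "family"]

# Toy lineage table: taxonomy ID -> (rank, parent taxonomy ID or None).
TOY_TAXONOMY = {
    562: ("species", 561),  # Escherichia coli
    561: ("genus", 543),    # Escherichia
    543: ("family", None),  # Enterobacteriaceae (parent omitted in this toy table)
}


def lift_to_rank(taxid, target_rank, taxonomy=TOY_TAXONOMY):
    """Walk up the lineage until target_rank is reached.

    IDs already at or above the target rank are returned unchanged,
    so a genus is never turned into a species.
    """
    target = RANK_ORDER.index(target_rank)
    current = taxid
    while current is not None:
        rank, parent = taxonomy[current]
        if RANK_ORDER.index(rank) >= target:
            return current
        current = parent
    return taxid  # lineage ended before reaching the target rank


if __name__ == "__main__":
    print(lift_to_rank(562, "genus"))    # species lifted to its genus
    print(lift_to_rank(561, "species"))  # genus left as-is
```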

The email address is needed per NCBI Entrez requirements, in case some records cannot be found in the dump summary files and an Entrez search is needed to retrieve such information. See the NCBI Entrez usage guidelines for more details.

Output

All output files will be written to the user-specified output directory.

Supplementary Notes

Reference Phylogenies

The bacterial and archaeal reference files were downloaded from GTDB and slightly modified: node labels were removed from the tree files so that some R packages can handle them properly. These reference files are placed in the corresponding child directories of the ./reference directory.

Thresholds

The insertion node of a query taxon is inferred from the group of tips in the reference phylogeny that share the same taxonomy ID as the query taxon at a predetermined taxonomic rank. Thresholds are set to detect and remove potential outlier tips, both to avoid taxonomic misclassification and to better infer the location of the query taxon.

For groups containing more than two reference tips, outliers are defined as the tips whose average distance to the other group members exceeds the group mean plus N times the standard deviation, where N is the applicable threshold. Different thresholds are applied to query taxa mapped at different taxonomic levels (species, species group, and genus or above) and are specified by the flags -t1, -t2, and -t3, respectively.
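The mean-plus-N-standard-deviations rule can be sketched as follows. This is an illustration under stated assumptions, not the PhyloPlus code: `flag_outlier_tips` is a hypothetical name, the pairwise distances are toy values, and the use of the population standard deviation (rather than the sample one) is a guess.

```python
from statistics import mean, pstdev


def flag_outlier_tips(dist, n_sd):
    """Return tips whose average distance to the other group members
    exceeds the group mean plus n_sd standard deviations.

    dist is a symmetric nested dict: dist[a][b] is the distance between tips a and b.
    """
    tips = sorted(dist)
    if len(tips) <= 2:
        raise ValueError("this rule applies only to groups with more than two tips")
    # Average distance from each tip to every other tip in the group.
    avg = [mean(dist[t][u] for u in tips if u != t) for t in tips]
    # Assumption: population standard deviation of the per-tip averages.
    cutoff = mean(avg) + n_sd * pstdev(avg)
    return [t for t, a in zip(tips, avg) if a > cutoff]


if __name__ == "__main__":
    # Four tightly clustered tips plus one distant tip "E".
    close = ["A", "B", "C", "D"]
    toy = {a: {b: 0.1 for b in close if b != a} for a in close}
    for a in close:
        toy[a]["E"] = 1.0
    toy["E"] = {a: 1.0 for a in close}
    print(flag_outlier_tips(toy, n_sd=1))  # ['E']
```

In this toy group a larger N raises the cutoff, so the same tip is no longer flagged at `n_sd=3`. This is why the stricter default (1) is paired with species-level mapping and the more tolerant default (2) with higher ranks.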

For groups containing only two reference tips, using the standard deviation to detect potential outliers is impractical. Instead, the fraction (distance to the MRCA node) / (distance to the base root) is computed for the more distant tip and indicates whether the two-member group contains an outlier. The threshold for this fraction is specified by the flag -t4. The outlier is then determined by comparing the taxonomic lineages and distances of the two reference tips with those of their close neighboring tips.
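The two-tip fraction can be sketched on a toy tree stored as a child-to-parent map. Several points here are assumptions rather than documented behavior: the names `distances_to_ancestors` and `mrca_fraction` are hypothetical, "distance to the base root" is read as the tip-to-root path length, the "more distant tip" is taken to be the one farther from the MRCA, and a fraction above -t4 is taken to signal an outlier. The follow-up step that compares lineages with neighboring tips is not shown.

```python
def distances_to_ancestors(tip, parent):
    """Return [(node, distance from tip)] for the tip and each ancestor up to the root."""
    path, node, dist = [(tip, 0.0)], tip, 0.0
    while node in parent:
        node, length = parent[node]
        dist += length
        path.append((node, dist))
    return path


def mrca_fraction(tip_a, tip_b, parent):
    """Return (more distant tip, its distance to the MRCA / its distance to the root)."""
    path_a = distances_to_ancestors(tip_a, parent)
    path_b = distances_to_ancestors(tip_b, parent)
    seen_a = dict(path_a)
    # The first ancestor of tip_b that is also an ancestor of tip_a is the MRCA.
    mrca, b_to_mrca = next((n, d) for n, d in path_b if n in seen_a)
    a_to_mrca = seen_a[mrca]
    if a_to_mrca >= b_to_mrca:
        return tip_a, a_to_mrca / path_a[-1][1]
    return tip_b, b_to_mrca / path_b[-1][1]


if __name__ == "__main__":
    # Toy tree: child -> (parent, branch length). "X" is the MRCA of both tips.
    parent = {"tipA": ("X", 0.5), "tipB": ("X", 3.5), "X": ("root", 0.5)}
    tip, fraction = mrca_fraction("tipA", "tipB", parent)
    t4 = 0.75  # documented default for -t4
    print(tip, fraction, fraction > t4)
```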
