
preHGT: generating preliminary horizontal gene transfer (HGT) candidates using compositional and phylogenetic implicit approaches


preHGT is a pipeline that quickly pre-screens genomes across the tree of life for horizontal gene transfer (HGT).

The preHGT pipeline wraps existing parametric and implicit phylogenetic methods for HGT detection and reports multiple metrics about input genome contamination. It quickly produces a "good-enough" candidate list of genes that researchers can further investigate with more stringent HGT detection methods, different data modalities, or wet lab experimentation.

For more scientific details about preHGT, see this pub.

We're not picky about capitalization; choose whatever feels right to you (prehgt, preHGT) and feel free to capitalize at the beginning of a sentence if it brings you joy (PreHGT).

Quick Start

The pipeline is implemented as either a Snakemake or a Nextflow workflow, with environments managed by conda in both cases.

Installing conda

Software and environment management is performed with conda, regardless of whether you run the pipeline with Snakemake or Nextflow. You can find operating system-specific instructions for installing miniconda here and for installing mamba here.

Obtaining databases

If you choose to run the workflow with Nextflow, you will provide the databases as command line parameters at run time. If you choose to run the workflow with Snakemake, the databases need to be located in specific folders. The download code below places the databases at the relative paths where the Snakemake workflow will look for them within the prehgt workflow directory (see Running the pipeline with Snakemake below for instructions on how to obtain the directory locally).

mkdir -p inputs/ inputs/hmms/
curl -JLo inputs/nr_rep_seq.fasta.gz https://osf.io/gqxva/download # 59 GB BLASTp database
curl -JLo inputs/nr_cluster_taxid_formatted_final.sqlite https://osf.io/5qj7e/download # 66 GB BLASTp taxonomy database
curl -JLo inputs/hmms/all_hmms.hmm https://osf.io/f92qd/download # 3.2 GB HMM database

If you're using Nextflow, you will also need the KofamScan databases (the Snakemake pipeline downloads these automatically):

mkdir -p inputs/kofamscandb
curl -JLo inputs/kofamscandb/ko_list.gz ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz && gunzip inputs/kofamscandb/ko_list.gz
curl -JLo inputs/kofamscandb/profiles.tar.gz ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz && tar xf inputs/kofamscandb/profiles.tar.gz -C inputs/kofamscandb/

Input sample sheet

The input sample sheet is the same for both workflows. It's a TSV file with the genus or genera you would like to run through the preHGT pipeline. An example is included here and reproduced below:

genus
Bigelowiella
Schizosaccharomyces

Running the pipeline with Nextflow

If you would like to run the pipeline with Nextflow, begin by installing Nextflow.

mamba create -n prehgtnf -c conda-forge -c bioconda nextflow=22.10.6
conda activate prehgtnf

Then, you can test the workflow with a small data set using:

nextflow run Arcadia-Science/prehgt -profile test,conda --outdir <OUTDIR>

To run the full pipeline, including supplying paths to databases, run:

nextflow run Arcadia-Science/prehgt -profile conda \
    --outdir <OUTDIR> \
    --input input.tsv \
    --blast_db path_to_blast_db \
    --blast_db_tax path_to_blast_db_taxonomy_file \
    --ko_list path_to_ko_list_file \
    --ko_profiles path_to_ko_profiles_folder \
    --hmm_db path_to_hmm_db

Running the pipeline with Snakemake

If you would like to run the pipeline with Snakemake, begin by cloning this repository and cd into it:

git clone https://github.com/Arcadia-Science/prehgt.git
cd prehgt

Then, install Snakemake and the other runtime dependencies using the environment file included in the repository:

mamba env create -n prehgt --file environment.yml
conda activate prehgt

The input files for the Snakemake workflow are not parameterized, so you need to make sure the input sample sheet and databases are in the correct locations with the correct file names. See the database download instructions above to make sure you placed your databases in the correct locations. Note that the KofamScan databases are downloaded by the pipeline itself, so you don't need to worry about those.

To start the pipeline, run:

snakemake --use-conda -j 8 --rerun-incomplete

where:

  • --use-conda tells Snakemake to use conda to manage the software environment for each step.
  • -j 8 allows up to 8 jobs to run in parallel; adjust this number to the resources available on your machine.
  • --rerun-incomplete re-runs any jobs whose outputs are incomplete, for example after an interrupted run.

Additional documentation

Overview of the pipeline

The preHGT pipeline combines compositional scans, pangenome inference, and BLAST-based searches. For the interpretation of the BLAST results, we took advantage of a rich literature of BLAST-based HGT predictor indices and re-implemented many of them, with the goal of providing as much information as possible from one BLAST run with a single tool. To reduce overall run times, the pipeline employs clustering heuristics, including construction of a genus-level pangenome to reduce query size and searches against a clustered database to reduce database size. To reduce false positives, we included multiple screens for contamination based on similarity to database matches, position of the gene in the contiguous sequence, and homolog presence in closely related genomes.

Below, we provide an overview of what each step of the pipeline does.

  1. Retrieving gene sequences and annotation files. The pipeline begins with the user providing a genus or genera of interest in a TSV file. The pipeline then scans GenBank and RefSeq for matching genomes and downloads relevant files. When a genome is available in both GenBank and RefSeq, only the RefSeq version is retained. This step also parses the input files in preparation for future steps.
    • ncbi-genome-download: Loops through GenBank and RefSeq to find and download all genomes with gene models (*_cds_from_genomic.fna.gz) and genome annotation files (*_genomic.gff.gz).
    • delete_gca_files.sh: If a genome is in both GenBank and RefSeq, this script deletes the GenBank version and only keeps the RefSeq version.
    • file parsing: All *_cds_from_genomic.fna.gz files are combined into a single file, and a CSV file that records the total number of downloaded genomes per genus is generated.
  2. Building a pangenome. For each genus, the pipeline then combines genes into a pseudopangenome, which reduces the number of genes that are investigated and provides metadata about each gene.
    • mmseqs easy-cluster: Nucleotide sequences are clustered at 90% length and identity.
    • EMBOSS transeq: Clustered nucleotide sequences are translated into amino acid sequences.
  3. Detecting HGT candidates. Using the genes in the pangenome, the pipeline applies two categories of approaches to identify HGT candidates.
    • Compositional scan. The first approach uses relative amino acid usage to detect proteins with outlying composition.
      • EMBOSS pepstats: Measures relative amino acid usage (RAAU) for each gene.
      • compositional_scans_to_hgt_candidates.R: Produces a list of genes that have outlying RAAU (a minimal sketch of this procedure appears after this list). It starts by parsing the pepstats results with the function parse_pepstats_to_amino_acid_frequencies(), then produces a distance matrix with the base R function dist() and hierarchically clusters the distance matrix with fastcluster’s hclust(). It detects outliers by cutting the resultant tree at the maximum tree height divided by 1.5 and retaining any cluster that contains fewer than 0.1% of the genes in the pangenome.
    • BLASTp scan. Uses BLASTp to identify proteins with distant homologs.
      • DIAMOND blastp: All genes in the pseudopangenome are BLASTed against a clustered version of NCBI’s nr database (90% length, 90% identity). The clustered database makes the BLASTp step faster and ensures results contain a variety of taxonomic lineages in cases where distant homology exists.
      • blastp_add_taxonomy_info.R: Adds lineage information to the BLASTp search using dplyr, dbplyr, and RSQLite.
      • blastp_to_hgt_candidates_kingdom.R: Calculates several kingdom-level HGT indices (a sketch of these calculations appears after this list):
        • Alien index: A score of HGT probability based on the BLAST hit e-value for the top hit in the kingdom-level acceptor versus donor hits. Because it is based on e-values, it can be biased by BLAST database size.
        • HGT score: A score of HGT probability based on the BLAST hit corrected bitscore for the top hit in the kingdom-level acceptor versus donor hits.
        • Donor distribution index: An index that captures the specificity of an HGT candidate gene within the donor groups by estimating how frequently the candidate occurs across all donor groups.
        • (Normalized) entropy: A measure of the uncertainty or randomness of a set of probabilities. Since seven kingdoms are investigated, the maximum entropy is log2(7). Entropy is normalized to the 0-1 range by dividing by this maximum potential value.
        • Gini coefficient: A measure of inequality among the values of a frequency distribution. A Gini coefficient close to 0 indicates that the qseqid is uniformly distributed across all kingdoms.
        • Aggregate hit score: A score of HGT likelihood calculated by subtracting the sum of normalized corrected bitscores in the donor group from the sum of normalized corrected bitscores in the acceptor group.
      • blastp_to_hgt_candidates_subkingdom.R: Calculates one additional index:
        • Transfer index: A score of HGT probability calculated by comparing the corrected bitscores and taxonomic distances of all BLAST hits against the best BLAST hit to the database.
  4. Annotation. The pipeline then annotates the HGT candidates.
    • combine_and_parse_gff_per_genus.R: All downloaded *_genomic.gff.gz are combined and parsed to pull out annotation information for each coding domain sequence.
    • KofamScan: KofamScan uses hidden Markov models (HMMs) to perform KEGG ortholog annotation.
    • hmmscan: We built a custom HMM database to scan for annotations of interest using HMMER3 hmmscan. The HMM database currently contains VOGs from VOGDB and biosynthetic genes, and can be extended in the future to meet user annotation interests.
  5. Reporting. The last step combines all of the information the pipeline has produced and outputs the results in a TSV file using the script combine_results.R. The results include the GenBank protein identifier for the HGT candidate, BLAST and relative amino acid usage scores, pangenome information, gene and ortholog annotations, and contextualizing information about the gene, such as its position in the contiguous sequence. This script also reports whether an HGT candidate is likely contamination by integrating information about the number of times the gene is observed in the pangenome, the similarity of the gene to other hits in the database, and the length of the contiguous sequence the gene is located on in the genome.
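
To make the compositional scan in step 3 concrete, here is a minimal R sketch of the outlier-detection logic. It is not the pipeline's exact code (see compositional_scans_to_hgt_candidates.R for that): the raau input name is illustrative, and cutting the tree at the maximum merge height divided by 1.5 is our reading of the cutoff described above.

library(fastcluster) # faster drop-in replacement for stats::hclust()

# raau: numeric matrix of relative amino acid usage, one row per gene in
# the pseudopangenome, one column per amino acid (illustrative name).
find_raau_outliers <- function(raau) {
  d <- dist(raau)                  # pairwise distances between genes
  tree <- fastcluster::hclust(d)   # hierarchical clustering

  # Cut the tree at the maximum merge height divided by 1.5 (our assumed
  # interpretation of the cutoff) and tabulate cluster sizes.
  clusters <- cutree(tree, h = max(tree$height) / 1.5)
  sizes <- table(clusters)

  # Genes in clusters holding fewer than 0.1% of the pangenome's genes
  # are reported as compositional outliers.
  small <- as.integer(names(sizes)[sizes < 0.001 * nrow(raau)])
  names(clusters)[clusters %in% small]
}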
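
The kingdom-level BLASTp indices can likewise be sketched in a few lines of R. The input layout below (one data frame of hits per query gene with kingdom, evalue, and corrected_bitscore columns) and the 1e-200 pseudocount are assumptions for illustration; the alien index and HGT score follow common formulations from the literature, and the aggregate hit score follows the description above. See blastp_to_hgt_candidates_kingdom.R for the pipeline's actual implementation.

# hits: BLAST hits for a single query gene (assumed columns: kingdom,
# evalue, corrected_bitscore). acceptor: the kingdom of the query's own
# genome. all_kingdoms: the seven kingdoms scanned, so that kingdoms
# with zero hits still count toward the uniformity measures.
# Assumes at least one hit in both the acceptor and donor groups.
hgt_indices <- function(hits, acceptor, all_kingdoms) {
  in_grp  <- hits[hits$kingdom == acceptor, ]
  out_grp <- hits[hits$kingdom != acceptor, ]

  # Alien index: difference in log e-values between the best acceptor
  # and best donor hits; 1e-200 avoids log(0). AI > 0 means the best
  # donor hit is stronger than the best acceptor hit, consistent with HGT.
  ai <- log(min(in_grp$evalue) + 1e-200) - log(min(out_grp$evalue) + 1e-200)

  # HGT score: best donor bitscore minus best acceptor bitscore;
  # positive values again favor HGT.
  hgt <- max(out_grp$corrected_bitscore) - max(in_grp$corrected_bitscore)

  # Normalized entropy of hit counts across all kingdoms (0 * log2(0)
  # is treated as 0 by dropping empty kingdoms from the sum).
  counts <- table(factor(hits$kingdom, levels = all_kingdoms))
  p <- counts / sum(counts)
  p <- p[p > 0]
  entropy <- -sum(p * log2(p)) / log2(length(all_kingdoms))

  # Gini coefficient of the per-kingdom hit counts (0 = uniform).
  x <- as.numeric(counts)
  gini <- sum(abs(outer(x, x, "-"))) / (2 * length(x)^2 * mean(x))

  # Aggregate hit score: summed normalized corrected bitscores,
  # acceptor group minus donor group, as described above.
  nb <- hits$corrected_bitscore / max(hits$corrected_bitscore)
  ahs <- sum(nb[hits$kingdom == acceptor]) - sum(nb[hits$kingdom != acceptor])

  data.frame(alien_index = ai, hgt_score = hgt,
             normalized_entropy = entropy, gini = gini,
             aggregate_hit_score = ahs)
}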

Below, we present a schematic overview of the pipeline.

Outputs

The main output is a TSV file summarizing the results of each part of the pipeline. Below we provide a description of each output column.

TSV summary file

Overview columns

Other outputs

The pipeline also produces a number of intermediate files, including the original *_cds_from_genomic.fna.gz files, full BLAST results, the HGT candidate gene sequences, and the pangenome gene cluster membership manifest. The exact path of these files will depend on whether the pipeline is executed in Snakemake or Nextflow.

Contributing

If you have suggestions or encounter any problems with prehgt, please file an issue. You are also welcome to submit a pull request if you feel like diving into the code. See this guide for how we recognize feedback and contributions to our code.

Citations

If you use prehgt for your analysis, please cite it using the following DOI: 10.57844/arcadia-jfbp-7p11.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS file.