AnantharamanLab / vRhyme

Binning Virus Genomes from Metagenomes
GNU General Public License v3.0

vRhyme

Binning Virus Genomes from Metagenomes

March 2022   
Kristopher Kieft  
kieft@wisc.edu  
Anantharaman Lab  
University of Wisconsin-Madison  

Current Version

vRhyme v1.1.0

Citation

If you find vRhyme useful, please consider citing our manuscript in Nucleic Acids Research:
Kieft, K., Adams, A., Salamzade, R., Kalan, L., & Anantharaman, K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Research, 2022.


Table of Contents:

  1. Updates
    • v1.1.0
    • v1.0.0
  2. Program Description
  3. Installation
  4. Requirements
    • Program Dependencies
    • Python3 Dependencies
  5. Running vRhyme
    • Test examples
    • Quick run examples
  6. Output Explanations
    • Useful outputs
    • Other outputs
  7. Interpreting vRhyme bins/vMAGs
  8. vRhyme Files and Folders
  9. vRhyme Flag Descriptions
    • Flag compatibility
    • Commonly used flags
    • Other flags
  10. Contact

Updates for v1.1.0 (March 2022):

Updates for v1.0.0 (December 2021):


Program Description

vRhyme Description

vRhyme is a multi-functional tool for binning virus genomes from metagenomes. vRhyme functions by utilizing coverage variance comparisons and supervised machine learning classification of sequence features to construct viral metagenome-assembled genomes (vMAGs).

IMPORTANT NOTE: vRhyme is built to run on viral sequences/scaffolds. A typical workflow is to predict viruses from a metagenome (e.g., with VIBRANT or VirSorter) and then use those predictions as input to vRhyme. vRhyme can take an entire metagenome as input, but the performance for a whole metagenome has not been fully evaluated. vRhyme is not meant to bin microbes.

Why "vRhyme"? The similarity in sequence features between two scaffolds, such as tetranucleotide frequencies, codon usage or GC content, can be used to identify fragments of the same genome. It's almost like rhyming sequences together to create pairs that sound similar, at least metaphorically at the nucleotide level. Coverage variance helps to separate scaffolds that sound the same but are actually different genomes.
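To make the "rhyming" idea concrete, here is a minimal Python sketch that computes two of the sequence features named above (GC content and tetranucleotide frequencies) and a simple distance between two scaffolds. It is an illustration only and does not reproduce vRhyme's feature extraction or its machine learning model. The function names and the choice of Euclidean distance are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / total


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 tetranucleotides.

    Overlapping windows are counted; windows that contain an
    ambiguous base (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


def euclidean_distance(freq_a, freq_b):
    """Illustrative distance between two frequency profiles (an assumption,
    not the metric vRhyme uses)."""
    return sum((freq_a[k] - freq_b[k]) ** 2 for k in freq_a) ** 0.5
```

Two scaffolds from the same genome would be expected to have similar profiles and therefore a small distance, but similar profiles alone do not prove shared origin, which is why coverage is used as a second signal.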

vRhyme Features


Installation

GitHub, pip, and conda

  1. git clone https://github.com/AnantharamanLab/vRhyme
  2. cd vRhyme
  3. gunzip vRhyme/models/vRhyme_machine_model_ET.sav.gz ← NOTE: vRhyme is a subdirectory of the parent vRhyme
  4. (optional) create a conda environment (see examples below)
  5. (optional) activate the conda environment if you made one
  6. pip install . ← NOTE: don't forget the dot (pip install [dot])

Installing with pip is optional but suggested. Using pip will collect Python dependencies and add vRhyme to your system PATH. Note that vRhyme.egg-info and build should be created after the pip install. Without pip, vRhyme can still be executed directly from the git clone; just make sure the scripts have executable permissions (cd vRhyme/; chmod +x vRhyme scripts/*.py aux/*.py). The conda environment is also optional but can be useful for downloading and managing program dependencies.

Example Conda Environments

Test the Installation

Test and validate the installed dependencies for vRhyme. In these tests you're looking for Success statements. All Python dependencies must be Success; please update them if prompted. Both machine learning models must be Success. For program dependencies, only Mmseqs is required to be Success; the others are optional depending on vRhyme usage. If you use any coverage input besides -c, then Samtools must be Success. If you skip the vRhyme -p or -g flags, then Prodigal must be Success. If you plan to use vRhyme's dereplication function (see --method), then Mash and Nucmer must be Success. If you input reads (-r/-v/-u), then Bowtie2 and/or BWA (see --aligner) must be Success.

Print the main vRhyme help page:
vRhyme -h
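If you want a quick look at which program dependencies are visible on your PATH independently of vRhyme's own tests, a minimal Python sketch follows. It is not part of vRhyme. The executable names (mmseqs, samtools, prodigal, mash, nucmer, bowtie2, bwa) are assumptions based on each tool's usual command name, and the usage notes restate the rules given above.

```python
import shutil

# Executable name -> when it is needed. Names are assumed, not taken from vRhyme.
PROGRAMS = {
    "mmseqs": "always required",
    "samtools": "coverage input other than -c",
    "prodigal": "when -p or -g is not provided",
    "mash": "dereplication (--method)",
    "nucmer": "dereplication (--method)",
    "bowtie2": "reads input (-r/-v/-u), depending on --aligner",
    "bwa": "reads input (-r/-v/-u), depending on --aligner",
}


def check_programs(programs, which=shutil.which):
    """Return {name: full path, or None if the program is not on PATH}."""
    return {name: which(name) for name in programs}


if __name__ == "__main__":
    for name, path in check_programs(PROGRAMS).items():
        status = "found" if path else "not found"
        print(f"{name}\t{status}\t{PROGRAMS[name]}")
```

This only checks that an executable with that name exists; it does not check versions or that the program runs correctly.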


Requirements

Program Dependencies

Please ensure the following programs are installed and in your machine's PATH. Note: most downloads will automatically place these programs in your PATH. See each program's source website for installation guides.

Required
  1. Python3 (version >= 3.6)
  2. Mmseqs2
  3. Samtools

Optional (depends on usage)
  4. Prodigal
  5. Mash
  6. Nucmer
  7. Bowtie2
  8. BWA
Python3 Dependencies

There are several Python3 dependencies that must be installed as well. You may already have most of these installed. See each package's source website for installation guides. Versions are important.

  1. Pandas (version >= 1.0.0)
  2. Numpy (version >= 1.17.0)
  3. Scikit-learn (version >= 0.23.0)
  4. Numba (version >= 0.50.0)
  5. PySam (version >= 0.15.0)
  6. NetworkX (version >= 2.0)
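Because versions matter, it can help to compare what is installed against the minimums listed above. The following is a minimal sketch, not a vRhyme utility. It assumes Python >= 3.8 (for importlib.metadata) and the usual PyPI distribution names, and it uses a deliberately simple comparison that looks only at the leading numeric part of each version component.

```python
import re
from importlib import metadata  # available in Python >= 3.8

# Minimum versions copied from the list above; keys are assumed PyPI names.
MINIMUM_VERSIONS = {
    "pandas": "1.0.0",
    "numpy": "1.17.0",
    "scikit-learn": "0.23.0",
    "numba": "0.50.0",
    "pysam": "0.15.0",
    "networkx": "2.0",
}


def version_tuple(version):
    """Leading numeric components only, e.g. '1.17.0rc1' -> (1, 17, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if installed >= minimum, padding the shorter version with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_packages(minimums):
    """Return {name: (installed version or None, meets minimum?)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (installed, meets_minimum(installed, minimum))
    return report
```

Calling check_packages(MINIMUM_VERSIONS) returns one entry per package. Pre-release suffixes are ignored by this simplified comparison, so treat a borderline result with caution.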

Running vRhyme

Test run examples

Test out vRhyme on the provided example dataset. NOTE: If you choose to not install with pip, the vRhyme executable is within the vRhyme/ subdirectory.

cd examples/

minimal coverage table input example

vRhyme -i example_scaffolds.fasta -c example_coverage_values.tsv -t 1

full coverage table input example

vRhyme -i example_scaffolds.fasta -o vRhyme_example_results_coverage-table/ -c example_coverage_values.tsv -p example_scaffolds.prodigal.faa -g example_scaffolds.prodigal.ffn -t 2

Quick run examples

minimum input example with bam files

vRhyme -i fasta -b bam_folder/*.bam

minimum input example with a coverage file

vRhyme -i fasta -c coverage_file.tsv

full BAM input example

vRhyme -i fasta -g genes -p proteins -b bam_folder/*.bam -t threads -o output_folder/

reads input example with dereplication

vRhyme -i fasta -g genes -p proteins -r paired_reads_folder/*.fastq -t threads -o output_folder --method longest

only use dereplicate function

vRhyme -i input_fasta -t threads -o output_folder/ --derep_only --method longest
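When vRhyme is launched from a script rather than an interactive shell, note that wildcards such as bam_folder/*.bam are expanded by the shell, not by vRhyme. The sketch below shows one way to assemble the BAM input example from Python, expanding the pattern explicitly. The helper names are hypothetical; only the -i, -b, -t and -o flags come from the examples above.

```python
import glob
import subprocess


def build_vrhyme_command(fasta, bam_pattern, output_dir, threads=1):
    """Build an argument list equivalent to:
    vRhyme -i fasta -b bam_folder/*.bam -t threads -o output_folder/
    """
    # subprocess does not expand wildcards, so expand the pattern here.
    bam_files = sorted(glob.glob(bam_pattern))
    if not bam_files:
        raise FileNotFoundError(f"no BAM files match {bam_pattern!r}")
    return ["vRhyme", "-i", fasta, "-b", *bam_files,
            "-t", str(threads), "-o", output_dir]


def run_vrhyme(command):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

A typical call would be run_vrhyme(build_vrhyme_command("scaffolds.fasta", "bam_folder/*.bam", "output_folder/", threads=4)), assuming vRhyme is on your PATH.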


Output Explanations

Useful outputs

Other outputs

Hierarchy

> main output folder (-o)
    - log_vRhyme_(-i).log
    - log_vRhyme_paired_reads.log *
    - (-i).prodigal.faa *
    - (-i).prodigal.ffn *
    - (-i).circular.tsv
    - vRhyme_best_bins.#.membership.tsv
    - vRhyme_best_bins.#.summary.tsv
    - vRhyme_machine_distances.tsv
    > vRhyme_coverage_files *
        - (sample).coverage.tsv
        - vRhyme_coverage_values.tsv
        - vRhyme_names.txt
    > vRhyme_alternate_bins
        - #.membership.tsv
        - #.summary.tsv
        - vRhyme_bin_scoring.tsv
    > vRhyme_best_bins_fasta
        - vRhyme_bin_#.faa
        - vRhyme_bin_#.fasta
        - vRhyme_bin_#.ffn
    > vRhyme_bam_files *
        - (sample).sorted.bam
        - (sample).bam *
        - (sample).bam.bai *
    > vRhyme_sam_files *
        - (sample).sam
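As a quick sanity check on a finished run, the hierarchy above can be walked to count how many scaffolds ended up in each bin. This is a minimal sketch, not a vRhyme script. It assumes that # in vRhyme_bin_#.fasta stands for the bin number and that each scaffold is one FASTA record; it does not parse the membership or summary tables, whose columns are not described here.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_bins(output_dir):
    """Return {bin FASTA file name: number of scaffolds} for the best bins."""
    fasta_dir = Path(output_dir) / "vRhyme_best_bins_fasta"
    return {
        path.name: count_fasta_records(path)
        for path in sorted(fasta_dir.glob("vRhyme_bin_*.fasta"))
    }
```

For example, summarize_bins("vRhyme_example_results_coverage-table/") would report the bins produced by the full coverage table test run, if that run has completed.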

Interpreting vRhyme bins/vMAGs

The following bullet points are guidelines on interpreting binning results from vRhyme. Please note that this list is not exhaustive. For examples and data, please see analyses done in the vRhyme publication.


vRhyme files and folders


vRhyme Flags

Flag and input compatibility

Flag explanations

Commonly Used
(typical usage inputs and options)
Other Inputs
(mostly inputs besides -b)
Edit Outputs
(these typically do not need to be modified and do not affect binning results)
Read Alignment
(select read map software or modify alignment filtering)
Bin Filters
(these typically do not need to be modified)
Dereplication
(options to modify when using dereplication function)

Contact

Please contact Kristopher Kieft (kieft@wisc.edu or GitHub Issues) with any questions, concerns or comments.

Thank you for using vRhyme!

______________________________________________________________________

             ## ## ## ##                                              
             ##       ##  ##      ##     ##    ## ## ##     # ## ##   
##       ##  ##       ##  ##       ##    ##  ##   ##   ##  ##      #  
 ##     ##   ##     ##    ##         ## ##   ##   ##   ##  ## ## ##   
  ##   ##    ## ####      ## ## ##     ##    ##   ##   ##  ##         
   ## ##     ##   ##      ##    ##    ##     ##   ##   ##  ##         
    ###      ##     ##    ##    ##   ##      ##   ##   ##   ## ## ##  
______________________________________________________________________

Copyright

vRhyme Copyright (C) 2022 Kristopher Kieft

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.