March 2022
Kristopher Kieft
kieft@wisc.edu
Anantharaman Lab
University of Wisconsin-Madison
vRhyme v1.1.0
If you find vRhyme useful please consider citing our manuscript on Nucleic Acids Research:
Kieft, K., Adams, A., Salamzade, R., Kalan, L., & Anantharaman, K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Research, 2022.
vRhyme is a multi-functional tool for binning virus genomes from metagenomes. vRhyme functions by utilizing coverage variance comparisons and supervised machine learning classification of sequence features to construct viral metagenome-assembled genomes (vMAGs).
IMPORTANT NOTE: vRhyme is built to run on viral sequences/scaffolds. A typical workflow is to predict viruses from a metagenome (e.g., with VIBRANT or VirSorter) and then use those predictions as input to vRhyme. vRhyme can take an entire metagenome as input, but the performance for a whole metagenome has not been fully evaluated. vRhyme is not meant to bin microbes.
Why "vRhyme"? The similarity in sequence features between two scaffolds can be used to identify fragments of the same genome, such as tetranucleotide frequencies, codon usage or GC content. It's almost like rhyming sequences together to create pairs that sound similar, at least metaphorically at the nucleotide level. Coverage variance helps to separate scaffolds that sound the same but are actually different genomes.
git clone https://github.com/AnantharamanLab/vRhyme
cd vRhyme
gunzip vRhyme/models/vRhyme_machine_model_ET.sav.gz
← NOTE: vRhyme
is a subdirectory of the parent vRhyme
pip install .
← NOTE: don't forget the dot (pip install [dot]
) Installing with pip is optional but suggested. Using pip will collect Python dependencies and add vRhyme to your system PATH. Note that vRhyme.egg-info
and build
should be created after the pip install. Without pip, vRhyme can still be executed directly from the git clone, just ensure executable permissions (cd vRhyme/; chmod +x vRhyme scripts/*.py aux/*.py
). The conda environment is also optional but can be useful for downloading and managing program dependencies.
with pip install
conda create -c bioconda -n vRhyme python=3 samtools mash mummer mmseqs2 prodigal bowtie2 bwa
conda activate vRhyme
directly from git clone
conda create -c bioconda -n vRhyme python=3 networkx pandas numpy numba scikit-learn pysam samtools mash mummer mmseqs2 prodigal bowtie2 bwa
conda activate vRhyme
Test and validate the installed dependencies for vRhyme. In these tests you're looking for Success statements. All Python dependencies must be Success. Please update them if prompted. Both machine learning models must be Success. For program dependencies, only Mmseqs is required to be Success, the others are optional depending on vRhyme usage. If you use any coverage input beside -c
then Samtools is required to be Success. If you skip vRhyme -p
or -g
flags then Prodigal must be Success. If you plan to use vRhyme's dereplication function (see --method
) then Mash and Nucmer must be Success. If you input reads (-r/-v/-u
) then Bowtie2 and/or BWA (see --aligner
) must be Success.
with pip install
test_vRhyme.py
directly from git clone
aux/test_vRhyme.py
Print the main vRhyme help page:
vRhyme -h
Please ensure the following programs are installed and in your machine's PATH. Note: most downloads will automatically place these programs in your PATH. See each program's source website for installation guides.
There are several Python3 dependencies that must be installed as well. You may already have most of these installed. See each package's source website for installation guides. Versions are important.
Test out vRhyme on the provided example dataset. NOTE: If you choose to not install with pip, the vRhyme
executable is within the vRhyme/
subdirectory.
cd examples/
vRhyme -i example_scaffolds.fasta -c example_coverage_values.tsv -t 1
vRhyme -i example_scaffolds.fasta -o vRhyme_example_results_coverage-table/ -c example_coverage_values.tsv -p example_scaffolds.prodigal.faa -g example_scaffolds.prodigal.ffn -t 2
vRhyme -i fasta -b bam_folder/*.bam
vRhyme -i fasta -c coverage_file.tsv
vRhyme -i fasta -g genes -p proteins -b bam_folder/*.bam -t threads -o output_folder/
vRhyme -i fasta -g genes -p proteins -r paired_reads_folder/*.fastq -t threads -o output_folder --method longest
vRhyme -i input_fasta -t threads -o output_folder/ --derep_only --method longest
-r
. Exists with -r
. -p/-g
.aux/flag_circular.py
performs the same function and allows for modified parameters. -c
. Does not exist with -c
. -r/-u/-v/-s
. Only sorted BAM files will exist unless --keep_bam
.-r/-u/-v
and --keep_sam
.> main output folder (-o)
- log_vRhyme_(-i).log
- log_vRhyme_paired_reads.log *
- (-i).prodigal.faa *
- (-i).prodigal.ffn *
- (-i).circular.tsv
- vRhyme_best_bins.#.membership.tsv
- vRhyme_best_bins.#.summary.tsv
- vRhyme_machine_distances.tsv
> vRhyme_coverage_files *
- (sample).coverage.tsv
- vRhyme_coverage_values.tsv
- vRhyme_names.txt
> vRhyme_alternate_bins
- #.membership.tsv
- #.summary.tsv
- vRhyme_bin_scoring.tsv
> vRhyme_best_bins_fasta
- vRhyme_bin_#.faa
- vRhyme_bin_#.fasta
- vRhyme_bin_#.ffn
> vRhyme_bam_files *
- (sample).sorted.bam
- (sample).bam *
- (sample).bam.bai *
> vRhyme_sam_files *
- (sample).sam
The following bullet points are guidelines on interpreting binning results from vRhyme. Please note that this list is not exhaustive. For examples and data, please see analyses done in the vRhyme publication.
-h
for a help message with usage.
-m
flag should be used.
-i
: (required) input nucleotide scaffolds. For best results, this is predicted virus scaffolds from a metagenome or entire virome assembly. Alternatively, an entire metagenome can be used as input, but this has not been extensively tested.
-o
: output folder to deposit all results. vRhyme will exit if the folder already exists.
-g
: input nucleotide genes in the format 'name_#'.
-p
: input amino acid proteins genes in the format 'name_#'.
-b
: BAM (extension '.bam') sequence alignment files. The files can sorted or unsorted (each will be checked individually) but vRhyme will use sorted and indexed BAM files for coverage. Example usage: -b bam_file_folder/*.bam
-t
: number of parallel processes (threads) to run. With -c/-p/-g
inputs vRhyme runtime is sufficient with low -t
values (e.g., 1). vRhyme will be most efficient when -t
is set to the same value as --iter
such that 1 iteration (--iter
) is run on 1 thread (-t
) simultaneously. The same is true for whole integer multiples. For example: -t 10 --iter 10
, -t 20 --iter 10
, -t 30 --iter 15
. Note that any -t
and --iter
combination can be used.
-l
: minimum scaffold length to consider for binning. The default minimum is 2000bp (2kb) and cannot be reduced. Scaffolds <2kb do not contain enough information for accurate binning and can create erroneous results.
NOTE: For FASTA inputs (-i/-g/-p
) most characters are accepted with few exceptions (e.g., pipe symbol |
and comma ,
).
NOTE: -g
and -p
can be skipped and Prodigal will be used to predict open reading frames
NOTE: For -g/-p
, the format 'name#' is Prodigal format. 'name' can be any sequence name with '#' designating the protein/gene number. For example, 'example_sequence_1' and 'example_sequence_2' is protein/gene 1 and 2.
NOTE: Prodigal is run with -m -p meta
.
-s
: SAM (extension '.sam') sequence alignment files. vRhyme will convert these files directly to BAM format before processing. Example usage: -s sam_file_folder/*.sam
-r
: paired forward (_1.fastq/_R1.fastq) and reverse (_2.fastq/_R2.fastq) read files. The extensions and suffix terminology is important to ensure proper pairing during alignment. With this input a file called 'log_vRhyme_paired_reads.tsv' will be created to validate that correct pairing was identified. Example usage: -r paired_reads_folder/*.fastq
-u
: unpaired (.fastq) read file(s). This is for single end sequencing and will not be considered with -r
. Example usage: -u unpaired_reads_folder/*.fastq
-v
: interleaved paired (.fastq) read file(s). Compatible with --aligner bowtie2
only. Example usage: -v interleaved_reads_folder/*.fastq
-c
: coverage table input. See the example run for formatting if you want to generated this file manually. This may be useful if re-binning after a coverage table has already be generated by vRhyme or if an error caused vRhyme to exit after the coverage table was created. This input can also come from using cov_table_convert.py on a coverage table generated by 'jgi_summarize_bam_contig_depths'.
--interest
: file containing a list of scaffolds of interest to dereplicate and/or bin. Use this if your input (-i
) file contains more scaffolds than what you want to bin or dereplicate. The list is new line separated.
NOTE: all read inputs can be in gzip (.gz) format
NOTE: It is highly suggested to bin using at least 2-3 samples for coverage covariance comparisons (e.g., 3 -b
input files). vRhyme will still function and bin with 1 sample.
NOTE: BWA is run with 'mem'. A BWA database/index will be created if it doesn't exist in the location of the -i
file. Use Bowtie2 or separate read map software of choice for interleaved reads or if reads are improperly formatted for BWA.
NOTE: Bowtie2 is run with '--no-unal --no-discordant' using -r
or -v
. A Bowtie2 database/index will be created if it doesn't exist in the location of the -i
file.
--keep_sam
: (bool, just use the flag) If used, do not remove SAM files generated by vRhyme. Otherwise all SAM files generated from vRhyme read alignment will be deleted. All input SAM files (-s
) are not effected.--keep_bam
: (bool, just use the flag) If used, do not remove unsorted/index BAM files generated by vRhyme. Otherwise all unsorted/index BAM files generated from vRhyme read alignment or SAM conversions will be deleted. All input and final sorted BAM files (-b
) are not effected.--speed
: (bool, just use the flag) If used, allow extra CPU multithreading per process (may use more CPUs than -t #
per process). This is effected by the environments OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, VECLIB_MAXIMUM_THREADS, or NUMEXPR_NUM_THREADS.--verbose
: (bool, just use the flag) If used, write log file information to standard out while running. This is identical to what is written to the log file.--prefix
: The prefix to append to binned scaffold names. The default is 'vRhyme#_'. Use the '#' symbol where the bin number will be noted. For example, by using --prefix BIN_#__
bin 1 will have the prefix 'BIN_1__', bin 2 will be 'BIN_2__' and so on. --aligner
: read alignment software to use. Either 'bowtie2' (default) or 'bwa'. Other alignment software are not supported at this time. Use BAM (-b
), SAM (-s
) or coverage table (-c
) inputs for alignments from other software.--read_id
: minimum percent identity per aligned read for calculating coverage. The default is '0.97' (97%). Percent identity is calculated per read as the number of exact matches divided by the length of the aligned read. Exact matches are calculated by the length of the aligned read minus the number of gaps and mismatches. This is the quality trimming of the read alignment as some reads are aligned at poor identity depending on alignment software settings. This is compatible with reads file inputs or BAM/SAM inputs.--mask
: mask coverage values <\int> bases on each end of a scaffold. The default is 150. For short read alignments the coverages can be skewed at scaffold ends and impact coverage covariance calculations. Masking (not including) these coverage values on each end can produce better results. The default of 150 is suitable for most situations, but changing this value to the average read length may be beneficial.--outliers
: remove outlier coverage values -c
coverage table input. --bin_size
: minimum number of scaffolds per bin. The default is 2. For example, setting this to 5 means all reported bins must contain at least 5 scaffolds. Low values, such as 2, are best for binning viruses that have smaller genomes.--iter
: number of binning iterations (presets) to run. The default is 20 and the available range is 10-20. The presets are immutable threshold settings for the machine learning model probability, Cohen's d coverage effect size, and network edge weight cutoffs. Each iteration will run a single preset. The minimum setting of 10 is usually sufficient to identify the best binning iteration, though there isn't a significant effect on runtime by leaving this at the default maximum of 20.--red
: maximum number of redundant proteins per bin. The default is 50. This is a cutoff for how many of a bin's proteins can be redundant for generating/reporting the bin. The default of 50 means most bins are reported. For example, setting --red 1
would result in only creating bins with <=1 redundant protein, which is a typical low-contamination bin. Redundant proteins contained on a single scaffold are not considered redundant. Modifying this value does not impact how bins are formed, only how bins are reported. Redundancy per bin is reported to allow for filtering post-binning. Note that some eukaryotic viruses, and few prokaryotic viruses, can normally have up to 10 redundant proteins. Uncommon exceptions may have greater than 10. --cov
: coverage threshold to consider scaffold as present in sample. The default is 0.80. Any scaffold with an average coverage below this threshold is not considered as present. Typically, lowering this value increases recall along with contamination, whereas increasing this value lowers recall and contamination.--model
: machine learning model(s) to use. The default is 'hybrid', which means both the Neural Network (NN) and Extra Trees (ET) models are used. There is a minor reduction in runtime when using both models, but it is more accurate than either on their own. When both models are used the probability output from both models needs to meet the given threshold.--max_gc
: maximum GC distance for pre-filtering. The default is 0.20. This is the difference in %GC between two scaffolds being compared. Scaffolds from the same genome tend to have similar GC content. Lowering this value (e.g., 0.10 or 0.15) will minorly increase speed for large datasets because fewer scaffolds will be compared with more computationally expensive tasks. Setting this value to 1.00 will turn off any filtering.--min_kmer
: minimum k-mer distance for pre-filtering. The default is 0.60. This is the calculated distance between two scaffolds' tetranucleotide frequencies. Scaffolds that are more alike will have higher values (e.g., >0.80) and those that are more dissimilar will be lower (e.g., <0.40). Increasing this value (e.g., 0.70 or 0.80) will minorly increase speed for large datasets because fewer scaffolds will be compared with more computationally expensive tasks. Setting this value to 0.00 will turn off any filtering. Setting this value too high (e.g., >0.90) may greatly reduce recall but only the most similar scaffolds by k-mer usage will be compared.--max_edges
: maximum number of edges per node in network clustering. The default is a dynamic range between 3 and 6 (int) depending on the number of input scaffolds (max_edgs = -0.001x + 6.125, where x is input scaffolds). Lowering this value will decrease connections between scaffolds and increase the reliability of the connections at the expense of false negative interactions. With values lower than 4, large viral genomes may be split between bins if highly fractionated but complex viromes are binned more reliably. --penalty_w
: penalty weight for Cohen's d distance calculations. The default is 0.20. Each Cohen's d value is compared to the respective preset threshold (see --iter
). Any values greater than the threshold are given a penalty. Lower penalty weights will be less sensitive to coverage differences between scaffolds, and likewise greater penalty weights will be more strict. --penalty_n
: maximum number of penalties for Cohen's d distance calculations. The default is 10 with a maximum of <30% of input sequences. Each Cohen's d value is compared to the respective preset threshold (see --iter
). Any values greater than the threshold are given a penalty. Increasing the maximum allowed penalties will be less sensitive to coverage differences between scaffolds (higher chance of false positive), and likewise decreasing maximum allowed penalties will be more strict (lower chance of false positive).--mems
: refine bins with at least N members. The default is 4. Any potential bin with at least N members will be subject to bin refinement. Decreasing this value typically has minimal effects. Increasing this value may lead to greater contamination, but be less likely to refine (e.g., split apart) large viral genomes with many scaffolds. --derep_only
: (bool, just use the flag) Only use dereplication function, skip binning. This is off by default. If set, binning will be skipped along with any binning-specific flags. Do not use this flag if you want to bin and dereplicate. --method
: method of input scaffold dereplication. Options: none, longest, composite. The default is 'none', meaning dereplication by default is off. To turn on dereplication, set this flag to either 'longest' or 'composite'. Setting to 'longest' will dereplicate scaffolds and keep the longest representative. Setting to 'composite' will dereplicate identical scaffolds and combine overlapping scaffolds into composite sequences. 'Composite' is compatible with reads input or --derep_only
.-m
: minimum scaffold length to consider for dereplication. The default minimum is 1000bp (1kb) and cannot be reduced. Note that this value is different than -l
. --derep_id
: minimum percent similarity for dereplication. The default for 'composite' method is 0.99 and for 'longest' method is 0.97. The values are input as decimal points where 0.97 = 97%. --frac
: minimum coverage in overlap percent for dereplication. The default for 'composite' method is 0.20 and for 'longest' method is 0.70. The values are input as decimal points where 0.70 = 70%.
--mash_k
: equivalent to mash sketch -k. The default is 31.--mash_s
: equivalent to mash sketch -s. The default is 1000.--sens_ends
: free ends sensitivity, as fraction of scaffold length, to merging over regions of complexity; requires --method composite
. The default is 0.10.--sens_overlap
: alignment overlap sensitivity, as fraction of scaffold length, to merging over regions of complexity; requires --method composite
. The default is 0.25.--nucmer_c
: equivalent to nucmer -c. The default is 1000. --nucmer_b
: equivalent to nucmer -b. The default is 1000. --nucmer_g
: equivalent to nucmer -g. The default is 1000. --nuc_split
: maximum fragmentations allowed per scaffold in nucmer alignment. The default is 3. Increasing this value will allow for more gapped alignments and decreasing will be more strict. Please contact Kristopher Kieft (kieft@wisc.edu or GitHub Issues) with any questions, concerns or comments.
Thank you for using vRhyme!
______________________________________________________________________
## ## ## ##
## ## ## ## ## ## ## ## # ## ##
## ## ## ## ## ## ## ## ## ## ## #
## ## ## ## ## ## ## ## ## ## ## ## ##
## ## ## #### ## ## ## ## ## ## ## ##
## ## ## ## ## ## ## ## ## ## ##
### ## ## ## ## ## ## ## ## ## ## ##
______________________________________________________________________
vRhyme Copyright (C) 2022 Kristopher Kieft
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.