SegataLab / viromeqc

ViromeQC is a computational tool to benchmark and quantify non-viral contamination in VLP-enrihed viromes. ViromeQC provides an enrichment score for each virome. The score is calculated with respect to the expected prokaryotic markers abundances in reference metagenomes
MIT License
15 stars 1 forks source link
virome

ViromeQC

Description

Requires:

Update: ViromeQC now works with newer versions of diamond (e.g. v0.9.29) Thanks to Ryan Cook (@RyanCookAMR) for the new diamond db

Usage

Step 1: clone or download the repository

git clone --recurse-submodules https://github.com/SegataLab/viromeqc.git

or download the repository from the releases page

Step 2: install the database:

This steps downloads the database file. This needs to be done only the first time you run ViromeQC. This may require a few minutes, depending on your internet connection.

viromeQC.py --install

Alternatively, you can also download the database files from Zenodo. Once downloaded the files, create a folder named index/ in the ViromeQC installation folder and unzip all the files in this folder.

Step 3: Run on your sample

viromeQC.py -i <input_virome_file(s)> -o <report_file.txt>

Please Note: You can pass more than one file as input (e.g. for multiple runs or paired end reads). However, you can process only one sample at a time with this command. If you want to parallelize the execution, this can be easily done with Parallel or equivalent tools.

You can try the test example (test/test.sh) which analyzes 10'000 reads from the sample SRR829034. This should take approximately 1 or 2 minutes.

Parameters:

usage: viromeQC.py -i <input_virome_file> -o <report_file.txt>

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Raw Reads in FASTQ format. Supports multiple inputs
                        (plain, gz o bz2) (default: None)
  -o OUTPUT, --output OUTPUT
                        output file (default: None)
  --minlen MINLEN       Minimum Read Length allowed (default: 75)
  --minqual MINQUAL     Minimum Read Average Phred quality (default: 20)
  --bowtie2_threads BOWTIE2_THREADS
                        Number of Threads to use with Bowtie2 (default: 4)
  --diamond_threads DIAMOND_THREADS
                        Number of Threads to use with Diamond (default: 4)
  -w {human,environmental}, --enrichment_preset {human,environmental}
                        Calculate the enrichment basing on human or
                        environmental metagenomes. Defualt: human-microbiome
                        (default: human)
  --bowtie2_path BOWTIE2_PATH
                        Full path to the bowtie2 command to use, deafult
                        assumes that bowtie2 is present in the system path
                        (default: bowtie2)
  --diamond_path DIAMOND_PATH
                        Full path to the diamond command to use, deafult
                        assumes that diamond is present in the system path
                        (default: diamond)
  --version             Prints version informations (default: False)
  --install             Downloads database files (default: False)
  --sample_name SAMPLE_NAME
                        Optional label for the sample to be included in the
                        output file (default: None)
  --tempdir TEMPDIR     Temporary Directory override (default is the system
                        temp. directory) (default: None)

Pipeline structure

ViromeQC starts from FASTQ files (compressed files are supported), and will:

  1. Elimitate short and low quality reads
    • adjust the minqual and minlen parameters if you want to change the thresholds
  2. Map the reads against a curated collection of rRNAs and single-copy bacteral markers
  3. Filter the reads to remove short and dlsivergent alignments
  4. Compute the enrichment value of the sample, compared to the median observed in human metagenomes
    • use -w environmental for envronmental reads
    • reference medians for un-enriched metagenomes are taken from medians.csv, you can provide your own data to ViromeQC by changing this file accordingly
  5. Produce a report file with the alignment rates and the final enrichment score (which is the minimum enrichment observed across SSU-rRNA, LSU-rRNA and single-copy markers)

Output

Output is given as a TSV file with the following structure:

Sample Reads Reads_HQ SSU rRNA alignment (%) LSU rRNA alignment (%) Bacterial_Markers alignment (%) total enrichmnet score
your_sample.fq 40000 39479 0.00759898 0.0227969 0.01266496 5.795329

Citation

If you find this tool useful, please cite:

Zolfo, M., Pinto, F., Asnicar, F., Manghi, P., Tett A., Segata N. Detecting contamination in viromes using ViromeQC, Nature Biotechnology 37, 1408–1412 (2019)