DennisSchmitz / Jovian_archive

Metagenomics/viromics pipeline that focuses on automation, user-friendliness and a clear audit trail. Jovian aims to empower classical biologists and wet-lab personnel to do metagenomics/viromics analyses themselves, without bioinformatics expertise.
GNU Affero General Public License v3.0

Jovian
A user-friendly Viromics toolkit


For Citations, please use the following DOI:
Zenodo DOI

See the documentation:
Jovian Docs
Or view an example notebook:
Launch an example notebook

IMPORTANT: manuscript is in preparation


About Jovian

Jovian is a Public Health toolkit to automatically process raw NGS data from human clinical matrices (faeces, serum, etc.) into clinically relevant information. It has three main components:

Key features of Jovian:




Commands

:memo: Please see the full Command Line Reference on the documentation site for a more detailed explanation of each command, including example commands for starting an analysis or common usage examples.

Below is a short list of the most frequently used commands and use cases.

Use case 1:
Metagenomic analysis based on Illumina data:

bash jovian illumina-metagenomics -i <INPUT DIRECTORY>

Use case 2:
Align Illumina data against a user-provided reference to generate a consensus genome:

bash jovian illumina-reference -i <INPUT DIRECTORY> -ref <REFERENCE FASTA>

Use case 3:
Align Nanopore (multiplex) PCR data against a user-provided reference, remove overrepresented primer sequences, and generate a consensus genome:

bash jovian nanopore-reference -i <INPUT DIRECTORY> -ref <REFERENCE FASTA> -pr <PRIMER FASTA>

Use bash jovian -h to see the full list of commands available in the Jovian version that you are using.
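The three use cases above differ only in the workflow name and the files they need. As a minimal sketch, the helper below checks that the inputs exist and then prints the matching command instead of running it (a dry run). The function name `jovian_dry_run` and its argument order are hypothetical and not part of Jovian. Only the workflow names and the `-i`, `-ref` and `-pr` flags are taken from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Jovian: validates inputs and prints the
# command from the use cases above without executing it.
jovian_dry_run() {
    local workflow="${1:-}" input_dir="${2:-}" reference="${3:-}" primers="${4:-}"

    # Every workflow shown above needs an input directory.
    if [ ! -d "$input_dir" ]; then
        echo "error: input directory not found: $input_dir" >&2
        return 1
    fi

    case "$workflow" in
        illumina-metagenomics)
            echo "bash jovian illumina-metagenomics -i $input_dir"
            ;;
        illumina-reference)
            [ -f "$reference" ] || { echo "error: reference FASTA not found: $reference" >&2; return 1; }
            echo "bash jovian illumina-reference -i $input_dir -ref $reference"
            ;;
        nanopore-reference)
            [ -f "$reference" ] || { echo "error: reference FASTA not found: $reference" >&2; return 1; }
            [ -f "$primers" ] || { echo "error: primer FASTA not found: $primers" >&2; return 1; }
            echo "bash jovian nanopore-reference -i $input_dir -ref $reference -pr $primers"
            ;;
        *)
            echo "error: unknown workflow: $workflow" >&2
            return 1
            ;;
    esac
}
```

For example, `jovian_dry_run illumina-reference raw_data/ reference.fasta` prints the use case 2 command if both paths exist (the paths here are placeholders). The printed command is meant for inspection only. Paths that contain spaces would need quoting before the command is run.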


Features

:memo: Please refer to our documentation for the full list of features

General features

Metagenomics specific features

Reference-alignment specific features

Visualizations

All data are visualized via an interactive web-report, as shown here, which includes:

Virus typing

After a Jovian analysis is finished you can perform virus-typing (i.e. sub-species level taxonomic labelling). These analyses can be started by the command bash jovian -vt [virus keyword], where [virus keyword] can be:

| Keyword | Taxon used for scaffold selection | Notable virus species |
|:---|:---|:---|
| NoV | Caliciviridae | Norovirus GI and GII, Sapovirus |
| EV | Picornaviridae | Enteroviruses (Coxsackie, Polio, Rhino, etc.), Parecho, Aichi, Hepatitis A |
| RVA | Rotavirus A | Rotavirus A |
| HAV | Hepatovirus A | Hepatitis A |
| HEV | Orthohepevirus A | Hepatitis E |
| PV | Papillomaviridae | Human Papillomavirus |
| Flavi | Flaviviridae | Dengue (work in progress) |
| all | All of the above | All of the above |
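Because the virus-typing command accepts only the keywords listed above, a mistyped keyword is easy to catch before starting an analysis. The sketch below is a hypothetical helper (`vt_command` is not part of Jovian) that checks a keyword against that list and prints the corresponding `bash jovian -vt` command instead of running it. It assumes keywords are case-sensitive and written exactly as in the table, which the source does not confirm.

```shell
#!/usr/bin/env bash
# Keywords copied from the table above. The helper itself is hypothetical.
VT_KEYWORDS="NoV EV RVA HAV HEV PV Flavi all"

vt_command() {
    local keyword="${1:-}" known
    for known in $VT_KEYWORDS; do
        # Assumption: exact, case-sensitive match against the table.
        if [ "$keyword" = "$known" ]; then
            echo "bash jovian -vt $keyword"
            return 0
        fi
    done
    echo "error: unknown virus keyword: '$keyword' (expected one of: $VT_KEYWORDS)" >&2
    return 1
}
```

Calling `vt_command NoV` prints `bash jovian -vt NoV`, while `vt_command norovirus` prints an error and returns a non-zero status.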

Audit trail

An audit trail, used for clinical reproducibility and logging, is generated and contains:

However, it has limitations since several things are out-of-scope for Jovian to control:


Jovian Illumina Metagenomics workflow visualization

[Image: Jovian Illumina Metagenomics workflow]


Jovian Illumina Reference alignment workflow visualization

[Image: Jovian Illumina Reference workflow]


Jovian Nanopore Reference alignment workflow visualization

[Image: Jovian Nanopore Reference workflow]

Requirements

:memo: Please refer to our documentation for a detailed overview of the Jovian requirements.


Installation

:memo: Please refer to our documentation for detailed installation instructions for Jovian.

Usage instructions

Usage instructions differ for each workflow that we support.
Please refer to the documentation for the workflow that you wish to use.


FAQ

Can be found here.


Example Jovian report

Can be found here.


Acknowledgements

| Name | Publication | Website |
|:---|:---|:---|
| BBtools | NA | https://jgi.doe.gov/data-and-tools/bbtools/ |
| BEDtools | Quinlan, A.R. and I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 2010. 26(6): p. 841-842. | https://bedtools.readthedocs.io/en/latest/ |
| BLAST | Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997. 25(17): p. 3389-3402. | https://www.ncbi.nlm.nih.gov/books/NBK279690/ |
| BWA | Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997. | https://github.com/lh3/bwa |
| BioConda | Grüning, B., et al., Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 2018. 15(7): p. 475. | https://bioconda.github.io/ |
| Biopython | Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., ... & De Hoon, M. J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423. | https://biopython.org/ |
| Bokeh | Bokeh Development Team (2018). Bokeh: Python library for interactive visualization. | https://bokeh.pydata.org/en/latest/ |
| Bowtie2 | Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012. 9(4): p. 357. | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml |
| Conda | NA | https://conda.io/ |
| DRMAA | NA | http://drmaa-python.github.io/ |
| FastQC | Andrews, S., FastQC: a quality control tool for high throughput sequence data. 2010. | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ |
| gawk | NA | https://www.gnu.org/software/gawk/ |
| GNU Parallel | O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014. | https://www.gnu.org/software/parallel/ |
| Git | NA | https://git-scm.com/ |
| igvtools | NA | https://software.broadinstitute.org/software/igv/igvtools |
| Jupyter Notebook | Kluyver, Thomas, et al. "Jupyter Notebooks-a publishing format for reproducible computational workflows." ELPUB. 2016. | https://jupyter.org/ |
| Jupyter_contrib_nbextension | NA | https://github.com/ipython-contrib/jupyter_contrib_nbextensions |
| Jupyterthemes | NA | https://github.com/dunovank/jupyter-themes |
| Krona | Ondov, B.D., N.H. Bergman, and A.M. Phillippy, Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 2011. 12: p. 385. | https://github.com/marbl/Krona/wiki |
| Lofreq | Wilm, A., et al., LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Research, 2012. 40(22): p. 11189-11201. | http://csb5.github.io/lofreq/ |
| MGkit | Rubino, F. and Creevey, C.J. 2014. MGkit: Metagenomic Framework For The Study Of Microbial Communities. Available at: figshare [doi:10.6084/m9.figshare.1269288]. | https://bitbucket.org/setsuna80/mgkit/src/develop/ |
| Minimap2 | Li, H., Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 2018. | https://github.com/lh3/minimap2 |
| MultiQC | Ewels, P., et al., MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 2016. 32(19): p. 3047-3048. | https://multiqc.info/ |
| Nb_conda | NA | https://github.com/Anaconda-Platform/nb_conda |
| Nb_conda_kernels | NA | https://github.com/Anaconda-Platform/nb_conda_kernels |
| Nginx | NA | https://www.nginx.com/ |
| Numpy | Walt, S. V. D., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 22-30. | http://www.numpy.org/ |
| Pandas | McKinney, W. Data structures for statistical computing in python. in Proceedings of the 9th Python in Science Conference. 2010. Austin, TX. | https://pandas.pydata.org/ |
| Picard | NA | https://broadinstitute.github.io/picard/ |
| Prodigal | Hyatt, D., et al., Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 2010. 11(1): p. 119. | https://github.com/hyattpd/Prodigal/wiki/Introduction |
| Python | G. van Rossum, Python tutorial, Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, May 1995. | https://www.python.org/ |
| Qgrid | NA | https://github.com/quantopian/qgrid |
| SAMtools | Li, H., et al., The sequence alignment/map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-2079. | http://www.htslib.org/ |
| SPAdes | Nurk, S., et al., metaSPAdes: a new versatile metagenomic assembler. Genome Res, 2017. 27(5): p. 824-834. | http://cab.spbu.ru/software/spades/ |
| seqkit | Shen, Wei, et al. "SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation." PloS one 11.10 (2016). | https://github.com/shenwei356/seqkit |
| Seqtk | NA | https://github.com/lh3/seqtk |
| Snakemake | Köster, J. and S. Rahmann, Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 2012. 28(19): p. 2520-2522. | https://snakemake.readthedocs.io/en/stable/ |
| Tabix | NA | www.htslib.org/doc/tabix.html |
| tree | NA | http://mama.indstate.edu/users/ice/tree/ |
| Trimmomatic | Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20. | www.usadellab.org/cms/?page=trimmomatic |
| Virus-Host Database | Mihara, T., Nishimura, Y., Shimizu, Y., Nishiyama, H., Yoshikawa, G., Uehara, H., ... & Ogata, H. (2016). Linking virus genomes with host taxonomy. Viruses, 8(3), 66. | http://www.genome.jp/virushostdb/note.html |
| Virus typing tools | Kroneman, A., Vennema, H., Deforche, K., Avoort, H. V. D., Penaranda, S., Oberste, M. S., ... & Koopmans, M. (2011). An automated genotyping tool for enteroviruses and noroviruses. Journal of Clinical Virology, 51(2), 121-125. | https://www.ncbi.nlm.nih.gov/pubmed/21514213 |

Authors


This project/research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 643476 and from the Dutch working group on molecular diagnostics (WMDI).