Kevin Jacobs jacobs@bioinfomed.com
vgraph
is a command line application and Python library to compare
genetic variants using variant graphs. Conventional methods used to compare
variants apply heuristic normalization rules and then compare variants
individually by matching based on genomic position and allele information.
In contrast, vgraph
utilizes a graph representation of genomic variants
to precisely compare complex variants that are refractory to comparison
using conventional methods.
vgraph
currently accepts block gzipped and indexed VCF and BCF files.
Support for Complete Genomics var
and masterVar
formats will be
added in a future version. vgraph
also requires an indexed reference
genome in FASTA+FAI format.
vgraph
outputs diagnostic only information to stdout. In repmatch
mode, there are options to output either of the two input files with match
status annotations. In dbmatch
mode, the sample
input file is output
after copying all new INFO and FORMT annotations from the database
file.
vgraph
takes the following command line parameters::
usage: vgraph [-h] [--debug] [--profile] {repmatch,dbmatch} ...
positional arguments:
{repmatch,dbmatch} Commands
repmatch compare two replicate samples
dbmatch compare a database of alleles to a sample
optional arguments:
-h, --help show this help message and exit
--debug Output extremely verbose debugging information
--profile Profile code performance
The parameters for repmatch
are::
usage: vgraph repmatch [-h] [--out1 OUT1] [--out2 OUT2] [--name1 N]
[--name2 N] --reference FASTA [-p N]
[--include-regions BED] [--exclude-regions BED]
[--include-file-regions BED]
[--exclude-file-regions BED] [--include-filter F]
[--exclude-filter F] [--min-gq N]
vcf1 vcf2
positional arguments:
vcf1 Sample 1 VCF/BCF input (- for stdin)
vcf2 Sample 2 VCF/BCF input (- for stdin)
optional arguments:
-h, --help show this help message and exit
--out1 OUT1 Sample 1 VCF/BCF output (optional)
--out2 OUT2 Sample 2 VCF/BCF output (optional)
--name1 N Name or index of sample in sample 1 file (default=0)
--name2 N Name or index of sample in sample 2 file (default=0)
--reference FASTA Reference FASTA+FAI (required)
-p N, --reference-padding N
Pad variants by N bp when forming superloci
(default=2)
--include-regions BED
BED file of regions to include in comparison
--exclude-regions BED
BED file of regions to exclude from comparison
--include-file-regions BED
BED file of regions to include for each input file
--exclude-file-regions BED
BED file of regions to exclude from comparison for
each input file
--include-filter F Include records with filter status F. Option may be
specified multiple times or F can be comma delimited
--exclude-filter F Exclude records with filter status F. Option may be
specified multiple times or F can be comma delimited
--min-gq N Exclude records with genotype quality (GQ) < N
The parameters for dbmatch
are::
usage: vgraph dbmatch [-h] [--name N] [-o OUTPUT] --reference FASTA [-p N]
[--include-regions BED] [--exclude-regions BED]
[--include-file-regions BED]
[--exclude-file-regions BED] [--include-filter F]
[--exclude-filter F] [--min-gq N]
database sample
positional arguments:
database Database of alleles VCF/BCF input (- for stdin)
sample Sample VCF/BCF input (- for stdin)
optional arguments:
-h, --help show this help message and exit
--name N Name or index of sample in sample file (default=0)
-o OUTPUT, --output OUTPUT
Sample VCF/BCF output
--reference FASTA Reference FASTA+FAI (required)
-p N, --reference-padding N
Pad variants by N bp when forming superloci
(default=2)
--include-regions BED
BED file of regions to include in comparison
--exclude-regions BED
BED file of regions to exclude from comparison
--include-file-regions BED
BED file of regions to include for each input file
--exclude-file-regions BED
BED file of regions to exclude from comparison for
each input file
--include-filter F Include records with filter status F. Option may be
specified multiple times or F can be comma delimited
--exclude-filter F Exclude records with filter status F. Option may be
specified multiple times or F can be comma delimited
--min-gq N Exclude records with genotype quality (GQ) < N
Before vgraph
may be installed, your systems requires a C compiler, a
functioning version of Python 3.5 or newer with development libraries
installed, and the pip
installer. The steps to install and ensuring
these tools are functional depend on your operating system and personal
configuration. Proceed only once these pre-requisites are available.
First install the latest version of the Cython and pysam packages::
pip install -U Cython
pip install -U pysam
If all these steps have succeeded, then install vgraph
::
pip install -U git+https://github.com/bioinformed/vgraph.git