TL;DR: 1,000x+ faster than VEP, with more complete annotation and online search (https://bystro.io) for datasets of up to 47TB (compressed), or petabytes offline.
For datasets and scripts used, please visit github.com/bystro-paper
If using Bystro, please cite Kotlar et al, Genome Biology, 2018
Start here: TUTORIAL.md
For most users, we recommend https://bystro.io .
The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.
Bystro consists of 2 main components: the Bystro Python package and the Bystro annotator (Perl). The Python package comprises the Bystro ML library, a CLI tool, and a collection of easy-to-use biology tools, including global ancestry.
The Bystro Python package also gives the ability to launch workers to process jobs from the Bystro API server, but this is not necessary for most users.
To install the Bystro Python package, run:
pip install --pre bystro
The Bystro ancestry CLI score tool (bystro-api ancestry score) parses VCF files to generate dosage matrices. This requires bystro-vcf, a Go program which can be installed with:
# Requires Go: install from https://golang.org/doc/install
go install github.com/bystrogenomics/bystro-vcf@2.2.2
Bystro is compatible with Linux and macOS. Windows support is experimental. If you are installing on macOS as a native binary (Arm), you will need to install the following additional dependency:
brew install cmake
Please refer to INSTALL.md for more details.
Please refer to INSTALL.md for instructions on how to install the Bystro annotator.
Bystro relies on pluggable (via Bystro's YAML config) pre-processors to normalize variant inputs (dealing with VCF issues such as padding), calculate whether a site is a transition or transversion, calculate sample maf, identify hets/homozygotes/missing samples, calculate heterozygosity, homozygosity, missingness, and more.
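To make the per-site quantities above concrete, here is a minimal sketch in Python. It is not Bystro's pre-processor code: the function names, the dosage encoding (0, 1, 2 alternate alleles per diploid sample, None for missing), and the choice of denominators are all assumptions made for illustration. In particular, the frequency computed here is the alternate-allele frequency among called samples, which equals the minor allele frequency only when the alternate allele is the rarer one.

```python
from typing import Dict, Optional, Sequence

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    valid = len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"
    if not valid or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    return "transition" if (ref, alt) in TRANSITIONS else "transversion"


def site_stats(dosages: Sequence[Optional[int]]) -> Dict[str, float]:
    """Summarize one site from per-sample alternate-allele counts.

    Assumed encoding (hypothetical): 0 = hom-ref, 1 = het, 2 = hom-alt,
    None = missing genotype.
    """
    called = [d for d in dosages if d is not None]
    n_total, n_called = len(dosages), len(called)
    hets = sum(1 for d in called if d == 1)
    homs = sum(1 for d in called if d == 2)
    return {
        # Fraction of all samples with no genotype call.
        "missingness": (n_total - n_called) / n_total if n_total else 0.0,
        # Fractions of called samples that are het / hom-alt.
        "heterozygosity": hets / n_called if n_called else 0.0,
        "homozygosity": homs / n_called if n_called else 0.0,
        # Alternate-allele frequency among called diploid samples.
        "sample_af": sum(called) / (2 * n_called) if n_called else 0.0,
    }
```

For example, `site_stats([0, 1, 2, None])` treats one of four samples as missing and computes the remaining fractions over the three called samples.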
Please read FIELDS.md
The Bystro YAML config has several keys:
tracks: The highest level organization for database values. Tracks have a name property, which must be unique, and a type, which must be one of:
sparse: A bed file, or any file that can be mapped to chrom, chromStart, and chromEnd columns using the fieldMap key.
score: A wigFix file.
cadd:
gene: A UCSC gene track table (ex: knownGene, refGene, sgdGene) stored as a tab separated output, with column names as columns. Conversion from SQL to the expected tab-delimited format is controlled by bin/bystro-utils.pl, which will automatically fetch the requested sql, and generate the tab-delimited output.
For instance: For a config file that has the following track
chromosomes:
  - chr1
tracks:
  tracks:
    - name: refSeq
      type: gene
      utils:
        - args:
            connection:
              database: hg19
            sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
              ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
              (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
              e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
              (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
              kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
              GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
              x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
              GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
              x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
              refGene r WHERE chrom=%chromosomes%;
Running bin/bystro-utils.pl --config <path/to/this/config> will result in the following config:
chromosomes:
  - chr1
tracks:
  tracks:
    - name: refSeq
      type: gene
      local_files:
        - hg19.kgXref.chr1.gz
      utils:
        - args:
            connection:
              database: hg19
            sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
              ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
              (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
              e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
              (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
              kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
              GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
              x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
              GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
              x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
              '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
              refGene r WHERE chrom=%chromosomes%;
          completed: <date fetched>
          name: fetch
hg19.kgXref.chr1.gz will contain:
bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames kgID description ensemblID tRnaName spID spDisplayID protAcc mRNA rfamAcc
0 NM_001376542 chr1 + 66999275 67216822 67000041 67208778 25 66999275,66999928,67091529,67098752,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 66999620,67000051,67091593,67098777,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67216822, 0 SGIP1 cmpl cmpl -1,0,1,2,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1, NA NA NA NA NA NA NA NA NA
nearest: A pre-calculated gene track that is intersected with a target gene track.
Example:
- name: refSeq.gene
  dist: false
  storeNearest: true
  to: txEnd
  type: nearest
  features:
    - name2
  from: txStart
  local_files:
    - hg19.kgXref.chr*.gz
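As a rough illustration of what a nearest-gene lookup involves, the sketch below finds the gene whose [txStart, txEnd] span lies closest to a position and reports its name2. This is an assumption-laden toy, not Bystro's implementation: the function name nearest_gene is hypothetical, intervals are treated as closed for simplicity, and a linear scan stands in for whatever pre-calculation Bystro performs at build time.

```python
from typing import List, Tuple

# (txStart, txEnd, name2), mirroring the from/to/features keys in the example.
Gene = Tuple[int, int, str]


def nearest_gene(pos: int, genes: List[Gene]) -> Tuple[str, int]:
    """Return (name2, distance) of the gene nearest to pos.

    Distance is 0 when pos falls inside the gene span. On ties, the gene
    that appears first in the list wins.
    """
    if not genes:
        raise ValueError("no genes to search")

    def distance(gene: Gene) -> int:
        start, end, _ = gene
        if pos < start:
            return start - pos
        if pos > end:
            return pos - end
        return 0

    best = min(genes, key=distance)
    return best[2], distance(best)


genes = [(100, 200, "GENE_A"), (500, 600, "GENE_B")]  # made-up coordinates
inside = nearest_gene(150, genes)
upstream_of_b = nearest_gene(450, genes)
```

A production version would sort the intervals once and use binary search rather than scanning every gene per query.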
Options:
dist: bool
vcf: A VCF v4.* file
chromosomes: The allowable chromosomes.
Each row of every track must be identified by these chromosomes (during building).
Each row of any input file submitted for annotation must also be identified by these chromosomes (during annotation).
However, Bystro is flexible about the chr prefix.
Ex: For the following config
chromosomes:
- chr1
- chr2
- chr3
Only chr1, chr2, and chr3 will be accepted. However, Bystro tries to make your life easy.
During building, the chromosome names in a track's source files must match those listed in chromosomes, meaning they should be prepended by chr.
Ex: Clinvar doesn't have a chr prefix, so during building we specify:
tracks:
  - name: clinvar
    build_field_transformations:
      chrom: chr .
    fieldMap:
      Chromosome: chrom
Here fieldMap allows us to rename header fields, and build_field_transformations allows us to define a prepend operation (chr . can be interpreted as the perl command "chr" . $chrom).
So: input files do not need to have their chromosomes prepended by chr. Bystro will normalize the name.
In this example chromosomes 1 and chr1 will be built/annotated, but 1_rand will not.
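The chr-prefix handling described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Bystro's code: normalize_chrom is a hypothetical helper, and it only covers the case in the text (adding a missing chr prefix and checking the result against the configured chromosomes).

```python
from typing import Iterable, Optional


def normalize_chrom(name: str, allowed: Iterable[str]) -> Optional[str]:
    """Map an input chromosome name onto the configured list, or return None.

    Names already in the list are accepted as-is; names that match once a
    "chr" prefix is added are normalized; everything else is rejected.
    """
    allowed_set = set(allowed)
    if name in allowed_set:
        return name
    prefixed = "chr" + name
    if prefixed in allowed_set:
        return prefixed
    return None


allowed = ["chr1", "chr2", "chr3"]
results = {name: normalize_chrom(name, allowed) for name in ["1", "chr1", "1_rand"]}
# "1" and "chr1" both normalize to "chr1"; "1_rand" maps to None and is skipped.
```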
These describe where the Bystro database and any source files are located.
files_dir: The parent folder within which each track's local_files are located. Bystro automatically checks for local_files at parent/trackName/file.
Ex: For the config file containing
files_dir: /path/to/files/
tracks:
  - name: refSeq
    local_files:
      - hg19.refGene.chr1.gz
      # and more files
Bystro will expect files in /path/to/files/refSeq/hg19.refGene.chr1.gz
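The parent/trackName/file rule can be expressed in a few lines. The sketch below mirrors the config above as a plain Python dict (an assumption; Bystro itself reads YAML) and uses a hypothetical helper, expected_paths, to list where each file would be looked up.

```python
import posixpath
from typing import Dict, List


def expected_paths(config: Dict) -> List[str]:
    """List the path files_dir/<track name>/<file> for every local file."""
    files_dir = config["files_dir"]
    paths = []
    for track in config.get("tracks", []):
        for file_name in track.get("local_files", []):
            # posixpath.join copes with the trailing slash on files_dir.
            paths.append(posixpath.join(files_dir, track["name"], file_name))
    return paths


config = {
    "files_dir": "/path/to/files/",
    "tracks": [{"name": "refSeq", "local_files": ["hg19.refGene.chr1.gz"]}],
}
paths = expected_paths(config)
```

Here `paths` contains the single entry /path/to/files/refSeq/hg19.refGene.chr1.gz, matching the location stated above.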
database_dir: Each database is held within database_dir, in a folder named after the assembly.
Ex: For the config file containing
assembly: hg19
database_dir: /path/to/databases/
Bystro will look for the database /path/to/databases/hg19