bystrogenomics / bystro

Bystro genetic analysis (annotation, filtering, statistics)
Apache License 2.0

Bystro

TL;DR: 1,000x+ faster than VEP, with more complete annotation and online search (https://bystro.io) for datasets of up to 47TB (compressed), or petabytes offline.

[Figure: Bystro performance]

Bystro Publication

For datasets and scripts used, please visit github.com/bystro-paper

If you use Bystro, please cite Kotlar et al., Genome Biology, 2018

Web Tutorial

Start here: TUTORIAL.md

For most users, we recommend https://bystro.io .

The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.

Installing Bystro

Bystro consists of two main components: the Bystro Python package, which comprises the Bystro ML library, a CLI tool, and a collection of easy-to-use biology tools including global ancestry; and the Bystro annotator (Perl).

The Bystro Python package also gives the ability to launch workers to process jobs from the Bystro API server, but this is not necessary for most users.

Installing the Bystro Python libraries and CLI tools

To install the Bystro Python package, run:

pip install --pre bystro

The Bystro ancestry CLI score tool (bystro-api ancestry score) parses VCF files to generate dosage matrices. This requires bystro-vcf, a Go program which can be installed with:

# Requires Go: install from https://golang.org/doc/install
go install github.com/bystrogenomics/bystro-vcf@2.2.2
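
After installation, it can help to confirm that the bystro-vcf binary is discoverable. The sketch below is not part of Bystro: `find_tool` is a hypothetical helper, and it assumes the tool is expected to be reachable on PATH. It relies only on the standard Go behavior that `go install` writes binaries to `$GOBIN`, or to `$GOPATH/bin` (by default `~/go/bin`) when `GOBIN` is unset.

```python
import os
import shutil


def find_tool(name="bystro-vcf"):
    """Return the full path to `name`, or None if it cannot be found.

    Hypothetical helper, not part of Bystro. Checks PATH first, then the
    directories where `go install` places binaries by default.
    """
    found = shutil.which(name)
    if found:
        return found

    candidates = []
    gobin = os.environ.get("GOBIN")
    if gobin:
        candidates.append(gobin)
    # GOPATH may list several directories; `go install` uses the first one.
    default_gopath = os.path.join(os.path.expanduser("~"), "go")
    gopath = os.environ.get("GOPATH") or default_gopath
    candidates.append(os.path.join(gopath.split(os.pathsep)[0], "bin"))

    for directory in candidates:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return None


if __name__ == "__main__":
    location = find_tool()
    if location:
        print(f"bystro-vcf found at {location}")
    else:
        print("bystro-vcf not found; add your Go bin directory to PATH")
```

If the binary is found only in a Go bin directory, adding that directory to PATH makes it available to other programs.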

Bystro is compatible with Linux and macOS. Windows support is experimental. If you are installing on macOS as a native binary (Arm), you will also need to install the following dependency:

brew install cmake

Please refer to INSTALL.md for more details.

Installing the Bystro Annotator

Please refer to INSTALL.md for instructions on how to install the Bystro annotator.

File support

Bystro relies on pluggable pre-processors, configured via Bystro's YAML config, to normalize variant inputs (handling VCF issues such as padding). The pre-processors also determine whether a site is a transition or transversion, compute the sample MAF, identify heterozygous, homozygous, and missing samples, and calculate heterozygosity, homozygosity, missingness, and more.

  1. VCF format: Bystro-Vcf
  2. SNP format: Bystro-SNP
  3. Create your own to support other formats!
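
To make the per-site quantities mentioned above concrete, here is a minimal Python sketch. It is not Bystro's implementation: the function names, the dictionary keys, the diploid assumption, and the exact denominators (called samples for heterozygosity and homozygosity, all samples for missingness) are illustrative assumptions. The alternate-allele frequency among called samples stands in for the sample MAF.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a SNP, None otherwise."""
    ref, alt = ref.upper(), alt.upper()
    bases = ("A", "C", "G", "T")
    if ref not in bases or alt not in bases or ref == alt:
        return None  # indels, multi-base alleles, or no change
    return "transition" if (ref, alt) in TRANSITIONS else "transversion"


def site_stats(genotypes):
    """Summarize one biallelic site.

    `genotypes` maps sample name -> alternate-allele dosage (0, 1, or 2),
    or None when the call is missing. Diploid samples are assumed.
    """
    total = len(genotypes)
    hets = sorted(s for s, g in genotypes.items() if g == 1)
    homs = sorted(s for s, g in genotypes.items() if g == 2)
    missing = sorted(s for s, g in genotypes.items() if g is None)
    called = total - len(missing)
    alt_alleles = len(hets) + 2 * len(homs)
    return {
        "hets": hets,
        "homs": homs,
        "missing": missing,
        "heterozygosity": len(hets) / called if called else 0.0,
        "homozygosity": len(homs) / called if called else 0.0,
        "missingness": len(missing) / total if total else 0.0,
        # Alt-allele frequency among called samples (assumed MAF stand-in).
        "sample_maf": alt_alleles / (2 * called) if called else 0.0,
    }


if __name__ == "__main__":
    print(classify_snp("A", "G"))
    print(site_stats({"s1": 0, "s2": 1, "s3": 2, "s4": None}))
```

A real pre-processor would also need to handle multiallelic sites and non-diploid calls, which this sketch deliberately ignores.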

Annotation (Output) Field Descriptions

Please read FIELDS.md

The Bystro configuration file

Directories and Files

These describe where the Bystro database and any source files are located.

  1. files_dir: The parent folder within which each track's local_files are located
  2. database_dir: Each database is held within database_dir, in a folder named after the assembly

    Ex: For the config file containing

    assembly: hg19
    database_dir: /path/to/databases/

    Bystro will look for the database /path/to/databases/hg19
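
The lookup rule described above can be sketched in a few lines of Python. This is an illustration, not Bystro's code: `database_path` is a hypothetical helper, and the config is represented as an already-parsed dictionary rather than loaded from the YAML file.

```python
import os


def database_path(config):
    """Join database_dir and assembly to get the database folder.

    Hypothetical helper: `config` is a dict holding the two keys shown
    in the example config above.
    """
    return os.path.join(config["database_dir"], config["assembly"])


if __name__ == "__main__":
    config = {"assembly": "hg19", "database_dir": "/path/to/databases/"}
    print(database_path(config))  # /path/to/databases/hg19
```

Changing only the assembly value (for example to a different genome build) points the same database_dir at a different database folder.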