Echtvar efficiently encodes variant allele frequency and other information from huge population datasets to enable rapid (1M variants/second) annotation of genetic variants. It splits the genome into chunks of 1<<20 (~1 million) bases and encodes each variant into a 32-bit integer (with a supplemental table for those that can't fit due to large REF and/or ALT alleles). It uses the zip format, delta encoding, and integer compression to create a compact and searchable format of any integer, float, or low-cardinality string columns selected from the population file.
Read more at the why of echtvar.
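The chunk-and-pack idea described above can be illustrated with a small sketch. The bit layout used here (20 bits for the offset within a chunk, 2 bits each for the REF and ALT lengths, and 8 bits for up to four 2-bit bases) is an assumption chosen for illustration and is not taken from echtvar's source; the function names are hypothetical too. Variants that do not fit return `None`, standing in for the supplemental table.

```python
from itertools import accumulate

CHUNK_BITS = 20
CHUNK_SIZE = 1 << CHUNK_BITS          # 1,048,576 bases per chunk
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
MAX_BASES = 4                         # assumed limit: len(REF) + len(ALT) <= 4


def encode_variant(pos, ref, alt):
    """Return (chunk_index, packed_u32), or (chunk_index, None) if it does not fit."""
    chunk, offset = divmod(pos, CHUNK_SIZE)
    seq = ref + alt
    fits = ref and alt and len(seq) <= MAX_BASES and all(b in BASE_CODES for b in seq)
    if not fits:
        return chunk, None            # would go to a supplemental table
    bases = 0
    for b in seq:
        bases = (bases << 2) | BASE_CODES[b]
    # Hypothetical layout: [offset:20][ref_len:2][alt_len:2][bases:8]
    packed = (offset << 12) | (len(ref) << 10) | (len(alt) << 8) | bases
    return chunk, packed


def decode_variant(chunk, packed):
    """Invert encode_variant for a packed (non-None) value."""
    offset = packed >> 12
    ref_len = (packed >> 10) & 0b11
    alt_len = (packed >> 8) & 0b11
    bases = packed & 0xFF
    n = ref_len + alt_len
    seq = "".join("ACGT"[(bases >> (2 * (n - 1 - i))) & 0b11] for i in range(n))
    return chunk * CHUNK_SIZE + offset, seq[:ref_len], seq[ref_len:]


def delta_encode(sorted_values):
    """Store each value as the difference from its predecessor."""
    out, prev = [], 0
    for v in sorted_values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Recover the original sorted values from their deltas."""
    return list(accumulate(deltas))
```

Because the offset occupies the high bits in this sketch, sorting the packed integers within a chunk orders them by position, which keeps the deltas small and therefore friendly to integer compression.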
Get a static binary and pre-encoded echtvar files for gnomad v3.1.2 (hg38) here: https://github.com/brentp/echtvar/releases/latest That page contains exact instructions to get started with the static binary.
Running echtvar with an existing archive (several are available in the releases) is as simple as:
echtvar anno -e gnomad.echtvar.zip -e other.echtvar.zip input.vcf output.annotated.bcf
An optional filter that uses fields available in any of the zip files can be added like:
-i "gnomad_popmax_af < 0.01"
echtvar can also accept input from stdin using "-" or "/dev/stdin" for the input argument.
Make (encode) a new echtvar file. This is usually done once (or the file can be downloaded from those provided on the Releases page), and then the file can be re-used for the annotation (echtvar anno) step with each new query file.
Note that input VCFs must be decomposed.
# conf.json defines the columns to pull from $input_population_vcf[s], and how to name and encode them.
# The population VCF(s) can be split by chromosome or all in a single file.
echtvar \
encode \
gnomad.v3.1.2.echtvar.zip \
conf.json \
$input_population_vcf[s]
See below for a description of the json file that defines which columns are pulled from the population VCF.
Annotate a decomposed (and normalized) VCF with an echtvar file and only output variants where gnomad_popmax_af from the echtvar file is < 0.01. Note that multiple echtvar files can be specified, and that the -i expression is optional and can be omitted to output all variants.
echtvar anno \
-e gnomad.v3.1.2.echtvar.v2.zip \
-e dbsnp.echtvar.zip \
-i 'gnomad_popmax_af < 0.01' \
$cohort.input.bcf \
$cohort.echtvar-annotated.filtered.bcf
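When the annotation step is driven from a pipeline script, the command line above can be assembled programmatically. The helper below is a minimal sketch: `build_anno_command` is a hypothetical name, and it only uses the `anno` subcommand and the `-e` and `-i` flags shown in this document. The input defaults to "-" (stdin), as described earlier.

```python
def build_anno_command(archives, output_path, input_path="-", include=None):
    """Build an argument list for `echtvar anno`.

    archives    -- one or more echtvar zip files (each passed with -e)
    output_path -- where the annotated output is written
    input_path  -- query VCF/BCF, or "-" to read from stdin
    include     -- optional filter expression (passed with -i)
    """
    if not archives:
        raise ValueError("at least one echtvar archive is required")
    cmd = ["echtvar", "anno"]
    for archive in archives:
        cmd += ["-e", archive]
    if include is not None:
        cmd += ["-i", include]
    cmd += [input_path, output_path]
    return cmd


cmd = build_anno_command(
    ["gnomad.v3.1.2.echtvar.v2.zip", "dbsnp.echtvar.zip"],
    "cohort.echtvar-annotated.filtered.bcf",
    input_path="cohort.input.bcf",
    include="gnomad_popmax_af < 0.01",
)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems with the `<` in the filter expression.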
When running echtvar encode, a json5 (json with comments and other nice features) file determines which columns are pulled from the input VCF and how they are stored.
A simple example is to pull a single integer field and give it a new name (alias):
[{"field": "AC", "alias": "gnomad_AC"}]
This will extract the "AC" field from the INFO and label it as "gnomad_AC" when later used to annotate a VCF. Note that it's important to give a descriptive, unique prefix like "gnomad_" so as not to collide with fields already in the query VCF.
Other examples are available here
And full examples are in the wiki
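For more than a handful of fields, the configuration can be generated rather than written by hand. The sketch below uses only the "field" and "alias" keys shown above; `build_config` is a hypothetical helper, "AN" is used purely as a second illustrative field name, and the collision check against the query VCF's INFO names is an added convenience, not something echtvar is claimed to do. Plain JSON is also valid json5, so the output of `json.dumps` can be used directly.

```python
import json


def build_config(fields, prefix, existing_info_fields=()):
    """Return a list of {"field", "alias"} entries with a common alias prefix.

    Raises ValueError if an alias would collide with a name already
    present in the query VCF (passed in as existing_info_fields).
    """
    existing = set(existing_info_fields)
    entries = []
    for field in fields:
        alias = f"{prefix}{field}"
        if alias in existing:
            raise ValueError(f"alias {alias!r} collides with an existing INFO field")
        entries.append({"field": field, "alias": alias})
    return entries


config = build_config(["AC", "AN"], "gnomad_")
config_text = json.dumps(config, indent=2)  # write this text to conf.json
```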
An optional expression will determine which variants are written. It can utilize any (and only) integer or float fields present in the echtvar file (not those present in the query VCF). An example could be:
-i 'gnomad_af < 0.01 && gnomad_nhomalts < 10'
The expressions are enabled by fasteval with supported syntax detailed here.
In brief, the normal operators (&&, ||, +, -, *, /, <, <=, >, >=, and groupings with ( and ), etc.) are supported and can be used to craft an expression that returns true or false as above.
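To make the behaviour of such an expression concrete, here is a small stand-in evaluator. It is not fasteval and does not reproduce its full syntax or numeric semantics; it is a sketch that maps `&&` and `||` onto Python's `and` and `or`, then walks the parsed tree while allowing only the operators listed above. The names `evaluate` and `_eval` are hypothetical.

```python
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_CMP_OPS = {ast.Lt: operator.lt, ast.LtE: operator.le,
            ast.Gt: operator.gt, ast.GtE: operator.ge}


def evaluate(expression, fields):
    """Evaluate a filter expression against a dict of numeric field values."""
    source = expression.replace("&&", " and ").replace("||", " or ").strip()
    tree = ast.parse(source, mode="eval")
    return bool(_eval(tree.body, fields))


def _eval(node, fields):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return fields[node.id]            # KeyError for unknown field names
    if isinstance(node, ast.BoolOp):
        values = (_eval(v, fields) for v in node.values)
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left, fields),
                                       _eval(node.right, fields))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand, fields)
    if isinstance(node, ast.Compare):
        left = _eval(node.left, fields)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in _CMP_OPS:
                raise ValueError("unsupported comparison operator")
            right = _eval(comparator, fields)
            if not _CMP_OPS[type(op)](left, right):
                return False
            left = right
        return True
    raise ValueError("unsupported syntax in expression")


variant = {"gnomad_af": 0.001, "gnomad_nhomalts": 3}
keep = evaluate("gnomad_af < 0.01 && gnomad_nhomalts < 10", variant)
```

A variant would be written only when the expression evaluates to true for its annotated values, so `keep` decides whether this hypothetical record appears in the output.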
Without these (and other) critical libraries, echtvar would not exist.
echtvar is developed in the Jeroen De Ridder lab.