A C++ library for efficient storage and retrieval of genomic variant-call data using TileDB Embedded.
The documentation website provides comprehensive usage examples, but here are a few quick exercises to get you started.
We'll use a dataset of 20 synthetic samples, each containing over 20 million variants. We host a publicly accessible version of this dataset on S3, so if you have TileDB-VCF installed and would like to follow along, just swap out the URIs below for s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20. And if you don't have TileDB-VCF installed yet, you can use our Docker images to test things out.
Export complete chr1 BCF files for a subset of samples:
tiledbvcf export \
  --uri vcf-samples-20 \
  --regions chr1:1-248956422 \
  --sample-names v2-usVwJUmo,v2-WpXCYApL
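If you want to script exports like the one above, one option is to assemble the command in Python first. The following is only a minimal sketch: `build_export_cmd` is a hypothetical helper (not part of TileDB-VCF), it uses only the flags shown in this README, and it prints the command rather than running it.

```python
import shlex

# Public demo dataset URI quoted earlier in this README.
DEMO_URI = "s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20"


def build_export_cmd(uri, regions, sample_names):
    """Hypothetical helper: build the argv list for `tiledbvcf export`.

    Only the flags that appear in this README are used.
    """
    return [
        "tiledbvcf", "export",
        "--uri", uri,
        "--regions", ",".join(regions),
        "--sample-names", ",".join(sample_names),
    ]


cmd = build_export_cmd(
    DEMO_URI,
    regions=["chr1:1-248956422"],
    sample_names=["v2-usVwJUmo", "v2-WpXCYApL"],
)

# Print a shell-quoted version for inspection.
print(shlex.join(cmd))
```

To actually run the export you could pass `cmd` to `subprocess.run(cmd, check=True)`, which assumes the `tiledbvcf` binary is on your `PATH`.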
Create a TSV file containing all variants within one or more regions of interest:
tiledbvcf export \
  --uri vcf-samples-20 \
  --sample-names v2-tJjMfKyL,v2-eBAdKwID \
  -Ot --tsv-fields "CHR,POS,REF,S:GT" \
  --regions "chr7:144000320-144008793,chr11:56490349-56491395"
Running the same query in Python
import tiledbvcf

ds = tiledbvcf.Dataset(uri="vcf-samples-20", mode="r")
ds.read(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    regions=["chr7:144000320-144008793", "chr11:56490349-56491395"],
    samples=["v2-tJjMfKyL", "v2-eBAdKwID"],
)
returns the results as a pandas DataFrame:
     sample_name  pos_start    fmt_GT
0    v2-nGEAqwFT  143999569  [-1, -1]
1    v2-tJjMfKyL  144000262  [-1, -1]
2    v2-tJjMfKyL  144000518  [-1, -1]
3    v2-nGEAqwFT  144000339  [-1, -1]
4    v2-nzLyDgYW  144000102  [-1, -1]
..           ...        ...       ...
566  v2-nGEAqwFT   56491395    [0, 0]
567  v2-ijrKdkKh   56491373    [0, 0]
568  v2-eBAdKwID   56491391    [0, 0]
569  v2-tJjMfKyL   56491392  [-1, -1]
570  v2-nzLyDgYW   56491365  [-1, -1]
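Once the results are in memory, a common next step is to summarize the genotypes. The sketch below counts, per sample, how many records have a genotype of `[-1, -1]`. It assumes that `-1` marks a missing allele (a no-call), and it works on plain dictionaries copied from a few rows of the output above, so it runs without TileDB-VCF installed. The helpers `is_no_call` and `no_call_counts` are hypothetical names, not part of the `tiledbvcf` API.

```python
from collections import Counter

# A few rows copied from the output above, as plain records.
rows = [
    {"sample_name": "v2-nGEAqwFT", "pos_start": 143999569, "fmt_GT": [-1, -1]},
    {"sample_name": "v2-tJjMfKyL", "pos_start": 144000262, "fmt_GT": [-1, -1]},
    {"sample_name": "v2-nGEAqwFT", "pos_start": 56491395, "fmt_GT": [0, 0]},
    {"sample_name": "v2-eBAdKwID", "pos_start": 56491391, "fmt_GT": [0, 0]},
    {"sample_name": "v2-tJjMfKyL", "pos_start": 56491392, "fmt_GT": [-1, -1]},
]


def is_no_call(gt):
    """True if every allele is -1 (assumed here to mean 'missing')."""
    return len(gt) > 0 and all(int(allele) == -1 for allele in gt)


def no_call_counts(records):
    """Count no-call records per sample name."""
    counts = Counter()
    for rec in records:
        if is_no_call(rec["fmt_GT"]):
            counts[rec["sample_name"]] += 1
    return counts


counts = no_call_counts(rows)
for sample, n in sorted(counts.items()):
    print(sample, n)
```

With a real query result, `df.to_dict("records")` on the returned DataFrame yields records in the same shape.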
All participants in TileDB spaces are expected to adhere to high standards of professionalism in all interactions. This repository is governed by the standards and reporting procedures detailed in the TileDB core repository's Code of Conduct.