TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License

Some questions about functionality #707

Closed andreaswallberg closed 2 months ago

andreaswallberg commented 5 months ago

Dear developers and community,

I am looking at this technology with much interest and I have a couple of questions.

  1. Using TileDB-VCF with TileDB Embedded, is there any built-in function to control user roles, credentials, and permissions (for example, who can store and export data), or does the tool simply inherit the UNIX/POSIX system users and permissions?
  2. Is the database able to store complex variants and haplotypes (for example, small or large indels, haplotype blocks) in addition to SNPs? How about phased vs. unphased variants?

Best regards, Andreas

leipzig commented 5 months ago

Hi Andreas,

  1. Yes, you are correct - with the open-source TileDB-VCF you manage your own users and groups on your storage, whether that is a local filesystem or S3. We provide the open-source package under an MIT license so you can always access your data. With TileDB Cloud (the SaaS and Enterprise versions) the AWS credential layer is managed on your behalf, and there is a whole system of governance in place to manage namespaces, users, and data sharing at a highly granular level.
  2. Anything that can be stored in a VCF file will be faithfully stored on the TileDB side - the database can store or retrieve any data supported by htslib. There is full support for SNPs throughout the stack, variants of any length can be retrieved in data frames, and phased/unphased genotypes (| vs /) are distinguishable in the VCF output when reading from a TileDB-VCF dataset. As you can see in this issue (https://github.com/TileDB-Inc/TileDB-VCF/issues/647), the anchor-gap or other parameters may sometimes need to be tweaked for performance in VCFs with a lot of large SVs.
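
To make the phased vs. unphased distinction concrete, here is a minimal sketch in plain Python. It does not use the TileDB-VCF API; `parse_gt` and `variant_kind` are hypothetical helper names, and the classification rules are a simplified reading of the VCF conventions (`|` for phased, `/` for unphased, `.` for a missing allele, `<...>` or bracketed ALT values for symbolic or structural variants).

```python
from typing import List, Optional, Tuple


def parse_gt(gt: str) -> Tuple[List[Optional[int]], bool]:
    """Split a VCF GT string into allele indices and a phased flag.

    Hypothetical helper: assumes a single separator type per genotype.
    """
    phased = "|" in gt
    sep = "|" if phased else "/"
    # "." marks a missing allele call; keep it as None.
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased


def variant_kind(ref: str, alt: str) -> str:
    """Roughly classify one REF/ALT pair (simplified, illustrative only)."""
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic/SV"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


if __name__ == "__main__":
    for gt in ("0|1", "0/1", "./."):
        print(gt, parse_gt(gt))
    for ref, alt in (("A", "G"), ("A", "ATG"), ("ATG", "A"), ("N", "<DEL>")):
        print(ref, alt, variant_kind(ref, alt))
```

The same checks could be applied to the genotype and allele columns of a data frame exported from a dataset, though the exact column names depend on which attributes are requested.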

Thanks for reaching out. Lemme know if you have further questions!

andreaswallberg commented 5 months ago

Thanks for the quick and clear reply!

  1. Perhaps this is not something the open-source project is meant to support, but is there a migration path from TileDB Embedded to TileDB Cloud such that data can be uploaded without complicated conversion or risk of data loss?
  2. Moreover, in either TileDB Embedded or Cloud, is there a concept of versioning in the database so that users with different project roles may interact with different versions or segments of the database?

leipzig commented 5 months ago

  1. The way TileDB Cloud works is that the data stays in your S3 buckets. You then "register" these existing TileDB-VCF stores in TileDB Cloud to allow you to compute on them, as well as manage their permissions within your group. So there is no conversion or data loss - TileDB Open Source and TileDB Cloud use the exact same data in the same location.

  2. There is versioning but access is not restricted to certain versions - if you have access to the array then you can time-travel back to any point, like git. Granular access to certain samples is on the roadmap. We'd love to understand your intended use so we can consider it in our plans.
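
As a rough illustration of the "time-travel" idea, the sketch below is a toy model in plain Python, not the TileDB or TileDB-VCF API. `ToyVersionedStore` and its methods are hypothetical names. It assumes every write is kept as an immutable timestamped record, so a read "as of" a timestamp simply ignores later writes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyVersionedStore:
    """Toy append-only store; illustrative only, not how TileDB is implemented."""

    # Each write is kept as an immutable (timestamp, key, value) record.
    _writes: List[Tuple[int, str, str]] = field(default_factory=list)

    def write(self, timestamp: int, key: str, value: str) -> None:
        self._writes.append((timestamp, key, value))

    def read(self, at: Optional[int] = None) -> Dict[str, str]:
        """Return the view as of timestamp `at` (latest view if None)."""
        view: Dict[str, str] = {}
        # Replay writes in timestamp order; later writes win.
        for ts, key, value in sorted(self._writes, key=lambda w: w[0]):
            if at is None or ts <= at:
                view[key] = value
        return view


if __name__ == "__main__":
    store = ToyVersionedStore()
    store.write(10, "sample1", "v1")
    store.write(20, "sample2", "v1")
    store.write(30, "sample1", "v2")
    print(store.read(at=15))  # view before sample2 was added
    print(store.read())       # latest view
```

In this model anyone who can read the store can read every past state, which matches the point above that access is not restricted to certain versions.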