TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License

Feature request: exporting regions based on annotations #790

Open andreaswallberg opened 21 hours ago

andreaswallberg commented 21 hours ago

Dear developers,

I wonder if it is possible to export SNPs for loci or regions based on annotations derived from the original VCF.

For example, exporting all SNPs with missense variants for gene X.
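To make the request concrete, here is a minimal sketch of the kind of selection meant above. It assumes the annotations are SnpEff-style `ANN` strings (pipe-separated, with allele, effect, impact and gene name as the first four fields) and that the records have already been read into plain Python dictionaries. The record layout, the helper names `parse_ann` and `select_variants`, and the gene name `GENE_X` are hypothetical and are not part of TileDB-VCF.

```python
def parse_ann(ann_value):
    """Parse a SnpEff-style ANN value into a list of dicts.

    Assumption: entries are comma-separated, fields are pipe-separated, and
    the first four fields are allele, effect(s), impact and gene name.
    """
    parsed = []
    for entry in ann_value.split(","):
        fields = entry.split("|")
        parsed.append({
            "allele": fields[0],
            # Several effects for one entry are joined with '&'.
            "effects": fields[1].split("&"),
            "impact": fields[2],
            "gene": fields[3],
        })
    return parsed


def select_variants(records, gene, effect):
    """Return the records that carry `effect` on `gene` in any ANN entry."""
    return [
        rec for rec in records
        if any(a["gene"] == gene and effect in a["effects"]
               for a in parse_ann(rec["ann"]))
    ]


# Hypothetical records, as they might look after reading an annotated VCF.
records = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G",
     "ann": "G|missense_variant|MODERATE|GENE_X|GENE_X_ID"},
    {"chrom": "chr1", "pos": 250, "ref": "C", "alt": "T",
     "ann": "T|synonymous_variant|LOW|GENE_X|GENE_X_ID"},
    {"chrom": "chr2", "pos": 300, "ref": "G", "alt": "A",
     "ann": "A|missense_variant|MODERATE|GENE_Y|GENE_Y_ID"},
]

hits = select_variants(records, gene="GENE_X", effect="missense_variant")

# Turn the matching positions into region strings for a region-based export.
regions = [f'{r["chrom"]}:{r["pos"]}-{r["pos"]}' for r in hits]
print(regions)
```

The resulting region strings could then be handed to a region-based export, so the annotation lookup and the actual data export stay separate steps.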

If not, I would like to request that feature, which could be very useful in day-to-day work.

A connected issue is: how does SNP data actually relate to annotations?

Let's say I ingest a VCF that is annotated with gene annotations v1.0 in a first batch. At a later stage, I ingest a VCF containing new samples but with refined annotations, i.e. v2.0, in a second batch. What happens to the annotations at this point, how do they relate to each batch, and how can I interact with them?

Can I re-annotate SNPs that are shared between the two batches? Moreover, can I update annotations without ingesting new samples?

leipzig commented 15 hours ago

Hi Andreas, our Academy has some content about using annotations in this manner: https://cloud.tiledb.com/academy/structure/life-sciences/population-genomics/tutorials/advanced/annotations/

We have some solutions for annotating TileDB-VCF datasets in TileDB Cloud in an N+1 manner, so that only newly encountered variants are annotated. These produce external arrays.
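As a rough illustration of the N+1 idea only, and not of how TileDB Cloud implements it, the sketch below keeps annotations in a store that is separate from the variant data and annotates just the variants it has not seen before. The dictionary-based store, the `annotate_incrementally` helper and the placeholder `fake_annotator` are all hypothetical.

```python
def annotate_incrementally(store, batch_variants, annotate):
    """Annotate only the variants in `batch_variants` missing from `store`.

    `store` maps a variant key (chrom, pos, ref, alt) to its annotation and
    is updated in place. Returns the list of keys that were newly annotated.
    """
    # dict.fromkeys removes duplicates within the batch while keeping order.
    new_keys = [key for key in dict.fromkeys(batch_variants) if key not in store]
    for key in new_keys:
        store[key] = annotate(key)
    return new_keys


def fake_annotator(key):
    """Placeholder for a real annotation tool; returns a dummy label."""
    chrom, pos, ref, alt = key
    return f"{chrom}:{pos}{ref}>{alt}"


# The annotation store lives outside the variant data itself.
annotation_store = {}

batch_1 = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")]
batch_2 = [("chr1", 250, "C", "T"), ("chr2", 300, "G", "A")]

first = annotate_incrementally(annotation_store, batch_1, fake_annotator)
second = annotate_incrementally(annotation_store, batch_2, fake_annotator)

print(len(first), len(second), len(annotation_store))
```

Because the store is keyed by variant rather than by sample or batch, a variant shared between two batches is annotated once, and swapping in a newer annotation version would mean rebuilding or versioning the store rather than re-ingesting samples.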