TileDB-Inc / TileDB

The Universal Storage Engine
https://tiledb.com
MIT License

Combining multiple TileDB stores? #1475

Open · Hoeze opened this issue 4 years ago

Hoeze commented 4 years ago

Hi, is there a way to overlay/concatenate multiple TileDB stores? @stavrospapadopoulos explained the concept of fragments to me in https://github.com/TileDB-Inc/TileDB/issues/1470, and this looks like a straightforward starting point for combining multiple TileDB stores that share the same schema.

Example use case: We'd like to use TileDB-VCF to store all of our genomic variant data in a single data store while keeping the different datasets separate:

- TileDB-VCF
    + dataset-1
    + dataset-2

This is necessary for multiple reasons:

- Depending on the type of analysis, one might only have a subset of the data visible
- From a technical point of view, we want to be able to re-create individual datasets
- Different people in our lab have access to different datasets => we have to ensure at the filesystem level that people cannot access data they are not allowed to see

stavrospapadopoulos commented 4 years ago

Hi Florian,

This is a very insightful suggestion. It is related to a feature we have been discussing internally, namely array views. We want to enable a user to compose a "logical" array (or non-materialized view in DB jargon) from subarrays of different "physical" arrays.

In theory, a simplified version of this can be implemented rather easily, but it will require:

* The arrays to have an identical schema (as you mentioned)

* Each `dataset-*` to be written to a disjoint subarray in the array domain (e.g., `dataset-1` in rows 1-100, but `dataset-2` in rows 101-200).

If those conditions hold, then it should be just API work rather than actual algorithmic work. We can certainly consider that.
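To make the idea concrete, here is a minimal sketch of how such a view could be emulated from Python today (the URIs and the attribute name are placeholders; the arrays are assumed to be dense, single-attribute, and written to the disjoint row ranges above):

```python
import numpy as np
import tiledb

# Placeholder URIs: two arrays with an identical schema, each owning a
# disjoint row range of the logical domain (dataset-1: rows 1-100,
# dataset-2: rows 101-200).
dataset_uris = ["dataset-1", "dataset-2"]

def read_logical_view(uris, attr="a"):
    """Manually emulate a 'logical' array: read each physical array
    and stack the results along the row dimension."""
    parts = []
    for uri in uris:
        with tiledb.open(uri, mode="r") as A:
            parts.append(A[:][attr])  # full dense read of one attribute
    return np.concatenate(parts, axis=0)

logical = read_logical_view(dataset_uris)  # rows 1-200 in one array
```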

  • Depending on the type of analysis, one might only have a subset of the data visible

I understand the access control issue, but could you please elaborate on this?

  • From a technical point of view, we want to be able to re-create individual datasets

Could you please clarify what you mean by "re-create individual datasets"? For instance, in the case of TileDB-VCF you can even export the original single-sample VCF files from any array. And for the generic TileDB array case you can of course always slice only the desired rows and columns.

  • Different people in our lab have access to different datasets => we have to ensure at the filesystem level that people cannot access data they are not allowed to see

We certainly understand this very well. It may be worth mentioning TileDB Cloud (which may or may not be relevant to your use case down the road), only to point out that we took a different (more DB-oriented) approach to access control than relying on the filesystem. That may be important if you ever want to apply more fine-grained access policies (e.g., at the sample or gene level), which may end up being very cumbersome if you use the filesystem for that.

Hoeze commented 4 years ago

This is a very insightful suggestion. It is related to a feature we have been discussing internally, namely array views. We want to enable a user to compose a "logical" array (or non-materialized view in DB jargon) from subarrays of different "physical" arrays.

In theory, a simplified version of this can be implemented rather easily, but it will require:

* The arrays to have an identical schema (as you mentioned)

* Each `dataset-*` to be written to a disjoint subarray in the array domain (e.g., `dataset-1` in rows 1-100, but `dataset-2` in rows 101-200).

If those conditions hold, then it should be just API work rather than actual algorithmic work. We can certainly consider that.

Array views would certainly be a great solution for our use case.

  • Depending on the type of analysis, one might only have a subset of the data visible

I understand the access control issue, but could you please elaborate on this?

This is meant as a kind of zero-cost array slicing. Consider, e.g., that you want to calculate summary statistics on a combination of selected experiments: either we slice by the source of each sample, or we just say `variant_ds = tiledb.overlay([ds1, ds3, ds7])`.
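Roughly what I have in mind, as a hypothetical sketch (no `tiledb.overlay` exists today; the URIs and the attribute name are placeholders):

```python
import numpy as np
import tiledb

class Overlay:
    """Hypothetical lazy view over arrays with an identical schema:
    no data is read until the view is actually queried."""

    def __init__(self, uris):
        self.uris = list(uris)  # zero cost: just remember the sources

    def read(self, attr):
        # Only at query time are the physical arrays touched.
        parts = []
        for uri in self.uris:
            with tiledb.open(uri, mode="r") as A:
                parts.append(A[:][attr])
        return np.concatenate(parts, axis=0)

# Summary statistics over a selected combination of experiments:
variant_ds = Overlay(["ds1", "ds3", "ds7"])
print(variant_ds.read("depth").mean())
```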

  • From a technical point of view, we want to be able to re-create individual datasets

Could you please clarify what you mean by "re-create individual datasets"? For instance, in the case of TileDB-VCF you can even export the original single-sample VCF files from any array. And for the generic TileDB array case you can of course always slice only the desired rows and columns.

We usually use tools like Snakemake to build pipelines for reproducible analyses. Snakemake checks whether a needed file or folder exists or has to be re-created. If we want to re-run the workflow (for example, after modifying it), we simply delete the dataset and Snakemake does the rest. With a monolithic database we would instead have to run some script that deletes the previously added entries, which could end up corrupting the whole dataset.

  • Different people in our lab have access to different datasets => we have to ensure at the filesystem level that people cannot access data they are not allowed to see

We certainly understand this very well. It may be worth mentioning TileDB Cloud (which may or may not be relevant to your use case down the road), only to point out that we took a different (more DB-oriented) approach to access control than relying on the filesystem. That may be important if you ever want to apply more fine-grained access policies (e.g., at the sample or gene level), which may end up being very cumbersome if you use the filesystem for that.

stavrospapadopoulos commented 4 years ago

Array views would certainly be a great solution for our use case.

Great! Do you mind adding a feature request at https://feedback.tiledb.com/tiledb-core? We will try to work on it in the near future.

  • Either we slice by the source of each sample, or we just say `variant_ds = tiledb.overlay([ds1, ds3, ds7])`.

Just to clarify, you can do that already with TileDB-VCF, which allows you to store any number of (g)VCF samples in a single 2D sparse array with samples as the rows and genomic positions as the columns. The only difference from the array views feature is that you currently need to query each dataset (2D array) and manage the results separately. The array view would simplify that in the sense that the result across multiple datasets would be organized in a single set of buffers (or dataframe if you are using our Python/Dask or Spark integrations). Do I understand this correctly?
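For illustration, this is what the per-dataset querying looks like today, sketched with the TileDB-VCF Python API (the dataset URIs and region are placeholders):

```python
import pandas as pd
import tiledbvcf

# Placeholder URIs: one TileDB-VCF dataset (2D sparse array) per dataset.
uris = ["dataset-1", "dataset-2"]
region = "chr1:1-1000000"  # example genomic region

# Today: query each dataset separately and combine the results by hand.
# An array view would return a single result set across all datasets.
frames = []
for uri in uris:
    ds = tiledbvcf.Dataset(uri, mode="r")
    frames.append(ds.read(attrs=["sample_name", "pos_start", "alleles"],
                          regions=[region]))
combined = pd.concat(frames, ignore_index=True)
```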

When using a monolithic database we would have to run some script that deletes the previously added entries, but that could end up corrupting the whole dataset.

TileDB has a more flexible way to handle this than a monolithic database. I would suggest taking a look at the following docs:

- https://docs.tiledb.com/developer/basic-concepts/definitions/fragment
- https://docs.tiledb.com/developer/basic-concepts/physical-storage
- https://docs.tiledb.com/developer/basic-concepts/reading#time-traveling
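In short, every write creates an immutable fragment rather than mutating the array in place, and you can open the array as of an earlier point in time. A sketch of time traveling (the URI and the ingestion step are placeholders):

```python
import time
import tiledb

uri = "my_array"  # placeholder URI

# Record a timestamp (TileDB timestamps are in milliseconds) before
# ingesting a new dataset.
ts_before = int(time.time() * 1000)

# ... ingest dataset-2 here; the write lands in a new, immutable fragment ...

# Time traveling: open the array as of the earlier timestamp, so reads
# behave as if dataset-2 had never been written.
with tiledb.open(uri, mode="r", timestamp=ts_before) as A:
    old_view = A[:]
```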

Happy to rehash this on a separate thread when the time comes.

Hoeze commented 4 years ago

Array views would certainly be a great solution for our use case.

Great! Do you mind adding a feature request at https://feedback.tiledb.com/tiledb-core? We will try to work on it in the near future.

Done! See https://feedback.tiledb.com/tiledb-core/p/array-views

  • Either we slice by the source of each sample, or we just say `variant_ds = tiledb.overlay([ds1, ds3, ds7])`.

Just to clarify, you can do that already with TileDB-VCF, which allows you to store any number of (g)VCF samples in a single 2D sparse array with samples as the rows and genomic positions as the columns. The only difference from the array views feature is that you currently need to query each dataset (2D array) and manage the results separately. The array view would simplify that in the sense that the result across multiple datasets would be organized in a single set of buffers (or dataframe if you are using our Python/Dask or Spark integrations). Do I understand this correctly?

Yes, exactly. The idea would be to eliminate the need to query each dataset separately.

Also, I could simply concatenate `/dataset/A` and `/dataset/B` along the sample dimension without any language-specific code. Everything would look like a single logical TileDB dataset.

stavrospapadopoulos commented 4 years ago

Thanks @Hoeze! Yeah, that would be a great feature. I hope we get to work on it soon.