TileDB-Inc / TileDB

The Universal Storage Engine
https://tiledb.com
MIT License

Idea: include a bloom filter in sparse array MBRs #2375

Open gatesn opened 3 years ago

gatesn commented 3 years ago

My own motivation for this comes from modelling labelled dimensions with dictionary encoding. For example, I have labels A: 0, B: 1, C: 2. When slicing an array for label B, any fragment/tile whose MBR spans labels A and C is considered relevant, even if it contains no cells for B.
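To make the idea concrete, here is a minimal sketch of a per-tile summary that stores a bloom filter alongside the MBR. It is not TileDB code: `BloomFilter`, `TileSummary`, the filter size, and the hash construction are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by prefixing a seed byte to the value.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


class TileSummary:
    """Hypothetical per-tile metadata: a 1-D MBR plus a bloom filter."""

    def __init__(self, codes):
        self.mbr = (min(codes), max(codes))
        self.bloom = BloomFilter()
        for code in codes:
            self.bloom.add(code)

    def mbr_overlaps(self, code):
        return self.mbr[0] <= code <= self.mbr[1]

    def may_contain(self, code):
        # Check the cheap MBR test first, then consult the bloom filter.
        return self.mbr_overlaps(code) and self.bloom.might_contain(code)


# A tile holding only labels A (0) and C (2).
tile = TileSummary([0, 2])
print(tile.mbr_overlaps(1))  # the MBR alone cannot rule out label B (1)
print(tile.may_contain(1))   # the bloom filter can, barring a false positive
```

A bloom filter never reports a false negative, so a tile that does hold the label is never skipped; the only cost of a false positive is reading a tile that the MBR check would have read anyway.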

I understand there may be discussions/thoughts on supporting labelled dimensions in a first-class way, so I'm not sure whether this idea is generally applicable beyond this use-case. I suspect it is, though, given that other formats support it, e.g. Parquet: https://github.com/apache/parquet-format/blob/master/BloomFilter.md. One other use-case that comes to mind is var-sized string/byte dimensions.

It might also make sense to first generalise the on-disk fragment metadata format to allow for arbitrary extensions to "metadata" (bloom filters, value sets, other dimension statistics). This would make it easier to add further metadata in the future, and would enable forward compatibility: old readers could still read files from newer writers by simply ignoring any metadata feature they don't support.
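One common way to get this kind of forward compatibility is a tag-length-value layout, where a reader can skip any entry whose tag it does not recognise. The sketch below is only an illustration of that pattern and does not reflect the actual TileDB fragment metadata format; the tag numbers, the header layout, and the function names are assumptions.

```python
import struct

# Hypothetical header: 2-byte tag followed by 4-byte payload length, little-endian.
HEADER = struct.Struct("<HI")


def encode_extensions(entries):
    """Serialise (tag, payload_bytes) pairs into one buffer."""
    out = bytearray()
    for tag, payload in entries:
        out += HEADER.pack(tag, len(payload))
        out += payload
    return bytes(out)


def decode_extensions(buf, known_tags):
    """Return {tag: payload} for known tags, skipping everything else."""
    result = {}
    offset = 0
    while offset < len(buf):
        tag, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        if tag in known_tags:
            result[tag] = buf[offset:offset + length]
        # Unknown tags are skipped using the length field alone.
        offset += length
    return result


# A newer writer emits tag 99, which an older reader does not understand.
blob = encode_extensions([(1, b"bloom-bits"), (99, b"future-feature")])
print(decode_extensions(blob, known_tags={1}))
```

Because the length is stored outside the payload, an old reader does not need to know how a new extension is encoded in order to step over it.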

stavrospapadopoulos commented 3 years ago

@gatesn this is already on our roadmap, along with dictionary compression, RLE compression for strings, and min/max values for attribute tiles. We hope to implement those soon.