Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Make Blocks addressable from the file reader #322

Open osopardo1 opened 5 months ago

osopardo1 commented 5 months ago

From v0.6.0 onwards, a Table is composed of files that contain multiple blocks, each belonging to the same cube or to different cubes. This is part of the Multiblock format, which allows Qbeast to balance the file layout without losing the benefits of indexing.

Blocks help us locate a particular cube within a file, but a single block is not addressable or retrievable from the Spark reader. Although we use Delta File Skipping to discard data based on min/max statistics, we do not support that kind of fine-grained search when Sampling is applied.

This change requires some work related to #175. DataSource V2 is more extensible and allows us to implement our own reader. In this case, the reader should be designed to skip entire groups of rows based on the block number.
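
To make the idea of skipping groups of rows by block number concrete, here is a minimal Python sketch. It is not the qbeast-spark implementation or API: the `Block` descriptor, its fields, and the `row_ranges_to_read` helper are all hypothetical, and the sketch assumes each block occupies a contiguous run of rows inside its file. Given the block numbers a query needs, it returns the row ranges a reader would scan and skips everything else.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical block descriptor (not a qbeast-spark class)."""
    number: int      # position of the block inside the file
    cube_id: str     # cube the block belongs to
    first_row: int   # index of the first row of the block (inclusive)
    row_count: int   # number of rows in the block


def row_ranges_to_read(blocks: Iterable[Block], wanted: Set[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges covering only the wanted blocks.

    Assumption: blocks are contiguous, non-overlapping runs of rows.
    Adjacent wanted blocks are merged into a single range.
    """
    ranges: List[Tuple[int, int]] = []
    for block in sorted(blocks, key=lambda b: b.first_row):
        if block.number not in wanted:
            continue  # skip the whole group of rows for this block
        start, end = block.first_row, block.first_row + block.row_count
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)  # extend the previous range
        else:
            ranges.append((start, end))
    return ranges


# Example file layout: three blocks, two of them from the same cube.
file_blocks = [
    Block(number=0, cube_id="root", first_row=0, row_count=100),
    Block(number=1, cube_id="A", first_row=100, row_count=50),
    Block(number=2, cube_id="root", first_row=150, row_count=25),
]
```

With this layout, asking for blocks 0 and 2 yields two separate ranges and the rows of block 1 are never read. How block boundaries would actually map to the underlying file format is left open here.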

PS: This is something that @alexeiakimov tried in previous issues, but other priorities came up.

TODOs: