Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Roll-up Leaves Command #150

Closed osopardo1 closed 10 months ago

osopardo1 commented 1 year ago

The small files problem is a hot and complex topic in Data Lakehouse systems nowadays. Heavy write workloads can degrade the data layout by producing lots of small files, which significantly increases the cost of listing and reading from object storage.

One solution is Compaction, which uses a bin-packing algorithm to group small files and produce bigger ones. But this approach is too generic for Qbeast: each file belongs to a cube, so we would need to add another grouping layer on top.
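To make the contrast concrete, here is a minimal sketch of size-based bin-packing. It uses the first-fit-decreasing heuristic as an assumption; it is not necessarily what any particular engine does. The function name `bin_pack` and the `target_size` parameter are hypothetical. Note that it only looks at file sizes and knows nothing about which cube a file belongs to.

```python
from typing import List


def bin_pack(file_sizes: List[int], target_size: int) -> List[List[int]]:
    """Group file sizes into bins of at most target_size (first-fit decreasing)."""
    bins: List[List[int]] = []
    # Place the largest files first; each file goes into the first bin with room.
    for size in sorted(file_sizes, reverse=True):
        for group in bins:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            # No existing bin can take this file, so open a new one.
            bins.append([size])
    return bins


groups = bin_pack([40, 10, 30, 20, 50, 5], target_size=64)
print(groups)  # [[50, 10], [40, 20], [30, 5]]
```

Six small files become three larger ones, but files from unrelated cubes can end up in the same output file.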

Another solution that we can think of is what we call roll-up.
Roll-up recursively groups sibling payloads and sends them to their parent cubes as long as the resulting size stays under a certain threshold, resulting in fewer and larger blocks to write.

Here's a visual example of rolling up the leaf cubes of the tree.

[Image: Screenshot 2023-01-23 at 12 45 43]
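The roll-up idea can also be sketched in a few lines of code. This is only an illustration under stated assumptions, not the code from #146 or #147: cubes are identified by hypothetical string ids where the parent of `"01"` is `"0"` and the root is `""`, sizes are arbitrary units, and `roll_up` and `max_size` are made-up names.

```python
from typing import Dict


def roll_up(cube_sizes: Dict[str, int], max_size: int) -> Dict[str, str]:
    """Map each cube id to the cube whose file will hold its payload."""
    group_size = dict(cube_sizes)  # accumulated size of the group rooted at each cube
    merged_into: Dict[str, str] = {}

    # Visit deeper cubes first so leaves are considered before their ancestors.
    for cube in sorted(cube_sizes, key=lambda c: (-len(c), c)):
        if cube == "":
            continue  # the root has no parent to roll into
        parent = cube[:-1]
        if parent not in group_size:
            continue  # sketch assumption: only roll into parents present in the index
        combined = group_size[parent] + group_size[cube]
        if combined <= max_size:
            # Send this cube (and anything already rolled into it) to its parent.
            group_size[parent] = combined
            merged_into[cube] = parent

    def resolve(cube: str) -> str:
        # Follow the chain of merges up to the cube that keeps the payload.
        while cube in merged_into:
            cube = merged_into[cube]
        return cube

    return {cube: resolve(cube) for cube in cube_sizes}


sizes = {"": 50, "0": 40, "1": 30, "00": 10, "01": 15, "10": 60, "11": 5}
print(roll_up(sizes, max_size=70))
# {'': '', '0': '0', '1': '1', '00': '0', '01': '0', '10': '10', '11': '1'}
```

In this example seven cubes end up in four files: leaves `00` and `01` roll into `0`, `11` rolls into `1`, and `10` stays on its own because merging it would exceed the threshold.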

We implemented it in the Vanilla version of the code (#146) as well as in the Improved Double Pass (#147). Although the solution on the Improved Double Pass showed better read results on the TPC-DS workload, there's a penalty to consider at write time: a bigger portion of computation and resources is used to analyze the tree structure and modify the final index results.

Instead of merging this into the base implementation, we are considering making an external command that compacts those leaves into their parents and outputs a better file structure.

It could be something like:

QbeastTable.rollUp(tableID, maxFileSize...)

osopardo1 commented 1 year ago

Does this implementation still make sense given the current #173 approach? @Jiaweihu08

Jiaweihu08 commented 1 year ago

> Does this implementation still make sense given the current #173 approach? @Jiaweihu08

I think it does. Staging enforces a lower bound for the writes, but the problem of leaves having smaller sizes remains the same.

osopardo1 commented 1 year ago

Okay, I am going to keep it alive for discussion/documentation purposes.

osopardo1 commented 10 months ago

Since new developments in #210 will include roll-up code in the main branch, I am peacefully closing this issue.