delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Bucketing implementation in Delta Lake #3495

Open wudanzy opened 3 months ago

wudanzy commented 3 months ago

Feature request

Which Delta project/connector is this regarding?

Overview

Implement bucketing in Delta Lake to speed up aggregations and joins.

Motivation

Delta Lake currently doesn’t support bucketing. This leads to inefficiency for two kinds of use cases: aggregations and joins, both of which incur shuffles that bucketing could avoid.

Bucketing was proposed in Spark to solve the above problems (see the original JIRA and design), and Spark has supported it for several years. Delta Lake, however, does not. Delta Lake has developed Z-ordering and liquid clustering, but both features target data skipping, so neither helps avoid unnecessary shuffles in aggregations and joins.
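To illustrate why bucketing removes the shuffle, here is a minimal pure-Python sketch; it is not taken from the linked design. It assumes both tables are written with the same bucket count and the same hash on the join key. `NUM_BUCKETS`, `bucket_id`, `write_bucketed`, and `bucketed_join` are hypothetical names, and `zlib.crc32` stands in for Spark's Murmur3-based bucket hash. In Spark itself, bucketed writes are requested through `DataFrameWriter.bucketBy`.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, fixed at write time


def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket with a deterministic hash (stand-in for Murmur3)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Group rows by the hash of their first column, as a bucketed write would."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right):
    """Inner-join two identically bucketed tables one bucket pair at a time.

    Equal keys always share a bucket id, so no row has to move
    between buckets, which is the shuffle that bucketing avoids.
    """
    result = []
    for b, left_rows in left.items():
        # Build a per-bucket lookup on the right side only.
        lookup = defaultdict(list)
        for row in right.get(b, []):
            lookup[row[0]].append(row)
        for lrow in left_rows:
            for rrow in lookup.get(lrow[0], []):
                result.append((lrow[0], lrow[1], rrow[1]))
    return result


orders = [(1, "book"), (2, "pen"), (3, "lamp"), (1, "desk")]
users = [(1, "ann"), (2, "bob"), (4, "cy")]

joined = bucketed_join(write_bucketed(orders), write_bucketed(users))
print(sorted(joined))
```

Z-ordering and liquid clustering do not give this guarantee: they co-locate similar values to improve file pruning, but they do not promise that every row for a given key lands in one known partition, so the engine still has to shuffle before a join or aggregation.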

Further details

The design is here.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

MasterDDT commented 3 months ago

@dennyglee Hi from ActionIQ, once the design doc has some comments and is updated, could we get someone from Delta org to take a look?

dennyglee commented 3 months ago

Sorry for missing this @MasterDDT - will review this shortly!

MasterDDT commented 1 month ago

@dennyglee was anybody able to review ^^ doc?