delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.17k stars 1.62k forks source link

[Feature Request] Liquid Clustering #1874

Open tdas opened 11 months ago

tdas commented 11 months ago

Overview

We propose to introduce Liquid Clustering, a new effort to revamp how clustering works in Delta, which addresses the shortcomings of Hive-style partitioning and current ZORDER clustering.

Motivation

Partitioning/clustering is a common technique to manage datasets to reduce unnecessary data processing. Hive-style partitioning and ZORDER (multi-dimensional) clustering are existing solutions in Delta, but both have limitations.

Hive-style partitioning

Hive-style partitioning clusters data such that every file contains exactly one distinct combination of partition column values. Although Hive-style partitioning works well if tuned correctly, there are limitations:

ZORDER clustering

ZORDER is a muti-dimensional clustering technique used in Delta. The OPTIMIZE ZORDER BY command applies ZORDER clustering and improves the performance of queries that utilize ZORDER BY columns in their predicates. However, it has the following limitations:

Detailed Design

Please refer to the document here.

akshaysinghas commented 11 months ago

Hi guys, do we have any white paper or link that I can refer to understand MDC, or liquid clustering? I know requirement doc is on the way but any other resource that can throw more light on this feature?

felipepessoto commented 11 months ago

This feature in on Delta Lake 3.0.0 Preview release notes, is it expected to have an 3.0.0 RC2 with the feature?

Usually, the Delta final version is released few weeks after RC1. It seems this won't make to the 3.0.0 final?

kangkaisen commented 11 months ago

@tdas Hi,does Liquid Clustering use the DBSCAN algorithm? Thanks.

nachiketrajput-upl commented 11 months ago

@akshaysinghUdaan: Kindly use the link, https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering

loquisgon commented 10 months ago

This needs a LOT more detail: "Liquid clustering uses a continuous fractal space filling curve as a multi-dimensional clustering technique, which significantly improves data skipping"

loquisgon commented 10 months ago

"For multidimensional databases, Hilbert order has been proposed to be used instead of Z order because it has better locality-preserving behavior. For example, Hilbert curves have been used to compress and accelerate R-tree indexes[9] (see Hilbert R-tree). They have also been used to help compress data warehouses.[10][11]"

https://en.wikipedia.org/wiki/Hilbert_curve#:~:text=The%20Hilbert%20curve%20(also%20known,by%20Giuseppe%20Peano%20in%201890.

imback82 commented 10 months ago

This needs a LOT more detail: "Liquid clustering uses a continuous fractal space filling curve as a multi-dimensional clustering technique, which significantly improves data skipping"

@loquisgon we are working on the design doc and we will have more details on clustering algorithm. Stay tuned!

KamilKandzia commented 9 months ago

@imback82 Any progress on the doc or implementing solution in delta project? Databricks 13.2 has already implemented the liquid clustering, but we still have lack of information about this solution

dabao521 commented 7 months ago

@vkorukanti can you assign this issue to me?

zedtang commented 7 months ago

Hello everyone, we've added the design doc to this issue. Please feel free to review it. Thank you!

felipepessoto commented 7 months ago

@zedtang if you don't mind to share the doc using this option which allows to print and comment (if you select Commenter). Even if you leave Viewer permission only, at least that option allow us to print it.

image

https://support.google.com/drive/answer/2494822?hl=en&co=GENIE.Platform%3DDesktop#zippy=%2Cshare-a-file-publicly

Previous design docs were shared that way

zedtang commented 7 months ago

Hi @felipepessoto , I updated the link to give everyone commenter access!

zmeir commented 7 months ago

Hi, will it be possible to use liquid clustering on partitioned tables?

ghost commented 6 months ago

In the design doc I don't see any info if clustering can be applied to existing Delta tables. As we use liquid from databricks already we observed that either you have to apply liquid to an empty table or you have to use the like command which basically copies and creates a new Delta table

I guess it will be really frustrating if we can't apply liquid to existing Delta tables :)

Thanks already everyone

auckenth commented 6 months ago

According to https://delta.io/pdfs/DLTDG_ER3.pdf (page 131) k-d trees are used to implement liquid clustering instead of space filling curves. Are there any pros and cons for an implementation using k-d trees?

imback82 commented 6 months ago

In the design doc I don't see any info if clustering can be applied to existing Delta tables. As we use liquid from databricks already we observed that either you have to apply liquid to an empty table or you have to use the like command which basically copies and creates a new Delta table

I guess it will be really frustrating if we can't apply liquid to existing Delta tables :)

Good point. I think we should be able to migrate existing tables in place by ALTER TABLE ... CLUSTER BY .... This should be straightforward for un-partitioned tables, but it will be tricky to implement it for partitioned tables.

dennyglee commented 6 months ago

According to https://delta.io/pdfs/DLTDG_ER3.pdf (page 131) k-d trees are used to implement liquid clustering instead of space filling curves. Are there any pros and cons for an implementation using k-d trees?

Oh sorry about that @auckenth - this is an error within the book. We will be updating the book chapter with the latest info per the design doc.

ghost commented 6 months ago

In the design doc I don't see any info if clustering can be applied to existing Delta tables. As we use liquid from databricks already we observed that either you have to apply liquid to an empty table or you have to use the like command which basically copies and creates a new Delta table I guess it will be really frustrating if we can't apply liquid to existing Delta tables :)

Good point. I think we should be able to migrate existing tables in place by ALTER TABLE ... CLUSTER BY .... This should be straightforward for un-partitioned tables, but it will be tricky to implement it for partitioned tables.

@imback82 This would be great. I think considering non-partitioned tables which are optionally z-ordered is enough as a baseline. For partitioned tables you neverless accept that any changes might need an overwrite.

acruise commented 2 months ago

It'd be nice if someone could address comments and questions in the design doc. :)

adriangb commented 1 month ago

Sorry if this is not the right place to ask, but is there anyway with liquid cluster to enforce some level of clustering? In particular I want to cluster by something like project,day and implement retention such that I delete data older than X days which is different for each project. With hive style I can partition on those columns and delete entire files without rewrites.

zedtang commented 3 weeks ago

It'd be nice if someone could address comments and questions in the design doc. :)

Sorry for the delay, I replied in the design doc.