delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.99k stars, 362 forks

Add liquid clustering #2043

Open ion-elgreco opened 6 months ago

ion-elgreco commented 6 months ago

Description

Use Case

To my understanding, liquid clustering would share many code paths with Z-order and would be part of optimize.

I think we only need to create a Rust UDF, similar to the Z-order one, that does Hilbert clustering.

I would need to do some more reading on the algorithm, but it could be low-hanging fruit, considering it likely shares a lot of code paths.
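For readers unfamiliar with the two orderings, here is a minimal sketch of how a Z-order key and a Hilbert key differ for two integer columns. It is illustration only, written in Python for brevity, and is not taken from delta-rs or from any liquid clustering implementation. The function names (`z_order_index`, `hilbert_index`, `max_step`) are hypothetical, and the Hilbert routine is the textbook 2D coordinate-to-distance conversion, which assumes both inputs lie in `[0, 2**order)`.

```python
def z_order_index(order: int, x: int, y: int) -> int:
    """Morton (Z-order) key: interleave the low `order` bits of x and y."""
    d = 0
    for bit in range(order):
        d |= ((x >> bit) & 1) << (2 * bit)      # x bits go to even positions
        d |= ((y >> bit) & 1) << (2 * bit + 1)  # y bits go to odd positions
    return d


def hilbert_index(order: int, x: int, y: int) -> int:
    """Hilbert key: distance of cell (x, y) along a 2**order-wide Hilbert curve."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def max_step(index_fn, order: int) -> int:
    """Largest Manhattan distance between consecutive cells along the curve."""
    n = 1 << order
    cells = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda c: index_fn(order, c[0], c[1]),
    )
    return max(
        abs(x1 - x0) + abs(y1 - y0)
        for (x0, y0), (x1, y1) in zip(cells, cells[1:])
    )
```

Sorting rows by either key places rows that are close in both columns near each other. On a grid, `max_step(hilbert_index, order)` stays at 1, whereas the Z-order curve makes longer jumps between some consecutive cells, which is the usual argument for Hilbert ordering giving tighter per-file min/max ranges.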

Related Issue(s)

Blajda commented 6 months ago

My understanding is that liquid clustering is not low-hanging fruit; it would require significant changes to how we write data. When writing data the hive-style convention is followed where partition values are stored in the path and partition values are not written to the physical parquet files. With liquid they discard the hive-style conventions so we will need to accommodate that.

ion-elgreco commented 6 months ago

@Blajda ah, my bad, then I misunderstood the complexity of the design document. I thought it was similar to Z-order, as in: use algorithm Y to colocate certain rows and then just write without partitioning.

wjones127 commented 6 months ago

> When writing data the hive-style convention is followed where partition values are stored in the path and partition values are not written to the physical parquet files. With liquid they discard the hive-style conventions so we will need to accommodate that.

We shouldn't rely on the Hive-style paths at all in our codebase. Do we? The partition values are supposed to be read from the log, not the file path. To quote the protocol (emphasis mine):

> This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log.
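To make the protocol's point concrete, here is a small sketch (Python, illustration only, not delta-rs code) that collects partition values from the `add` actions of a commit file and never looks at directory names. The `add.path` and `add.partitionValues` fields come from the Delta protocol. The helper name `partition_values_from_log` and the sample commit contents are made up for this example.

```python
import json

# Made-up commit contents: one JSON action per line, as in a _delta_log commit file.
# Note that the data file paths carry no Hive-style `date=...` directories.
SAMPLE_COMMIT = """
{"commitInfo": {"operation": "WRITE"}}
{"add": {"path": "part-00000.parquet", "partitionValues": {"date": "2024-01-01"}, "size": 100, "modificationTime": 0, "dataChange": true}}
{"add": {"path": "part-00001.parquet", "partitionValues": {"date": "2024-01-02"}, "size": 120, "modificationTime": 0, "dataChange": true}}
"""


def partition_values_from_log(commit_lines):
    """Map each added file path to the partition values recorded in the log."""
    result = {}
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is not None:  # skip commitInfo, metaData, remove, ...
            result[add["path"]] = add.get("partitionValues", {})
    return result


values = partition_values_from_log(SAMPLE_COMMIT.splitlines())
```

A reader built this way keeps working when a writer stops producing Hive-style directories, because the path is treated as an opaque file name.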

Blajda commented 6 months ago

@wjones127 Yes, I don't recall any explicit dependency on Hive-style paths. My primary concern is that tables that use liquid clustering do not allow partitions, so it may require some changes to the writers. It might be enough to disable partitioning on the table during creation and simply order rows along a Hilbert curve during the write.

rtyler commented 6 months ago

Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR 😆 🤡

ion-elgreco commented 6 months ago

> Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR 😆 🤡

That would actually be pretty hilarious 😂

wjones127 commented 6 months ago

> Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR

FWIW there is a design doc up with some details, but I don't think it's enough detail for us to implement something compatible with the existing Databricks implementation.

https://github.com/delta-io/delta/issues/1874

https://docs.google.com/document/d/1FWR3odjOw4v4-hjFy_hVaNdxHVs4WuK1asfB6M6XEMw/edit#heading=h.skpz7c7ga1wl

ion-elgreco commented 6 months ago

@wjones127 how about this commit? https://github.com/andreaschat-db/delta/commit/2f33bf680a63b8070fac91561d035e93088c4f73