Open ion-elgreco opened 6 months ago
My understanding is that liquid clustering is not low-hanging fruit; it would require significant changes to how we write data. When writing data, we follow the Hive-style convention, where partition values are stored in the path and are not written to the physical Parquet files. Liquid clustering discards the Hive-style conventions, so we will need to accommodate that.
@Blajda ah my bad, then I misunderstood the complexity of the design document. I thought it was similar to Z-order, as in using algorithm Y to collocate certain rows and then just writing without partitioning.
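For readers unfamiliar with what "collocating rows" via Z-order means, here is a minimal sketch for two integer columns. It is not the delta-rs implementation; the function name `z_order_key` and the sample rows are made up for illustration. The idea is to interleave the bits of the column values into a single key and sort by it, so rows that are close in both columns tend to land near each other in the output.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key.

    Hypothetical helper for illustration only; assumes non-negative integers.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key


# Made-up rows of (x, y); sorting by the key clusters nearby points together.
rows = [(3, 0), (0, 0), (1, 1), (0, 1), (2, 2), (1, 0)]
sorted_rows = sorted(rows, key=lambda r: z_order_key(r[0], r[1]))
```

After sorting, the rows would simply be written out in that order with no partition directories involved, which is the part the comment above describes as "just writing without partitioning".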
> When writing data the hive-style convention is followed where partition values are stored in the path and partition values are not written to the physical parquet files. With liquid they discard the hive-style conventions so we will need to accommodate that.
We shouldn't rely on the Hive-style paths at all in our codebase. Do we? The partition values are supposed to be read from the log, not the file path. To quote the protocol (emphasis mine):
> This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log.
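To make the protocol quote concrete, here is a small sketch of reading partition values from `add` actions in a commit file rather than from the file path. The two log lines are fabricated for illustration, and `partition_values_from_log` is a hypothetical helper, not a delta-rs function. The field names (`add`, `path`, `partitionValues`, `size`, `modificationTime`, `dataChange`) are the ones the Delta protocol defines for `add` actions.

```python
import json

# Two made-up commit-log lines. The first file happens to follow the
# Hive-style path convention; the second one does not.
log_lines = [
    '{"add": {"path": "date=2024-01-01/part-0000.parquet", '
    '"partitionValues": {"date": "2024-01-01"}, '
    '"size": 100, "modificationTime": 0, "dataChange": true}}',
    '{"add": {"path": "a1/b2/part-0001.parquet", '
    '"partitionValues": {"date": "2024-01-02"}, '
    '"size": 120, "modificationTime": 0, "dataChange": true}}',
]


def partition_values_from_log(lines):
    """Map each added file path to the partition values recorded in the log."""
    result = {}
    for line in lines:
        action = json.loads(line)
        add = action.get("add")
        if add is not None:  # skip other action types (metaData, remove, ...)
            result[add["path"]] = add["partitionValues"]
    return result


values = partition_values_from_log(log_lines)
```

Note that the second file's path carries no `date=` segment at all, yet its partition value is still recoverable, which is why a reader that only trusts the log keeps working when the path convention is dropped.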
@wjones127 Yes, I don't recall any explicit dependency on Hive-style paths. My primary concern is that tables that use liquid clustering do not allow partitions, so it may require some changes to the writers. It might be enough to disable partitioning on the table during creation and simply order rows along a Hilbert curve during the write.
Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR :laughing: :clown_face:
> Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR :laughing: :clown_face:
That would be actually pretty hilarious 😂
> Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR
FWIW there is a design doc up with some details, but I don't think it's enough detail for us to implement something compatible with the existing Databricks implementation.
@wjones127 how about this commit? https://github.com/andreaschat-db/delta/commit/2f33bf680a63b8070fac91561d035e93088c4f73
Description
Use Case
To my understanding, liquid clustering would share a lot of code paths with Z-order and would be part of optimize.
I think we only need to create a Rust UDF, similar to the Z-order one, that does Hilbert clustering.
I would need to do some more reading on the algorithm, but it could be low-hanging fruit, considering it likely shares a bunch of code paths.
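As a rough illustration of what a Hilbert-based ordering function would compute, here is a sketch of the classic two-dimensional mapping from grid coordinates to a position along the Hilbert curve. It is written in Python for brevity rather than as the Rust UDF proposed above, it only handles two non-negative integer columns, and `hilbert_index` is a hypothetical name. Nothing here is taken from the Databricks implementation, whose details are not public.

```python
def hilbert_index(x: int, y: int, order: int) -> int:
    """Distance along a 2D Hilbert curve covering a (2**order x 2**order) grid.

    Illustrative sketch only; assumes 0 <= x, y < 2**order.
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Order the cells of a 4x4 grid along the curve.
cells = [(x, y) for x in range(4) for y in range(4)]
curve = sorted(cells, key=lambda c: hilbert_index(c[0], c[1], 2))
```

Unlike Z-order, consecutive positions on the Hilbert curve are always adjacent grid cells, which is the locality property that makes it attractive for clustering. A real implementation would also need to map arbitrary column types onto integer ranges and handle more than two columns, neither of which this sketch attempts.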
Related Issue(s)