[112] Add a UUID Type - Githubissues

apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.

https://xtable.apache.org/

Apache License 2.0

919 stars 147 forks source link

[112] Add a UUID Type #518

Closed danielhumanmod closed 1 month ago

danielhumanmod commented 3 months ago

Important Read

Aim to address #112

What is the purpose of the pull request

Create a UUID type in OneType and handle it appropriately in all converters. Only Iceberg has a concept of a UUID and it is a fixed size byte array in the underlying parquet files. We should aim to preserve the metadata that this byte array came from a UUID for round trip conversions.

Brief change log

Add a UUID type in InternalType
Handle UUID internal type in Iceberg/Delta/Avro(for Hudi) Converters as a fixed-size byte array
Adding unit tests

Verify this pull request

This pull request is already covered by existing tests, all existing tests should pass

danielhumanmod commented 3 months ago

Do we need to handle UUID here for Hudi partition value? https://github.com/apache/incubator-xtable/blob/main/xtable-core/src/main/java/org/apache/xtable/hudi/HudiPartitionValuesExtractor.java#L160-L162

Like:

case FIXED:
case BYTES:
case UUID:
        parsedValue = valueAsString.getBytes(StandardCharsets.UTF_8);
        break;

danielhumanmod commented 3 months ago

TODO: Probably we need some tests for round trip conversions, like Iceberg -> Delta Lake -> Iceberg to verify that UUIDs remain consistent throughout the conversions

the-other-tim-brown commented 3 months ago

Do we need to handle UUID here for Hudi partition value? https://github.com/apache/incubator-xtable/blob/main/xtable-core/src/main/java/org/apache/xtable/hudi/HudiPartitionValuesExtractor.java#L160-L162

Like:
case FIXED:
case BYTES:
case UUID:
        parsedValue = valueAsString.getBytes(StandardCharsets.UTF_8);
        break;

@danielhumanmod yes, let's add that

the-other-tim-brown commented 3 months ago

TODO: Probably we need some tests for round trip conversions, like Iceberg -> Delta Lake -> Iceberg to verify that UUIDs remain consistent throughout the conversions

@danielhumanmod we don't have any full round trip testing yet but I would like to see some added to our integration tests in the future. For this PR we can start by adding a UUID to the schema used by the Iceberg table generator that is used in ITConversionController

the-other-tim-brown commented 2 months ago

@danielhumanmod apologies for the delay on this review. I was out of office for the last 10 days. Can you resolve the conflict that GitHub is flagging? I am also going to spend a little time trying to resolve the hudi side of the reader today in parallel to see if we just need some property set on the reader.

danielhumanmod commented 2 months ago

@danielhumanmod apologies for the delay on this review. I was out of office for the last 10 days. Can you resolve the conflict that GitHub is flagging? I am also going to spend a little time trying to resolve the hudi side of the reader today in parallel to see if we just need some property set on the reader.

Thanks Tim! Sure, I have solved the conflict in latest commit. And appreciate your time to help on the Hudi issue!

danielhumanmod commented 2 months ago

Hi @the-other-tim-brown , I’ve addressed all the issues mentioned in the previous comments and have squashed all the commits into one.

Additionally, I’ve refined the comparison logic in checkDatasetEquivalence to make it less invasive to the existing code. Now, the UUID-specific comparison only applies to datasets containing UUID fields, which also resolves the ordering issue.

Thank you so much for your detailed review and suggestions—they were incredibly helpful!

danielhumanmod commented 1 month ago

Thank you for your contribution!

Thank you so much for your detailed review and approval, @the-other-tim-brown ! Your feedback helped me improve the changes, and I learned a lot from the process. I’m really looking forward to contributing more to the project and continuing to collaborate with you in the future.