AWS Glue Hudi to ICEBERG Tables Fails #362

Closed soumilshah1995 closed 6 months ago

soumilshah1995 commented 7 months ago

Hello, I'm trying to translate Hudi metadata into Iceberg. I was able to do Hudi to Delta.

sourceFormat: HUDI
    tableBasePath: s3://soumil-dev-bucket-1995/silver/table_name=orders/
    tableName: orders

The config above works.

Hudi version : 0.12

Spark Version : 3.3.0-amzn-1

Java Version :

sh-4.2$ java -version
openjdk version "1.8.0_392"
OpenJDK Runtime Environment Corretto-8.392.08.1 (build 1.8.0_392-b08)
OpenJDK 64-Bit Server VM Corretto-8.392.08.1 (build 25.392-b08, mixed mode)

I see the following error:

sh-4.2$ java -jar  ./utilities-0.1.0-beta1-bundled.jar --dataset ./my_config.yaml
2024-03-02 14:06:30 INFO  io.onetable.utilities.RunSync:141 - Running sync for basePath s3://soumil-dev-bucket-1995/silver/table_name=orders/ for following table formats [ICEBERG]
2024-03-02 14:06:32 INFO  io.onetable.client.OneTableClient:264 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-02 14:06:35 ERROR io.onetable.spi.sync.TableFormatSync:61 - Failed to sync snapshot
java.lang.IllegalArgumentException: Cannot add field order_id as an identifier field: not a required field
        at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:220) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at org.apache.iceberg.Schema.validateIdentifierField(Schema.java:126) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at org.apache.iceberg.Schema.lambda$new$0(Schema.java:106) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_392]
        at org.apache.iceberg.Schema.<init>(Schema.java:106) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at org.apache.iceberg.Schema.<init>(Schema.java:91) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at org.apache.iceberg.Schema.<init>(Schema.java:83) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.iceberg.IcebergSchemaExtractor.toIceberg(IcebergSchemaExtractor.java:90) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.iceberg.IcebergClient.initializeTableIfRequired(IcebergClient.java:125) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.iceberg.IcebergClient.beginSync(IcebergClient.java:113) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.getSyncResult(TableFormatSync.java:107) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.spi.sync.TableFormatSync.syncSnapshot(TableFormatSync.java:54) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.client.OneTableClient.lambda$syncSnapshot$4(OneTableClient.java:167) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at java.util.HashMap.forEach(HashMap.java:1290) ~[?:1.8.0_392]
        at io.onetable.client.OneTableClient.syncSnapshot(OneTableClient.java:165) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.client.OneTableClient.sync(OneTableClient.java:122) ~[utilities-0.1.0-beta1-bundled.jar:?]
        at io.onetable.utilities.RunSync.main(RunSync.java:162) ~[utilities-0.1.0-beta1-bundled.jar:?]
2024-03-02 14:06:35 INFO  io.onetable.client.OneTableClient:127 - OneTable Sync is successful for the following formats [ICEBERG]

ForeverAngry commented 7 months ago

Hi! I think you have to define the partition spec, when the source is Hudi, right?

soumilshah1995 commented 7 months ago

The tables are not partitioned.

The sync ran fine for Delta tables; it only fails for Iceberg.

the-other-tim-brown commented 7 months ago

This is an issue with the Hudi table's record key field not being required (the field is likely nullable in the Hudi metadata). You can try to rewrite the Hudi table with the record key field as a required (non-nullable) field.

Adding this to track our options for improvements in handling this case: https://github.com/apache/incubator-xtable/issues/366

the-other-tim-brown commented 7 months ago

@soumilshah1995 this is actually fixed in the code on main. Which version are you running?

soumilshah1995 commented 7 months ago

I was using the jar that was given on the GH page, utilities-0.1.0-beta1-bundled.jar. Is that not the jar I should be using?

the-other-tim-brown commented 7 months ago

You can use that jar but it is not the latest code with the bug fix. If you want to use the 0.1.0-beta1 jar, you'll need to rewrite your source table so the record key column is a required field.

soumilshah1995 commented 7 months ago

I want to kindly assure you that the record_key field is indeed not null. When you emphasize that record_key is required, it implies that writing to the Hudi table isn't possible if the keys are null. However, I'm uncertain if I fully grasp your point. Could you please provide further clarification?

the-other-tim-brown commented 7 months ago

Hudi can write a column with no null values but still list the field as nullable in its schema. By default, when writing from Spark, I believe all fields are listed as nullable. If the schema says the field is nullable, XTable will treat it as nullable, since that is how it is listed in the schema of Hudi and the underlying Parquet files.

You can check this by inspecting the Hudi commit metadata.
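As a rough illustration of that check, the sketch below parses a commit metadata JSON document and lists the top-level fields whose Avro type is a union that includes "null". It assumes the writer schema is stored as a JSON string under extraMetadata.schema, which is where Hudi commit files usually keep it; the function name nullable_fields and the sample commit are hypothetical and not taken from this thread.

```python
import json


def nullable_fields(commit_json: str) -> list:
    """Return names of top-level fields whose Avro type is a union containing "null"."""
    commit = json.loads(commit_json)
    # Assumption: the Avro writer schema is a JSON string under extraMetadata.schema.
    schema = json.loads(commit["extraMetadata"]["schema"])
    result = []
    for field in schema["fields"]:
        field_type = field["type"]
        # In Avro, a nullable field is declared as a union such as ["null", "string"].
        if isinstance(field_type, list) and "null" in field_type:
            result.append(field["name"])
    return result


# Hypothetical, trimmed-down commit metadata used only for illustration.
sample_commit = json.dumps({
    "extraMetadata": {
        "schema": json.dumps({
            "type": "record",
            "name": "orders_record",
            "fields": [
                {"name": "order_id", "type": ["null", "string"], "default": None},
                {"name": "amount", "type": "double"},
            ],
        })
    }
})

print(nullable_fields(sample_commit))
```

For a real table you would pass in the contents of the latest commit file under the table's .hoodie directory instead of sample_commit. If the record key shows up in the returned list, the schema marks it as nullable even when the data contains no nulls.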

soumilshah1995 commented 7 months ago

While writing data, do I need to specify the schema?

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("name", StringType(), nullable=True),  # Example of nullable field
    StructField("state", StringType(), nullable=False),
    StructField("city", StringType(), nullable=False),
    StructField("email", StringType(), nullable=False),
    StructField("created_at", TimestampType(), nullable=False),
    StructField("address", StringType(), nullable=False),
])

Should this fix the issue?

the-other-tim-brown commented 7 months ago

@soumilshah1995 you would need to incorporate that schema into your writer. One way I've set the schema on the writer in the past is with the hoodie.write.schema option. This takes an Avro schema as a string.
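A minimal sketch of what that could look like: build an Avro record schema as a JSON string, with required fields declared as plain types and optional fields as unions with "null", then pass it through the writer options. The helper avro_schema_string, the record name customers_record, and the choice of customer_id as the record key are assumptions for illustration. Whether a given Hudi version honors hoodie.write.schema for a particular write path is not verified here.

```python
import json


def avro_schema_string(record_name, required, optional):
    """Build an Avro record schema (as a JSON string) from (name, type) pairs."""
    # Required fields use a plain type, so they are non-nullable.
    fields = [{"name": name, "type": avro_type} for name, avro_type in required]
    # Optional fields use a union with "null" and default to null.
    fields += [
        {"name": name, "type": ["null", avro_type], "default": None}
        for name, avro_type in optional
    ]
    return json.dumps({"type": "record", "name": record_name, "fields": fields})


write_schema = avro_schema_string(
    "customers_record",  # hypothetical record name
    required=[("customer_id", "string"), ("state", "string"), ("city", "string")],
    optional=[("name", "string")],
)

# Writer options; the record key field is assumed to be customer_id.
hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.write.schema": write_schema,
}

# With a Spark DataFrame `df`, the options would be applied roughly like this:
# df.write.format("hudi").options(**hudi_options).mode("append").save(table_path)
print(hudi_options["hoodie.write.schema"])
```

The point of the sketch is that customer_id is declared with the plain type "string" rather than ["null", "string"], so the schema recorded with the commit marks it as required.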

soumilshah1995 commented 7 months ago

Roger that, I'll try this and update soon.

soumilshah1995 commented 6 months ago

Just wanted to update you guys: I didn't get a chance to try this today. I'll try it when I have free time and update you here on GH.

soumilshah1995 commented 6 months ago


sourceFormat: HUDI

    tableBasePath: s3://soumil-dev-bucket-1995/silver/table_name=customers/
    tableName: customers
    partitionSpec: state:VALUE


2024-03-08 19:40:15 INFO  io.onetable.utilities.RunSync:141 - Running sync for basePath s3://soumil-dev-bucket-1995/silver/table_name=customers/ for following table formats [ICEBERG]
2024-03-08 19:40:17 INFO  io.onetable.client.OneTableClient:264 - No previous OneTable sync for target. Falling back to snapshot sync.
2024-03-08 19:40:27 INFO  io.onetable.client.OneTableClient:127 - OneTable Sync is successful for the following formats [ICEBERG]


soumilshah1995 commented 6 months ago

@the-other-tim-brown closing the ticket.