apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue #10182

Open sayanpaul-plaid opened 10 months ago

sayanpaul-plaid commented 10 months ago

Describe the problem you faced

We've observed that the CREATE TABLE DDL alphabetizes partition column names when syncing to Glue. The values in hoodie.properties are correct, so this appears to affect only the Glue table. Reads from Spark are not impacted, but the reordering seems to cause issues for Trino.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Hudi table with the following code. Note that the partitioning columns are specified in c, a, b order.

    df = spark.createDataFrame([{"a": 1, "b": 1, "c": 1, "d": 1}, {"a": 2, "b": 2, "c": 2, "d": 1}])
    
    location = "s3://..."
    
    df.write.format("hudi").options(
        **{
            'hoodie.bootstrap.index.enable': 'false',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
            'hoodie.datasource.write.operation': 'upsert',
            'hoodie.datasource.write.partitionpath.field': 'c:SIMPLE,a:SIMPLE,b:SIMPLE',
            'hoodie.datasource.write.precombine.field': 'd',
            'hoodie.datasource.write.recordkey.field': 'd',
            'hoodie.datasource.write.table.name': 'test_nonalpha_partitioning',
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.table.name': 'test_nonalpha_partitioning',
        }
    ).save(location)
    
    spark.sql(f"""
        create table prototype_lakehouse_testing.test_nonalpha_partitioning
        using hudi
        location '{location}'
    """)

  2. Observe that the Glue table reports partition columns in alphabetical order:

    ❯ aws glue get-table --database-name 'prototype_lakehouse_testing' --name 'test_nonalpha_partitioning' | jq '.Table.PartitionKeys'
    [
      {
        "Name": "a",
        "Type": "bigint"
      },
      {
        "Name": "b",
        "Type": "bigint"
      },
      {
        "Name": "c",
        "Type": "bigint"
      }
    ]

    while the table's hoodie.properties reports hoodie.table.partition.fields=c,a,b
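
The mismatch in the steps above can also be checked programmatically. The following is a minimal sketch, not part of the original report: it assumes you have the contents of hoodie.properties as text and the JSON printed by `aws glue get-table` as a string. The helper names (`partition_fields_from_properties`, `glue_partition_key_names`, `partition_order_matches`) are hypothetical and exist only for this illustration; the sample values are copied from the output shown above.

```python
import json


def partition_fields_from_properties(properties_text):
    """Return the ordered partition fields listed in hoodie.properties text."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip comments and lines that are not key=value pairs.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "hoodie.table.partition.fields":
            return [field.strip() for field in value.split(",") if field.strip()]
    return []


def glue_partition_key_names(get_table_json):
    """Return partition key names, in Glue's order, from `aws glue get-table` JSON."""
    table = json.loads(get_table_json)["Table"]
    return [key["Name"] for key in table.get("PartitionKeys", [])]


def partition_order_matches(properties_text, get_table_json):
    """True only if both sources list the same columns in the same order."""
    return (
        partition_fields_from_properties(properties_text)
        == glue_partition_key_names(get_table_json)
    )


# Sample inputs copied from the report above (hypothetical variable names).
properties_text = "hoodie.table.partition.fields=c,a,b\n"
get_table_json = json.dumps({"Table": {"PartitionKeys": [
    {"Name": "a", "Type": "bigint"},
    {"Name": "b", "Type": "bigint"},
    {"Name": "c", "Type": "bigint"},
]}})

print(partition_fields_from_properties(properties_text))
print(glue_partition_key_names(get_table_json))
print(partition_order_matches(properties_text, get_table_json))
```

With the values from this report, the two lists contain the same columns but in a different order, so the final check reports a mismatch.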

Expected behavior

We expect the Glue table to preserve the partition column order.

Environment Description

The above was run on an AWS EMR cluster running version emr-6.10.1

Stacktrace

n/a

nfarah86 commented 10 months ago

tagging @ad1happy2go

ad1happy2go commented 10 months ago

@sayanpaul-plaid I will look into it.

ad1happy2go commented 10 months ago

@CTTY @xicm Do we have any insights here regarding Glue sync?