[SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue

sayanpaul-plaid commented 10 months ago

Describe the problem you faced

We've observed that the `CREATE TABLE` DDL alphabetizes partition column names when syncing to Glue. The values in hoodie.properties are correct, so this appears to affect only the Glue table. Reads from Spark are unaffected, but the reordering appears to cause issues for Trino.

To Reproduce

Steps to reproduce the behavior:

Create a Hudi table with the following code. Note that the partitioning columns are specified in c, a, b order.

df = spark.createDataFrame([{"a": 1, "b": 1, "c": 1, "d": 1}, {"a": 2, "b": 2, "c": 2, "d": 1}])

location = "s3://..."

df.write.format("hudi").options(
    **{
        'hoodie.bootstrap.index.enable': 'false',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.partitionpath.field': 'c:SIMPLE,a:SIMPLE,b:SIMPLE',
        'hoodie.datasource.write.precombine.field': 'd',
        'hoodie.datasource.write.recordkey.field': 'd',
        'hoodie.datasource.write.table.name': 'test_nonalpha_partitioning',
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.table.name': 'test_nonalpha_partitioning',
    }
).save(location)

spark.sql(f"""
    create table prototype_lakehouse_testing.test_nonalpha_partitioning
    using hudi
    location '{location}'
""")

Observe that the Glue table reports partition columns in alphabetical order:

❯ aws glue get-table --database-name 'prototype_lakehouse_testing' --name 'test_nonalpha_partitioning' | jq '.Table.PartitionKeys'
[
  {
    "Name": "a",
    "Type": "bigint"
  },
  {
    "Name": "b",
    "Type": "bigint"
  },
  {
    "Name": "c",
    "Type": "bigint"
  }
]

By contrast, the table's hoodie.properties reports hoodie.table.partition.fields=c,a,b.
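The mismatch between the two sources can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names (`partition_fields_from_properties`, `partition_keys_from_glue`, `order_matches`) are hypothetical, and the inputs are the literal values quoted above rather than live reads from S3 or Glue.

```python
import json


def partition_fields_from_properties(text):
    """Return the ordered partition fields found in hoodie.properties text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "hoodie.table.partition.fields":
            return [field.strip() for field in value.split(",") if field.strip()]
    return []


def partition_keys_from_glue(partition_keys_json):
    """Return the ordered key names from the `.Table.PartitionKeys` JSON array."""
    return [entry["Name"] for entry in json.loads(partition_keys_json)]


def order_matches(hudi_fields, glue_keys):
    """True only if both lists hold the same names in the same order."""
    return hudi_fields == glue_keys


# Values copied from this report (hypothetical variable names).
properties_text = "hoodie.table.partition.fields=c,a,b\n"
glue_json = (
    '[{"Name": "a", "Type": "bigint"},'
    ' {"Name": "b", "Type": "bigint"},'
    ' {"Name": "c", "Type": "bigint"}]'
)

hudi_fields = partition_fields_from_properties(properties_text)
glue_keys = partition_keys_from_glue(glue_json)
print(hudi_fields, glue_keys, order_matches(hudi_fields, glue_keys))
```

With the values from this report, both lists contain the same three names, but the order check fails because Glue lists them alphabetically.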

Expected behavior

We expect the Glue table to preserve the partition column order.
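One way to state the intended order explicitly is to derive both the write-side partition path and the sync-side partition fields from a single ordered list. The sketch below is an untested assumption, not a confirmed workaround: it uses Hudi's writer-side `hoodie.datasource.hive_sync.*` options instead of the `CREATE TABLE` DDL path, assumes the cluster's Hive metastore is backed by Glue, and the helper name `build_write_options` is hypothetical. Whether this changes the order Glue reports is not established in this thread.

```python
def build_write_options(table_name, database, partition_fields):
    """Build Hudi writer options from one ordered list of partition fields.

    Assumption: syncing through the writer's hive_sync options is being tried
    as an alternative to `CREATE TABLE ... USING hudi`; it is not verified to
    preserve partition order in Glue.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
        # CustomKeyGenerator expects field:TYPE pairs, in partition order.
        "hoodie.datasource.write.partitionpath.field": ",".join(
            f"{field}:SIMPLE" for field in partition_fields
        ),
        # Sync-side options, built from the same ordered list.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": ",".join(partition_fields),
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    }


options = build_write_options(
    "test_nonalpha_partitioning",
    "prototype_lakehouse_testing",
    ["c", "a", "b"],
)

# Usage in a Spark session (not executed here):
# df.write.format("hudi").options(**options).mode("append").save(location)
```

Building both settings from one list at least rules out a mismatch introduced by hand; the remaining options from the reproduction (record key, precombine field, operation) would still need to be added.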

Environment Description

The above was run on an AWS EMR cluster running release emr-6.10.1.

Hudi version : 0.12.2-amzn-0
Spark version : 3.3.1
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Spark on Docker

Stacktrace

n/a

nfarah86 commented 10 months ago

tagging @ad1happy2go

ad1happy2go commented 10 months ago

@sayanpaul-plaid I will look into it.

ad1happy2go commented 10 months ago

@CTTY @xicm Do we have any insights here regarding Glue sync?

apache / hudi

[SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue #10182