apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue #10182

Open sayanpaul-plaid opened 1 year ago

sayanpaul-plaid commented 1 year ago

Describe the problem you faced

We've observed that the `CREATE TABLE` DDL alphabetizes partition column names when syncing to Glue. The values in hoodie.properties are correct; only the Glue table seems to be affected. While this doesn't affect reads from Spark, it appears to cause issues for Trino.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Hudi table with the following code. Note that the partitioning columns are specified in c, a, b order.

    df = spark.createDataFrame([{"a": 1, "b": 1, "c": 1, "d": 1}, {"a": 2, "b": 2, "c": 2, "d": 1}])
    
    location = "s3://..."
    
    df.write.format("hudi").options(
        **{
            'hoodie.bootstrap.index.enable': 'false',
            'hoodie.datasource.write.hive_style_partitioning': 'true',
            'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
            'hoodie.datasource.write.operation': 'upsert',
            'hoodie.datasource.write.partitionpath.field': 'c:SIMPLE,a:SIMPLE,b:SIMPLE',
            'hoodie.datasource.write.precombine.field': 'd',
            'hoodie.datasource.write.recordkey.field': 'd',
            'hoodie.datasource.write.table.name': 'test_nonalpha_partitioning',
            'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
            'hoodie.table.name': 'test_nonalpha_partitioning',
        }
    ).save(location)
    
    spark.sql(f"""
        create table prototype_lakehouse_testing.test_nonalpha_partitioning
        using hudi
        location '{location}'
    """)

  2. Observe that the Glue table reports partition columns in alphabetical order:

    ❯ aws glue get-table --database-name 'prototype_lakehouse_testing' --name 'test_nonalpha_partitioning' | jq '.Table.PartitionKeys'
    [
      {
        "Name": "a",
        "Type": "bigint"
      },
      {
        "Name": "b",
        "Type": "bigint"
      },
      {
        "Name": "c",
        "Type": "bigint"
      }
    ]

    while the table's hoodie.properties reports hoodie.table.partition.fields=c,a,b
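
The comparison in step 2 can be scripted. The following is a minimal sketch, not part of the original report: it assumes you have the text of the table's hoodie.properties and the JSON printed by `aws glue get-table` available as strings, and the helper names (`partition_fields_from_properties`, `partition_keys_from_glue`, `is_reordered`) are hypothetical, introduced only for illustration.

```python
import json


def partition_fields_from_properties(text):
    """Return the ordered partition fields listed in hoodie.properties content."""
    for line in text.splitlines():
        line = line.strip()
        # Skip comment lines and anything that is not a key=value pair.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "hoodie.table.partition.fields":
            return [f.strip() for f in value.split(",") if f.strip()]
    return []


def partition_keys_from_glue(get_table_json):
    """Return the ordered partition key names from `aws glue get-table` output."""
    table = json.loads(get_table_json)["Table"]
    return [key["Name"] for key in table.get("PartitionKeys", [])]


def is_reordered(hudi_fields, glue_keys):
    """True when both sides hold the same columns but in a different order."""
    return hudi_fields != glue_keys and sorted(hudi_fields) == sorted(glue_keys)


# Values taken from the report above.
properties_text = "hoodie.table.partition.fields=c,a,b\n"
glue_output = json.dumps(
    {
        "Table": {
            "PartitionKeys": [
                {"Name": "a", "Type": "bigint"},
                {"Name": "b", "Type": "bigint"},
                {"Name": "c", "Type": "bigint"},
            ]
        }
    }
)

hudi_fields = partition_fields_from_properties(properties_text)
glue_keys = partition_keys_from_glue(glue_output)
print("hoodie.properties order:", hudi_fields)
print("Glue PartitionKeys order:", glue_keys)
print("reordered:", is_reordered(hudi_fields, glue_keys))
```

With the values from this report, the check flags the table as reordered, since both sides contain the same three columns but Glue lists them alphabetically.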

Expected behavior

We expect the Glue table to preserve the partition column order.

Environment Description

The above was run on an AWS EMR cluster running version emr-6.10.1

Stacktrace

n/a

nfarah86 commented 12 months ago

tagging @ad1happy2go

ad1happy2go commented 12 months ago

@sayanpaul-plaid I will look into it.

ad1happy2go commented 11 months ago

@CTTY @xicm Do we have any insights here w.r.t. Glue sync?