apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Column comments not syncing to AWS Glue Catalog #8857

Closed cbts-alec-johnson closed 1 year ago

cbts-alec-johnson commented 1 year ago

Describe the problem you faced

Column comments are not synced to the AWS Glue Data Catalog when setting hoodie.datasource.hive_sync.sync_comment to true and adding column comments in the dataframe schema metadata.

To Reproduce

Steps to reproduce the behavior:

  1. Add a comment to a field in a dataframe.
  2. Set hoodie.datasource.hive_sync.sync_comment to true
  3. Write the dataframe to the AWS Glue Catalog
df = df.withMetadata('col1', {'comment': 'description of the column'})
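
The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact job from this report: the helper names `hudi_glue_options` and `write_with_comments`, the record key and precombine fields, and the `hms` sync mode are assumptions for illustration. Only `hoodie.datasource.hive_sync.sync_comment` and the `withMetadata` call come from the issue itself.

```python
def hudi_glue_options(table, database, record_key, precombine):
    """Build Hudi write options with comment sync enabled.

    Helper name and argument choices are hypothetical; adjust the
    option values to your own environment.
    """
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        # Assumed sync setup; the issue does not state these values.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        # The option this issue is about.
        "hoodie.datasource.hive_sync.sync_comment": "true",
    }


def write_with_comments(df, path, options, comments):
    """Attach comments as column metadata, then write df as a Hudi table.

    `df` is expected to be a Spark DataFrame (PySpark 3.3+ for withMetadata).
    """
    for column, text in comments.items():
        df = df.withMetadata(column, {"comment": text})
    df.write.format("hudi").options(**options).mode("append").save(path)
    return df
```

With a Spark session that has the Hudi bundle available, this could be called as `write_with_comments(df, path, hudi_glue_options("my_table", "my_db", "id", "ts"), {"col1": "description of the column"})`, where the table, database, and field names are placeholders.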

Expected behavior

Setting hoodie.datasource.hive_sync.sync_comment to true when the dataframe has column comments should sync the comments to the Glue Catalog.

Environment Description

Additional context

It looks like the comment is hard-coded to empty in the function linked below. It should instead get the comment from the dataframe schema metadata.

https://github.com/apache/hudi/blob/c6dadd4cb5d82d4afa9dbfd4b089c02ebe06c14c/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java#L446-L457
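
To illustrate what "get the comment from the dataframe schema metadata" could mean, here is a small Python sketch. It is not Hudi's code (the linked function is Java); the helper name `columns_with_comments` is hypothetical, and the input is assumed to have the dict shape that PySpark's `df.schema.jsonValue()` returns.

```python
def columns_with_comments(schema_json):
    """Return (name, type, comment) for each field of a Spark schema dict.

    Falls back to an empty comment only when the field carries none,
    instead of always using an empty string.
    """
    columns = []
    for field in schema_json["fields"]:
        comment = field.get("metadata", {}).get("comment", "")
        columns.append((field["name"], str(field["type"]), comment))
    return columns


# Example schema dict, shaped like the result of df.schema.jsonValue()
# after the withMetadata call from the reproduction steps.
schema = {
    "type": "struct",
    "fields": [
        {"name": "col1", "type": "string", "nullable": True,
         "metadata": {"comment": "description of the column"}},
        {"name": "col2", "type": "long", "nullable": True, "metadata": {}},
    ],
}

for name, col_type, comment in columns_with_comments(schema):
    print(name, col_type, repr(comment))
```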

Stacktrace

danny0405 commented 1 year ago

Guess this is what you need: https://github.com/apache/hudi/pull/8740/files

cbts-alec-johnson commented 1 year ago

Yes, this is what I need. Also, I think you may have labeled this gcp-support instead of aws-support?

TrustOkoroego commented 3 months ago

@cbts-alec-johnson I need to implement this. Could you please tell me your configuration for syncing the comments?

cbts-alec-johnson commented 3 months ago

@TrustOkoroego I believe the column comments are synced correctly during a table update. However, they are not synced during table creation, since the comment is set to empty in the function linked in the issue description.