apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi doesn't propagate column comments into hive metastore / parquet files #5363

Closed (parisni closed 2 years ago)

parisni commented 2 years ago

When a field in a Spark schema has metadata with a comment entry, the Spark writer propagates the comment into the metastore.

Other metastore clients (Hive, Presto) can then describe the table and see the comments.
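For illustration, here is a minimal sketch of where such a comment sits in the JSON form of a Spark schema: under the "comment" key of each field's metadata. It uses plain Python with no Spark dependency. The column names, comment text, and the `column_comments` helper are made up for this example.

```python
import json

# JSON form of a Spark schema. A column comment is stored under the
# "comment" key of the field's metadata. All names and comment texts
# below are hypothetical.
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "user_id", "type": "long", "nullable": False,
         "metadata": {"comment": "Pseudonymized user identifier"}},
        {"name": "email", "type": "string", "nullable": True,
         "metadata": {}},
    ],
})


def column_comments(schema_json: str) -> dict:
    """Return {column name: comment} for the fields that carry a comment."""
    schema = json.loads(schema_json)
    return {
        field["name"]: field["metadata"]["comment"]
        for field in schema["fields"]
        if "comment" in field.get("metadata", {})
    }


print(column_comments(schema_json))
```

Only `user_id` carries a comment here, so it is the only column a metastore client would be expected to show one for.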

It turns out Hudi does not support them: when such a comment is added to the schema, the resulting table does not get the comment.

Digging into the source code, the schema comes either from the Hudi commit metadata (in Avro format) or from reading the last Parquet file. However, the initial comment is not present in either of them.
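As a rough sketch of one place the comment could live on the Avro side: the Avro specification allows an optional "doc" attribute on record fields. The code below is plain Python and is not Hudi's or Spark's actual conversion code. The type mapping, the helper names, and the record name are assumptions made for this example.

```python
# Hypothetical mapping that covers only the primitive types in this sketch.
SPARK_TO_AVRO = {
    "long": "long",
    "integer": "int",
    "double": "double",
    "boolean": "boolean",
    "string": "string",
}


def spark_field_to_avro(field: dict) -> dict:
    """Convert one Spark schema field (JSON form) to an Avro field dict,
    copying the Spark comment into Avro's optional "doc" attribute."""
    avro_type = SPARK_TO_AVRO[field["type"]]
    avro_field = {"name": field["name"]}
    if field.get("nullable", True):
        # Nullable columns become a union with null and default to null.
        avro_field["type"] = ["null", avro_type]
        avro_field["default"] = None
    else:
        avro_field["type"] = avro_type
    comment = field.get("metadata", {}).get("comment")
    if comment is not None:
        avro_field["doc"] = comment
    return avro_field


def spark_schema_to_avro(schema: dict, record_name: str) -> dict:
    """Build an Avro record schema from a Spark struct schema (JSON form)."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [spark_field_to_avro(f) for f in schema["fields"]],
    }


spark_schema = {
    "type": "struct",
    "fields": [
        {"name": "user_id", "type": "long", "nullable": False,
         "metadata": {"comment": "Pseudonymized user identifier"}},
        {"name": "email", "type": "string", "nullable": True, "metadata": {}},
    ],
}
avro_schema = spark_schema_to_avro(spark_schema, "example_record")
```

If the Avro schema kept in the commit metadata carried "doc" like this, a sync step could in principle read the comment back from it. Whether Hudi does so is not something this sketch shows.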

codope commented 2 years ago

@parisni this is a known issue. We did not see a strong use case for adding comments. May I know your use case? Perhaps we can take it up in a future release.

parisni commented 2 years ago

Our use case is to improve the quality of our lakehouse. Hudi tables are often accessible to end users (they allow us to apply GDPR treatment), and column and table comments are a neat way to improve data quality for analysts and the overall user experience. Also, our upstream data sources sometimes do have comments (Parquet metadata or regular Hive metastore comments), and that information is lost when the data is transformed into Hudi.

nsivabalan commented 2 years ago

Would this work for you: https://github.com/apache/hudi/pull/4960? Or are you looking for something else?

parisni commented 2 years ago

Indeed, this is exactly what I am looking for! Thanks.

On Tue, 2022-04-26 at 19:33 -0700, Sivabalan Narayanan wrote:

> would this work for you https://github.com/apache/hudi/pull/4960 or are you looking for something else?

yihua commented 2 years ago

@parisni Glad to know that. Closing this issue. Let us know if you have additional questions.