apache / hudi


[SUPPORT] table comments not fully supported #7531

Open parisni opened 1 year ago

parisni commented 1 year ago

Hudi 0.12.1

When upserting a Spark DataFrame whose schema carries comment metadata, the comments are present in the committed Avro schema, and, if Hive sync is enabled, they are also propagated to the HMS. But the Spark datasource apparently omits them while reading, so the comments are hidden when the table is read back from Spark.
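
For context, Spark keeps a column comment in the StructField metadata dictionary under the "comment" key; a minimal sketch of a DataFrame carrying such metadata (the column name and value are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# the comment travels in the field's metadata dict
schema = StructType([
    StructField("uuid", StringType(), True, {"comment": "foo bar"}),
])
df = spark.createDataFrame([("some-id",)], schema=schema)
print(df.schema["uuid"].metadata)  # {'comment': 'foo bar'}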

yihua commented 1 year ago

@parisni Thanks for raising this issue. Could you provide more details and reproducible steps? When saying spark DF with comments metadata, do you mean the schema associated with the dataframe has the comments?

parisni commented 1 year ago

> When saying spark DF with comments metadata, do you mean the schema associated with the dataframe has the comments?

That's it.

Well, basically the steps are:

  1. Create a DF
  2. Add a comment to it
  3. Write the Hudi table from that DF
  4. Read the resulting table and print its schema
  5. The comments are not shown, even though they are present in the Avro schema


parisni commented 1 year ago

@yihua Here is a reproducible example:

# add uuid column with comment foo bar
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data = [
    (1, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b21", "A", "BC", "C"),
    (2, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b22", "A", "BC", "C"),
    (3, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b21", "A", "BC", "C"),
    (4, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b22", "A", "BC", "C"),
]

schema = StructType(
    [
        StructField("uuid", IntegerType(), True, {"comment": "foo bar"}),
        StructField("user_id", StringType(), True),
        StructField("col1", StringType(), True),
        StructField("ts", StringType(), True),
        StructField("part", StringType(), True),
    ]
)
df = spark.createDataFrame(data=data, schema=schema)

tableName = "test_hudi_comment"
basePath = f"/tmp/hudi/"

hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.upsert.shuffle.parallelism": 1,
    "hoodie.insert.shuffle.parallelism": 1,
    "hoodie.datasource.hive_sync.enable": "false",
}
(df.write.format("hudi").options(**hudi_options).mode("append").save(basePath))
spark.read.format("hudi").load(basePath).registerTempTable("foo")
spark.sql("desc extended foo").show()

# no "foo bar" comment on the Hudi side:
+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
| _hoodie_commit_time|   string|   null|
|_hoodie_commit_seqno|   string|   null|
|  _hoodie_record_key|   string|   null|
|_hoodie_partition...|   string|   null|
|   _hoodie_file_name|   string|   null|
|                uuid|      int|   null|
|             user_id|   string|   null|
|                col1|   string|   null|
|                  ts|   string|   null|
|                part|   string|   null|
+--------------------+---------+-------+

# but the committed Avro schema does carry the "foo bar" doc (excerpt from the .commit file):
  "partitionToWriteStats" : {
    "C" : [ {
      "fileId" : "e90400c1-5311-4fdc-83f2-757326c7560d-0",
      "path" : "C/e90400c1-5311-4fdc-83f2-757326c7560d-0_0-17-36_20221222095833220.parquet",
      "prevCommit" : "null",
      "numWrites" : 4,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 4,
      "totalWriteBytes" : 435614,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "C",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 435614,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"test_hudi_comment_record\",\"namespace\":\"hoodie.test_hudi_comment\",\"fields\":[{\"name\":\"uuid\",\"type\":[\"null\",\"int\"],\"doc\":\"foo bar\",\"default\":null},{\"name\":\"user_id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"col1\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"ts\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"part\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "INSERT"
}
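
A quick way to confirm the mismatch after running the repro above (reusing spark and basePath from it): compare the doc attributes in the committed Avro schema with the metadata of the schema read back through the datasource. A sketch, assuming Hudi 0.12's timeline layout where commit metadata is written as JSON files directly under basePath/.hoodie:

import glob
import json

# pick the latest commit file on the timeline
commit_file = sorted(glob.glob(f"{basePath}/.hoodie/*.commit"))[-1]
with open(commit_file) as f:
    commit = json.load(f)

# the committed Avro schema keeps the comment as a "doc" attribute
avro_fields = json.loads(commit["extraMetadata"]["schema"])["fields"]
print([(fld["name"], fld.get("doc")) for fld in avro_fields])
# e.g. [('uuid', 'foo bar'), ('user_id', None), ...]

# but the schema read back through the Spark datasource has lost it
print(spark.read.format("hudi").load(basePath).schema["uuid"].metadata)
# {} -- the comment is gone, which is the behavior reported here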
xushiyan commented 1 year ago

@jonvex Can you look into this, please? It looks like some config fixes should resolve it.

jonvex commented 1 year ago

Verified this issue and created a Jira ticket

codope commented 1 year ago

Tracked in HUDI-5533