Open parisni opened 1 year ago
@parisni Thanks for raising this issue. Could you provide more details and reproducible steps? When saying "spark DF with comments metadata", do you mean the schema associated with the dataframe has the comments?
When saying spark DF with comments metadata, do you mean the schema associated with the dataframe has the comments?
That's it.
Well, basically the steps are:
@yihua reproducible example
# add uuid column with comment foo bar
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Assumes an active SparkSession named `spark` (e.g. the pyspark shell) with the Hudi bundle loaded.
data = [
    (1, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b21", "A", "BC", "C"),
    (2, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b22", "A", "BC", "C"),
    (3, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b21", "A", "BC", "C"),
    (4, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b22", "A", "BC", "C"),
]
schema = StructType(
    [
        # The column comment is carried in the field metadata.
        StructField("uuid", IntegerType(), True, {"comment": "foo bar"}),
        StructField("user_id", StringType(), True),
        StructField("col1", StringType(), True),
        StructField("ts", StringType(), True),
        StructField("part", StringType(), True),
    ]
)
df = spark.createDataFrame(data=data, schema=schema)
tableName = "test_hudi_comment"
basePath = "/tmp/hudi/"
hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.upsert.shuffle.parallelism": 1,
    "hoodie.insert.shuffle.parallelism": 1,
    "hoodie.datasource.hive_sync.enable": "false",
}
df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)
# registerTempTable is deprecated; createOrReplaceTempView is the current equivalent.
spark.read.format("hudi").load(basePath).createOrReplaceTempView("foo")
spark.sql("desc extended foo").show()
# there is no foo bar on the hudi side
+--------------------+---------+-------+
| col_name|data_type|comment|
+--------------------+---------+-------+
| _hoodie_commit_time| string| null|
|_hoodie_commit_seqno| string| null|
| _hoodie_record_key| string| null|
|_hoodie_partition...| string| null|
| _hoodie_file_name| string| null|
| uuid| int| null|
| user_id| string| null|
| col1| string| null|
| ts| string| null|
| part| string| null|
+--------------------+---------+-------+
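The same check can be made programmatically instead of reading the `desc extended` output. This is only a minimal sketch: `column_comments` is a hypothetical helper name (not part of Spark or Hudi), and it assumes the argument is a `pyspark.sql.types.StructType`, or any object exposing `.fields` whose items have `.name` and `.metadata`.

```python
def column_comments(schema):
    """Map each column name to its comment, or None when no comment is set.

    Spark keeps a column comment under the "comment" key of StructField.metadata.
    """
    return {
        field.name: (field.metadata or {}).get("comment")
        for field in schema.fields
    }


# Possible usage in the session above (assumes `df`, `spark` and `basePath` exist):
#   column_comments(df.schema)                                        # written side
#   column_comments(spark.read.format("hudi").load(basePath).schema)  # read side
```

Comparing the two results side by side should show whether the `uuid` comment is lost between write and read.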
# the avro has foo bar doc
  "partitionToWriteStats" : {
    "C" : [ {
      "fileId" : "e90400c1-5311-4fdc-83f2-757326c7560d-0",
      "path" : "C/e90400c1-5311-4fdc-83f2-757326c7560d-0_0-17-36_20221222095833220.parquet",
      "prevCommit" : "null",
      "numWrites" : 4,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 4,
      "totalWriteBytes" : 435614,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "C",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 435614,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"test_hudi_comment_record\",\"namespace\":\"hoodie.test_hudi_comment\",\"fields\":[{\"name\":\"uuid\",\"type\":[\"null\",\"int\"],\"doc\":\"foo bar\",\"default\":null},{\"name\":\"user_id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"col1\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"ts\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"part\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "INSERT"
}
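To confirm that the `doc` really is in the committed schema without reading the file by eye, something like the following could be used. It is a sketch under assumptions: the helper names are hypothetical, and it assumes a copy-on-write table on a local filesystem where completed instants are stored as `<instant>.commit` files directly under `<basePath>/.hoodie` (other table types or Hudi versions may lay out the timeline differently).

```python
import json
from pathlib import Path


def latest_commit_file(base_path):
    """Return the newest completed commit file under <base_path>/.hoodie, or None."""
    # Instant times are fixed-width timestamps, so a lexicographic sort is chronological.
    commits = sorted(Path(base_path, ".hoodie").glob("*.commit"))
    return commits[-1] if commits else None


def avro_field_docs(commit_metadata):
    """Map field name -> Avro "doc" (or None) from the schema in commit metadata."""
    # extraMetadata.schema is itself a JSON-encoded string, so it is decoded a second time.
    schema = json.loads(commit_metadata["extraMetadata"]["schema"])
    return {field["name"]: field.get("doc") for field in schema["fields"]}


def docs_for_table(base_path):
    """Read the latest commit file and return its field docs ({} if no commit exists)."""
    commit = latest_commit_file(base_path)
    if commit is None:
        return {}
    return avro_field_docs(json.loads(commit.read_text()))
```

With the commit shown above, `docs_for_table("/tmp/hudi/")` would be expected to report `"foo bar"` for `uuid` and `None` for the other columns.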
@jonvex can you look into this please? looks like some config fixes should resolve it
Verified this issue and created a Jira ticket
Tracked in HUDI-5533
Hudi 0.12.1
When upserting a Spark DataFrame that carries column comments in its metadata, the comments are present in the committed Avro schema. If Hive sync is enabled, they are also propagated to HMS. However, the Spark datasource apparently omits them on read, so they are hidden when the table is read from Spark.
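Until the datasource carries the comments through, one possible workaround is to copy the comments back onto the Spark schema after reading. The sketch below is not a Hudi feature: `with_comments` is a hypothetical helper that works on the dict form of a schema (as returned by `StructType.jsonValue()`), and it only handles top-level columns, not nested fields.

```python
import copy


def with_comments(schema_json, docs):
    """Return a copy of a Spark schema (dict form) with comments filled in.

    schema_json: the dict produced by StructType.jsonValue().
    docs: mapping of column name -> comment text; None entries are skipped.
    """
    patched = copy.deepcopy(schema_json)
    for field in patched["fields"]:
        comment = docs.get(field["name"])
        if comment is not None:
            # Keep any existing metadata and only set the "comment" key.
            field["metadata"] = {**field.get("metadata", {}), "comment": comment}
    return patched


# Possible usage with PySpark (not run here; `df` is the DataFrame read from Hudi):
#   from pyspark.sql.types import StructType
#   patched = with_comments(df.schema.jsonValue(), {"uuid": "foo bar"})
#   df_with_comments = spark.createDataFrame(df.rdd, StructType.fromJson(patched))
```

The `docs` mapping could come from the Avro schema in the commit metadata, where the comment is stored as the field's `doc`.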