Open eshu opened 8 months ago
You are right, we should encode the partition path for these special characters.
@eshu I tried inserting these values, and at least read/write worked fine. I understand that in the case of a slash it created an inner subfolder. Were you able to make it work by encoding them? Let us know if you need any other help here, or feel free to close the issue if all is good.
columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
data = [
    (1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san francisco"),
    (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-B", "driver-L", 27.70, "san-francisco"),
    (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-C", "driver-M", 33.90, "san_francisco%"),
    (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-C", "driver-N", 34.15, "sao/paulo"),
]
# get_spark_session is a helper; tableName and basePath are not defined in this snippet
spark = get_spark_session(spark_version="3.2", hudi_version="0.13.0")
inserts = spark.createDataFrame(data).toDF(*columns)
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.partitionpath.field': 'city',
}
# Insert data
inserts.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save(basePath)
# Read the table back
spark.read.format("hudi").load(basePath).show()
A similar JIRA was raised to fix this issue: https://issues.apache.org/jira/browse/HUDI-7484
@ad1happy2go It does not work in my example. Did you try it?
Yes I tried this - https://github.com/apache/hudi/issues/10754#issuecomment-1979027421
Can you try the same?
@eshu Any updates on the same?
When the partition column contains the slash character ("/"), Hudi may write the data incorrectly or fail to read it back.
Test (I use some helpers to write and read Hudi data; they write the data to the local FS and read it back):
The output is
As you can see, rows 13 and 14 were not read, and "partition" and "partition/" have the same path on the file system (I am not sure about the impact, but there could probably be performance issues).
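To illustrate why two different partition values can end up in the same directory, here is a minimal sketch using plain POSIX path handling. It is not Hudi's code: `partition_dir` and the base path `/tmp/hudi_table` are hypothetical names, and the sketch only assumes that the partition value is appended to the table base path unescaped.

```python
import posixpath

# Hypothetical table location, used only for illustration.
BASE_PATH = "/tmp/hudi_table"

def partition_dir(base_path, partition_value):
    """Join an unescaped partition value onto the base path and normalize it,
    the way a file system would resolve the resulting directory."""
    return posixpath.normpath(posixpath.join(base_path, partition_value))

# A trailing slash disappears after normalization, so both values collide.
print(partition_dir(BASE_PATH, "partition"))   # /tmp/hudi_table/partition
print(partition_dir(BASE_PATH, "partition/"))  # /tmp/hudi_table/partition

# An embedded slash turns one partition value into two directory levels.
print(partition_dir(BASE_PATH, "sao/paulo"))   # /tmp/hudi_table/sao/paulo
```

Under that assumption, "partition" and "partition/" are distinct column values that map to one directory, and "sao/paulo" becomes a nested folder rather than a single partition.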
Maybe it would be a good idea to quote some characters in partition paths?
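One possible way to quote such characters is percent-encoding. The sketch below uses Python's standard `urllib.parse.quote` and `unquote`; the helper names `encode_partition_value` and `decode_partition_value` are hypothetical, and this is an illustration of the idea rather than how Hudi itself encodes partition paths.

```python
from urllib.parse import quote, unquote

def encode_partition_value(value):
    """Percent-encode a partition value so it is safe as a single path segment.
    safe="" makes quote() escape "/" as well (it is left alone by default)."""
    return quote(value, safe="")

def decode_partition_value(encoded):
    """Reverse encode_partition_value."""
    return unquote(encoded)

for city in ["san francisco", "san-francisco", "san_francisco%", "sao/paulo"]:
    encoded = encode_partition_value(city)
    # The round trip must return the original value.
    assert decode_partition_value(encoded) == city
    print(f"{city!r} -> {encoded!r}")

# 'san francisco' -> 'san%20francisco'
# 'san-francisco' -> 'san-francisco'
# 'san_francisco%' -> 'san_francisco%25'
# 'sao/paulo' -> 'sao%2Fpaulo'
```

Because "%" itself is escaped as "%25", the encoding is reversible, and values such as "partition" and "partition/" map to different directory names ("partition" and "partition%2F"). Applied to the partition column before writing, this would act as a workaround on the user side.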
Environment Description
Hudi version : 0.13.1
Storage (HDFS/S3/GCS..): Local FS