Open michael1991 opened 3 weeks ago
@michael1991 Thanks for raising this. Can you help me reproduce the issue? I tried the code below and it worked fine for me.
from faker import Faker
import pandas as pd

fake = Faker()
data = [{"ID": fake.uuid4(), "EventTime": "2023-03-04 14:44:42.046661",
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "CityNameDummyBigFieldName": fake.city(), "ts": "1",
         "StateNameDummyBigFieldName": fake.state(), "Country": fake.country()} for _ in range(1000)]
pandas_df = pd.DataFrame(data)
hoodie_properties = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.recordkey.field': 'ID',
    'hoodie.datasource.write.partitionpath.field': 'StateNameDummyBigFieldName,CityNameDummyBigFieldName',
    'hoodie.table.name': 'test'
}
spark.sparkContext.setLogLevel("WARN")
df = spark.createDataFrame(pandas_df)
df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)
for i in range(1, 50):
    df.write.format("hudi").options(**hoodie_properties).mode("append").save(PATH)
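To match the reporter's setup more closely, a hypothetical variant of the properties above that uses underscore-named partition columns (the names req_date and req_hour are taken from the report; the generated DataFrame would also need matching "req_date" and "req_hour" columns before writing):

```python
# Hypothetical variant of the reproduction config, switching the partition
# path to the underscore-named columns from the report (req_date, req_hour).
hoodie_properties_underscore = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.recordkey.field': 'ID',
    'hoodie.datasource.write.partitionpath.field': 'req_date,req_hour',
    'hoodie.table.name': 'test'
}
```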
Hi @ad1happy2go, glad to hear from you again ~ Can you try a column name with an underscore? I'm not sure whether enabling urlencode for partitions, combined with an underscore in the partition column name, could make this happen.
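For reference, standard URL encoding treats the underscore as an unreserved character and never escapes it, so the column name itself should pass through url-encoded partition paths unchanged. A quick sketch with Python's urllib (Hudi's own encoder may behave differently, so this is only illustrative):

```python
from urllib.parse import quote

# Underscores are unreserved characters, so url-encoding leaves them intact,
# while characters like spaces are percent-escaped.
print(quote("req_hour", safe=""))        # underscore survives: req_hour
print(quote("2024-06-17 00", safe=""))   # space is escaped: 2024-06-17%2000
```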
@michael1991 How many partitions does the table have? Is it possible to get the URI? I was not able to reproduce this.
@ad1happy2go Partitions are hours, for example gs://bucket/tables/hudi/r_date=2024-06-17/r_hour=00. The problem only occurs with two partition columns whose names contain underscores; we also use a single partition column like yyyyMMddHH and that works fine. Not sure of the exact cause.
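To illustrate the two layouts being compared (a sketch only; the exact formats are assumed from the example path above): the single-column yyyyMMddHH scheme versus the two-column r_date/r_hour hive-style scheme:

```python
from datetime import datetime

ts = datetime(2024, 6, 17, 0)

# Single partition column in yyyyMMddHH form (the layout that works fine):
single = ts.strftime("%Y%m%d%H")  # "2024061700"

# Two hive-style partition columns with underscores in their names
# (the layout where the error appears):
two_level = f"r_date={ts.strftime('%Y-%m-%d')}/r_hour={ts.strftime('%H')}"
# "r_date=2024-06-17/r_hour=00"
```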
Can you try reproducing this issue with the sample code, @michael1991? That will help us triage it better.
Describe the problem you faced
I'm using Spark 3.5 + Hudi 0.15.0 with a partitioned table. When I choose req_date and req_hour as the partition column names, I get this error, although the task ultimately completes successfully; when I choose date and hour as the partition column names, the error disappears.

Expected behavior

We should get no errors when we make the partition column names a bit longer.
Environment Description
Hudi version : 0.15.0
Spark version : 3.5.0
Hive version : NA
Hadoop version : 3.3.6
Storage (HDFS/S3/GCS..) : GCS
Running on Docker? (yes/no) : no
Stacktrace