apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Fail to add default partition #10154

Open njalan opened 11 months ago

njalan commented 11 months ago

I got the error message below:

```
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL ALTER TABLE ods_xxx.xx ADD IF NOT EXISTS PARTITION (xx='__HIVE_DEFAULT_PARTITION__') LOCATION 'xxxx/__HIVE_DEFAULT_PARTITION__'
    at org.apache.hudi.hive.ddl.JDBCExecutor.runSQL(JDBCExecutor.java:70)
    at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.lambda$addPartitionsToTable$0(QueryBasedDDLExecutor.java:124)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
    at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.addPartitionsToTable(QueryBasedDDLExecutor.java:124)
    at org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:109)
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:445)
    at org.apache.hudi.hive.HiveSyncTool.syncAllPartitions(HiveSyncTool.java:399)
    ... 69 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10111]: Partition value contains a reserved substring (User value: __HIVE_DEFAULT_PARTITION__ Reserved substring: __HIVE_DEFAULT_PARTITION__)
```
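For context, a minimal Python sketch of the validation Hive performs here (SemanticException, Error 10111): any user-supplied partition value containing the reserved default-partition name is rejected. The function name is illustrative; the constant mirrors the default value of Hive's `hive.exec.default.partition.name` setting.

```python
# Reserved name Hive uses for NULL/empty partition values; any user value
# containing it as a substring is rejected at compile time.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def validate_partition_value(value: str) -> None:
    """Sketch of Hive's reserved-substring check (Error 10111)."""
    if HIVE_DEFAULT_PARTITION in value:
        raise ValueError(
            f"Partition value contains a reserved substring "
            f"(User value: {value} Reserved substring: {HIVE_DEFAULT_PARTITION})"
        )
```

This is why the generated `ALTER TABLE ... ADD PARTITION` fails: Hudi passes the reserved name as a literal partition value, and Hive refuses it.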

Environment Description

Hudi version : 0.13.1

Spark version : 3.0.1

Hive version : 3.1

Hadoop version : 3.2.2

Storage (HDFS/S3/GCS..) :

Running on Docker? : no

ad1happy2go commented 10 months ago

@njalan Does your partition column contain NULLs in the data? When do you face this error? It looks like you are trying to add the null partition. It may not be a Hudi issue but rather a Hive-side one.

You may try:

```sql
ALTER TABLE ods_xxx.xx ADD IF NOT EXISTS PARTITION (xx=null) LOCATION 'xxxx/__HIVE_DEFAULT_PARTITION__'
```

CaesarWangX commented 4 months ago

Hi @danny0405 @ad1happy2go @xushiyan When we upgraded from 0.11.1 to 0.14.0, the default partition value was changed from `default` to `__HIVE_DEFAULT_PARTITION__`, which caused two problems:

  1. An error is reported during Hive sync, causing the task to fail: `Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Partition value contains a reserved substring`
  2. There is already a large amount of data in the historical table, some of it still under `default`. Our upgrade process should not change it, and we will keep writing records with null partition values to `default`.

We hope that upgrading Hudi does not affect the business, and that a configuration option is provided so users can choose. As it stands, this is very unfriendly to users 😅

danny0405 commented 4 months ago

Thanks for the feedback @CaesarWangX. Did you try HMS as the sync mode? The 1st issue is unexpected and should be a bug; the motivation was to keep the default partition name in sync with Hive, but it now causes problems reported by Hive.
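A minimal sketch of switching Hive sync from the deprecated JDBC mode to HMS when writing with the Spark datasource. The option keys are real Hudi configs; the table name, metastore URI, and path are placeholders, and whether this avoids the reserved-substring error is exactly what is being asked above.

```python
# Hypothetical write options: sync via the Hive Metastore ("hms") instead of
# the deprecated JDBC DDL executor. Placeholder values are marked as such.
hudi_options = {
    "hoodie.table.name": "ods_example",                  # placeholder
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",           # instead of jdbc
    "hoodie.datasource.hive_sync.metastore.uris": "thrift://metastore-host:9083",  # placeholder
}

# Typical usage (requires a SparkSession and a DataFrame `df`):
# df.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/table")
```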

For the 2nd, there might be no easy way to stay compatible with the historical dataset, because the partition path is a hotspot code path and we may not be able to account for the ramifications of historical values for each record. If you use Flink for ingestion, there is a config option named partition.default_name to switch to another default value as needed.
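As a config fragment (not a tested example), the Flink-side option mentioned above could be set in a table definition like this; the schema and path are illustrative, and only `partition.default_name` is the option named in this thread:

```sql
-- Hypothetical Flink SQL table keeping NULL partition values under 'default',
-- matching the pre-upgrade layout. Schema and path are placeholders.
CREATE TABLE hudi_sink (
  id BIGINT,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tables/hudi_sink',
  'partition.default_name' = 'default'
);
```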

@ad1happy2go, when you have time, could you clarify whether the 1st issue is limited to JDBC sync, which is already deprecated anyway?

CaesarWangX commented 3 months ago

Thanks @danny0405.

  1. Our configuration does not explicitly set hoodie.datasource.hive_sync.mode; after enabling Hive sync, we set hoodie.datasource.hive_sync.jdbcurl.
  2. Unfortunately, we are using Spark 😅, and upon checking the code, I found that this part is hard-coded and a value for the default partition cannot be specified. If I'm wrong, please correct me.

CaesarWangX commented 3 months ago

Actually, the first issue is not the main one. We are more concerned with the issue of default partition values.

danny0405 commented 3 months ago

If possible, maybe you can file a JIRA issue and contribute code to make the Spark default partition value configurable, and I will be glad to review it.

CaesarWangX commented 3 months ago

@danny0405 sure, I can do that.