apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] URI too long error #11446

Open michael1991 opened 3 weeks ago

michael1991 commented 3 weeks ago

Describe the problem you faced

I'm using Spark 3.5 + Hudi 0.15.0 with a partitioned table. When I choose req_date and req_hour as the partition column names, I get the error below, although the task still completes successfully in the end; when I choose date and hour as the partition column names, the error disappears.
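
For context, a minimal sketch of the kind of write that hits this (the table path, record key, and precombine field are illustrative assumptions; only the partition column names come from the description above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-uri-too-long").getOrCreate()

# Illustrative rows; req_date/req_hour are the partition columns that trigger the error.
df = spark.createDataFrame(
    [("id-1", "2024-06-13", "13", 1)],
    ["id", "req_date", "req_hour", "ts"],
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Renaming these columns to date/hour makes the error disappear.
    "hoodie.datasource.write.partitionpath.field": "req_date,req_hour",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("gs://bucket/tables/hudi")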

Expected behavior

There should be no errors when the partition column names are just slightly longer.

Environment Description

Stacktrace

2024-06-13 13:21:13 ERROR PriorityBasedFileSystemView:129 - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: URI Too Long
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:447) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:465) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.lambda$loadPartitions$6e5c444d$1(PriorityBasedFileSystemView.java:187) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:69) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.loadPartitions(PriorityBasedFileSystemView.java:185) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:133) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:174) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:200) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleCleaning(HoodieSparkCopyOnWriteTable.java:212) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:647) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:746) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:843) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:816) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:847) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieWriteClient.autoCleanOnCommit(BaseHoodieWriteClient.java:581) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieWriteClient.mayBeCleanAndArchive(BaseHoodieWriteClient.java:560) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:251) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:108) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1082) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:508) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:473) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) ~[spark-sql-api_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:473) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:449) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) [scala-library-2.12.18.jar:?]
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) [scala-library-2.12.18.jar:?]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) [scala-library-2.12.18.jar:?]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) [spark-core_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1032) [spark-core_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194) [spark-core_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217) [spark-core_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) [spark-core_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1124) [spark-core_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1133) [spark-core_2.12-3.5.0.jar:3.5.0]
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) [spark-core_2.12-3.5.0.jar:3.5.0]
Caused by: org.apache.hudi.org.apache.http.client.HttpResponseException: URI Too Long
    at org.apache.hudi.org.apache.http.impl.client.AbstractResponseHandler.handleResponse(AbstractResponseHandler.java:69) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.client.fluent.Response.handleResponse(Response.java:90) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.org.apache.http.client.fluent.Response.returnContent(Response.java:97) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:189) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:445) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    ... 71 more
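
One reading of the trace (an assumption, not confirmed in this thread): the clean planner asks the remote file-system view, backed by the embedded timeline server, to load partitions over HTTP, and the GET request carrying the partition paths exceeds the server's URI length limit; the writer then falls back to the secondary view, which would explain why the job still succeeds. If that reading is right, one possible way to sidestep the remote call while this is investigated might be to disable the embedded timeline server:

extra_options = {
    # Hedged sketch of a possible workaround (not confirmed in this thread):
    # build the file-system view in-process instead of querying the embedded
    # timeline server over HTTP.
    'hoodie.embed.timeline.server': 'false',
}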
ad1happy2go commented 3 weeks ago

@michael1991 Thanks for raising this. Can you help me reproduce the issue? I tried the code below, but it worked fine for me.

# Reproduction attempt: write a Hudi table partitioned on two long column names.
# Assumes an existing SparkSession `spark`; PATH is the table base path (not
# shown in the original snippet).
from faker import Faker
import pandas as pd

fake = Faker()
data = [{"ID": fake.uuid4(), "EventTime": "2023-03-04 14:44:42.046661",
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "CityNameDummyBigFieldName": fake.city(), "ts": "1",
         "StateNameDummyBigFieldName": fake.state(), "Country": fake.country()} for _ in range(1000)]
pandas_df = pd.DataFrame(data)

hoodie_properties = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.recordkey.field': 'ID',
    'hoodie.datasource.write.partitionpath.field': 'StateNameDummyBigFieldName,CityNameDummyBigFieldName',
    'hoodie.table.name': 'test'
}
spark.sparkContext.setLogLevel("WARN")
df = spark.createDataFrame(pandas_df)
df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)

# Repeated appends to trigger cleaning, where the error was reported.
for i in range(1, 50):
    df.write.format("hudi").options(**hoodie_properties).mode("append").save(PATH)
michael1991 commented 3 weeks ago

Hi @ad1happy2go, glad to hear from you again! Can you try column names containing underscores? I'm not sure whether enabling urlencode for partition paths, combined with partition column names that contain underscores, could cause this. A sketch of what I mean is below.
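
For example, adapting your snippet (a sketch only; the renamed columns are mine, and whether hoodie.datasource.write.partitionpath.urlencode plus underscore names is really the trigger is exactly what I'm unsure about):

# Rename to underscore-style partition columns (illustrative names).
df = df.withColumnRenamed("StateNameDummyBigFieldName", "state_name_dummy") \
       .withColumnRenamed("CityNameDummyBigFieldName", "city_name_dummy")

hoodie_properties = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    # URL-encode partition path values; this plus underscore column names is
    # the combination I suspect (unconfirmed).
    'hoodie.datasource.write.partitionpath.urlencode': 'true',
    'hoodie.datasource.write.recordkey.field': 'ID',
    'hoodie.datasource.write.partitionpath.field': 'state_name_dummy,city_name_dummy',
    'hoodie.table.name': 'test'
}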

ad1happy2go commented 3 weeks ago

@michael1991 How many partitions are in the table? Is it possible to get the URI? I was not able to reproduce this, though.

michael1991 commented 3 weeks ago

@ad1happy2go Partitions are hourly, for example gs://bucket/tables/hudi/r_date=2024-06-17/r_hour=00. The problem only occurs with two partition columns whose names contain underscores; with a single partition column like yyyyMMddHH everything works fine. I'm not sure of the exact cause.

ad1happy2go commented 2 weeks ago

@michael1991 Can you try reproducing the issue with the sample code? That will help us triage it better.