NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.CatalogTable... in Databricks 13.3 runtimes #11184

Open pxLi opened 1 month ago

pxLi commented 1 month ago

Describe the bug: starting on Jul 13, a lot of our integration test (IT) cases began failing on the Databricks 13.3 runtime with the following error:

[2024-07-13T16:01:05.647Z] E                   : java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.CatalogTable.copy(Lorg/apache/spark/sql/catalyst/TableIdentifier;Lorg/apache/spark/sql/catalyst/catalog/CatalogTableType;Lorg/apache/spark/sql/catalyst/catalog/CatalogStorageFormat;Lorg/apache/spark/sql/types/StructType;Lscala/Option;Lscala/collection/Seq;Lscala/Option;Ljava/lang/String;JJLjava/lang/String;Lscala/collection/immutable/Map;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/Seq;ZZLscala/collection/immutable/Map;Lscala/Option;Lscala/Option;Lscala/collection/immutable/Set;Lorg/apache/spark/sql/catalyst/catalog/DeltaRuntimeProperties;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/Seq;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/immutable/Set;Lscala/Option;)Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;
[2024-07-13T16:01:05.647Z] E                    at org.apache.spark.sql.rapids.shims.GpuCreateDataSourceTableAsSelectCommand.run(GpuCreateDataSourceTableAsSelectCommandShims.scala:89)
[2024-07-13T16:01:05.647Z] E                    at com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult$lzycompute(GpuExecutedCommandExec.scala:52)
[2024-07-13T16:01:05.648Z] E                    at com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult(GpuExecutedCommandExec.scala:50)
[2024-07-13T16:01:05.648Z] E                    at com.nvidia.spark.rapids.GpuExecutedCommandExec.executeCollect(GpuExecutedCommandExec.scala:61)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$3(QueryExecution.scala:286)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:166)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$2(QueryExecution.scala:286)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$9(SQLExecution.scala:303)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:533)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:226)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1148)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:155)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:482)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:285)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$withMVTagsIfNecessary(QueryExecution.scala:259)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:280)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:265)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:465)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:69)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:465)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:39)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:339)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:335)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:39)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:39)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:441)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:265)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:395)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:265)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:217)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:214)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:356)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:956)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:797)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:774)
[2024-07-13T16:01:05.648Z] E                    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:654)
[2024-07-13T16:01:05.648Z] E                    at sun.reflect.GeneratedMethodAccessor444.invoke(Unknown Source)
[2024-07-13T16:01:05.648Z] E                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2024-07-13T16:01:05.648Z] E                    at java.lang.reflect.Method.invoke(Method.java:498)
[2024-07-13T16:01:05.648Z] E                    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2024-07-13T16:01:05.648Z] E                    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
[2024-07-13T16:01:05.648Z] E                    at py4j.Gateway.invoke(Gateway.java:306)
[2024-07-13T16:01:05.648Z] E                    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2024-07-13T16:01:05.648Z] E                    at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2024-07-13T16:01:05.648Z] E                    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
[2024-07-13T16:01:05.648Z] E                    at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
[2024-07-13T16:01:05.648Z] E                    at java.lang.Thread.run(Thread.java:750)
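For context: CatalogTable is a Scala case class, so its generated copy(...) method encodes every field in its JVM signature. When a Databricks runtime update adds a field (note the DeltaRuntimeProperties parameter in the signature above), a plugin compiled against the older class throws NoSuchMethodError at runtime even though the Scala source call is unchanged. A minimal diagnostic sketch (the object name is ours, not plugin code) that prints the copy signature a given cluster actually ships, for diffing against the one in the error:

```scala
// Print the copy(...) signature(s) exposed by the CatalogTable class on the
// running cluster's classpath. Comparing this output across clusters, or against
// the signature in the NoSuchMethodError, shows exactly which fields changed.
object CatalogTableCopyProbe {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName("org.apache.spark.sql.catalyst.catalog.CatalogTable")
    cls.getMethods
      .filter(_.getName == "copy")
      .foreach(m => println(m.toGenericString))
  }
}
```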

Example failing cases:


[2024-07-13T16:01:05.648Z] FAILED ../../src/main/python/parquet_test.py::test_buckets[-reader_confs0][DATAGEN_SEED=1720883956, TZ=UTC, IGNORE_ORDER, ALLOW_NON_GPU(DataWritingCommandExec,ExecutedCommandExec,WriteFilesExec)] - py4j.protocol.Py4JJavaError: An error occurred while calling o551437.saveAs...
[2024-07-13T16:01:05.648Z] FAILED ../../src/main/python/parquet_test.py::test_buckets[-reader_confs1][DATAGEN_SEED=1720883956, TZ=UTC, IGNORE_ORDER, ALLOW_NON_GPU(DataWritingCommandExec,ExecutedCommandExec,WriteFilesExec)] - py4j.protocol.Py4JJavaError: An error occurred while calling o551645.saveAs...
[2024-07-13T16:01:05.648Z] FAILED ../../src/main/python/parquet_test.py::test_buckets[-reader_confs2][DATAGEN_SEED=1720883956, TZ=UTC, IGNORE_ORDER, ALLOW_NON_GPU(DataWritingCommandExec,ExecutedCommandExec,WriteFilesExec)] - py4j.protocol.Py4JJavaError: An error occurred while calling 

[2024-07-13T15:01:08.386Z] =========================== short test summary info ============================

[2024-07-13T15:01:08.386Z] FAILED ../../src/main/python/explain_test.py::test_explain_bucketd_scan[DATAGEN_SEED=1720882809, TZ=UTC, ALLOW_NON_GPU(ANY)] - py4j.protocol.Py4JJavaError: An error occurred while calling o735.saveAsTable.
[2024-07-13T15:01:08.386Z] FAILED ../../src/main/python/explain_test.py::test_explain_bucket_column_not_read[DATAGEN_SEED=1720882809, TZ=UTC, ALLOW_NON_GPU(ANY)] - py4j.protocol.Py4JJavaError: An error occurred while calling o839.saveAsTable.

Steps/Code to reproduce bug: run the Parquet test cases on the Databricks 13.3 runtime.

Expected behavior: the tests pass.


pxLi commented 1 month ago

UPDATE: this reproduces only on the Azure 13.3 runtime.

sameerz commented 1 month ago

@pxLi are we building the DBR 13.3 shim on AWS Databricks and then running tests on Azure Databricks?

jlowe commented 1 month ago

At first I could not replicate this on Azure Databricks, but then I discovered I was using a different Azure Databricks URL than the one CI is using. When I use the same Azure Databricks URL I'm able to replicate the issue, which implies a change that breaks the plugin has been pushed to one Azure Databricks workspace but not another.

pxLi commented 1 month ago

> @pxLi are we building the DBR 13.3 shim on AWS Databricks and then running tests on Azure Databricks?

Yes, we do. We run build+deploy after the tests pass on the AWS Databricks runtime, and every two weekdays we run IT on Azure to double-check that our plugin works on Databricks instances across CSPs. Apparently this time the Azure 13.3 LTS runtime is not identical to the AWS one, and as Jason mentioned, different URLs can resolve to different runtimes even within Azure.

pxLi commented 1 month ago

Now it fails on the AWS 13.3 runtime too. Looks like the updated runtime has been rolled out...

Current hashes from `select current_version();`:

azure 13.3:
{"dbr_version":null,"dbsql_version":null,"u_build_hash":"80cb8aa4b7284dc3c0f8047e102517d3f6326f84","r_build_hash":"4e8b4bdede528ea22ac005b80e72035f5cd0b293"}
aws 13.3:
{"dbr_version":null,"dbsql_version":null,"u_build_hash":"80cb8aa4b7284dc3c0f8047e102517d3f6326f84","r_build_hash":"4e8b4bdede528ea22ac005b80e72035f5cd0b293"}

Unfortunately, we didn't record the hashes of previous images for comparison.
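For future incidents, the hashes could be logged automatically at the start of each CI run. A minimal sketch, assuming a Databricks environment (current_version() is Databricks-specific and does not exist in vanilla Apache Spark; the object name is ours):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: record the runtime build hashes at the start of a CI run so
// that later image drift can be diagnosed. current_version() returns a struct
// whose fields include u_build_hash and r_build_hash, the values compared above.
object LogRuntimeVersion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val row = spark.sql("SELECT current_version()").collect().head
    println(s"current_version: $row")
  }
}
```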

It turns out everything works fine when both build and test run on the same runtime. The error seems to be caused by inconsistent runtime versions stemming from Databricks' way of rolling out runtime upgrades (we failed because the artifact was built on an older AWS runtime but tested on an upgraded Azure runtime, which we cannot control).

And now AWS (our CI region) got upgraded two or three days after the Azure one.

pxLi commented 1 month ago

Closing, as all 13.3 runtimes have become consistent across the AWS and Azure regions that we are using.

jlowe commented 1 month ago

Saw this again in the nightly Azure Databricks test run.

jlowe commented 1 month ago

The most recent failure may be related to using a stale artifact that was built before the runtime was updated. Leaving this open until the nightly Azure Databricks test pipeline succeeds.

pxLi commented 1 month ago

> Saw this again in the nightly Azure Databricks test run.

I think you were seeing this with the 24.06.0 jar (post-release test). We are preparing 24.06.1 and will release it soon to fix the issue.

I will close this after 24.06.1 is released.

pxLi commented 1 month ago

24.06.1 has been released and passed the post-release CI today.

Please file a new ticket if any new issues arise due to the DB runtime upgrade. Thank you!

pxLi commented 1 month ago

OK, now we've met the same issue in the other direction: the newly built plugin cannot support the old runtime (users may still be running clusters with old images, or DB may not plan to update a given CSP region; they will not share the plan/cadence with us), and we do not have the old/specific images available to rebuild the plugin with the latest fix.

The alternatives are to keep DB clusters alive forever for each hash version, ask users to build the DB shims directly in their environment, or force them to stop all jobs and restart the cluster (or create a new one) to pick up the new runtime image.

ref: https://docs.databricks.com/en/release-notes/runtime/maintenance-updates.html#databricks-runtime-maintenance-updates

To add a maintenance update to an existing cluster, restart the cluster. For the maintenance updates on unsupported Databricks Runtime versions, see Maintenance updates for Databricks Runtime (archived).

cc @sameerz @GaryShen2008

sameerz commented 1 month ago

Going forward we need to consider supporting both the old and the new APIs when changes are made in the Databricks environments. Otherwise users will face problems taking a newer jar and running it on an already existing, older cluster.
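One possible direction, sketched below purely for illustration (this is not the plugin's actual shim mechanism, and it assumes the reflection overhead is acceptable): avoid compiling the copy(...) arity into the bytecode at all. Scala materializes each copy parameter's default value as a no-arg copy$default$N method, so a reflective caller can rebuild the full argument list against whatever CatalogTable shape the cluster ships.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hedged sketch: perform CatalogTable.copy() without hard-coding its signature.
// For a case class, each copy parameter defaults to the current field value, and
// the compiler emits those defaults as no-arg copy$default$N accessors, so the
// argument list can be rebuilt at runtime for any arity. To change a field,
// overwrite the argument at that field's position before invoking.
object CatalogTableCompat {
  def identityCopy(table: CatalogTable): CatalogTable = {
    val copyMethod = table.getClass.getMethods.find(_.getName == "copy").get
    val args: Array[AnyRef] = (1 to copyMethod.getParameterCount).map { i =>
      table.getClass.getMethod(s"copy$$default$$$i").invoke(table)
    }.toArray
    copyMethod.invoke(table, args: _*).asInstanceOf[CatalogTable]
  }
}
```

The cost is reflection on the write path and the loss of compile-time checking, which is why per-runtime shims remain the primary mechanism; a fallback like this only buys tolerance to the build/test image drift described above.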

pxLi commented 1 month ago

Moving this to 24.10 for further discussion if needed.