pxLi opened 1 month ago
UPDATE: this repros only on the Azure 13.3 runtime
@pxLi are we building the DBR 13.3 shim on AWS Databricks and then running tests on Azure Databricks?
At first I could not replicate this on Azure Databricks, but then I discovered I was using a different Azure Databricks URL than the one CI uses. With the same Azure Databricks URL I am able to replicate the issue, which implies a change has been pushed to one Azure Databricks workspace but not another, and that change breaks the plugin.
@pxLi are we building the DBR 13.3 shim on AWS Databricks and then running tests on Azure Databricks?
Yes we do. We run the build+deploy after the tests pass on the AWS Databricks runtime, and we run the integration tests on Azure every 2 weekdays to double-check that our plugin works on DB instances from different CSPs. Apparently this time the Azure 13.3 LTS runtime is not identical to the AWS one, and as Jason mentioned, different URLs can resolve to different runtimes even within Azure.
Now it fails on the AWS 13.3 runtime too. It looks like the runtime update has been rolled out there as well...
Current hashes from select current_version():
azure 13.3:
{"dbr_version":null,"dbsql_version":null,"u_build_hash":"80cb8aa4b7284dc3c0f8047e102517d3f6326f84","r_build_hash":"4e8b4bdede528ea22ac005b80e72035f5cd0b293"}
aws 13.3:
{"dbr_version":null,"dbsql_version":null,"u_build_hash":"80cb8aa4b7284dc3c0f8047e102517d3f6326f84","r_build_hash":"4e8b4bdede528ea22ac005b80e72035f5cd0b293"}
Unfortunately, we didn't record the hashes of previous images for comparison.
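For future runs, the hash comparison above can be done mechanically. A minimal sketch in Java, using the two JSON strings reported above; the regex-based field extraction is an assumption about the flat shape of the current_version() output, not an official parser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompareBuildHashes {
    // Output of `SELECT current_version()` on each cloud, as reported above.
    public static final String AZURE =
        "{\"dbr_version\":null,\"dbsql_version\":null,\"u_build_hash\":\"80cb8aa4b7284dc3c0f8047e102517d3f6326f84\",\"r_build_hash\":\"4e8b4bdede528ea22ac005b80e72035f5cd0b293\"}";
    public static final String AWS =
        "{\"dbr_version\":null,\"dbsql_version\":null,\"u_build_hash\":\"80cb8aa4b7284dc3c0f8047e102517d3f6326f84\",\"r_build_hash\":\"4e8b4bdede528ea22ac005b80e72035f5cd0b293\"}";

    // Pull a named hash field out of the flat JSON (assumes hex-string values).
    public static String hash(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\":\"([0-9a-f]+)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        boolean consistent =
            hash(AZURE, "u_build_hash").equals(hash(AWS, "u_build_hash"))
            && hash(AZURE, "r_build_hash").equals(hash(AWS, "r_build_hash"));
        System.out.println(consistent ? "runtimes consistent" : "HASH MISMATCH");
    }
}
```

Persisting these extracted hashes per nightly run would give the historical record that was missing here.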
It turns out everything works fine when both the build and the tests run on the same runtime. The error stems from inconsistent runtime versions caused by the way Databricks rolls out runtime upgrades (we failed because we built the artifact on an older AWS runtime but tested on an upgraded Azure runtime, which we cannot control).
And now AWS (our CI region) got upgraded 2-3 days after the Azure one.
Closing, as all 13.3 runtimes are now consistent across the AWS and Azure regions we use.
Saw this again in the nightly Azure Databricks test run.
The most recent failure may be caused by a stale artifact built before the runtime update. Leaving this open until the nightly Azure Databricks test pipeline succeeds.
Saw this again in the nightly Azure Databricks test run.
I think you were seeing this with the 24.06.0 jar (post-release test); we are preparing 24.06.1 and will release it soon to fix the issue.
I will close this once 24.06.1 is released.
24.06.1 has been released and passed the post-release CI today.
Please file a new ticket if any new issues arise due to the DB runtime upgrade. Thank you!
OK, now we have hit the same issue in reverse: the newly built plugin cannot support an old runtime (users may still run clusters with old images, or DB may not plan to update their CSP regions, and they do not share the plan/cadence with us), and we do not have the old/specific images available to rebuild the plugin with the latest fix,
unless we keep DB clusters around forever for every hash version, ask users to build the DB shims directly in their environments, or force them to stop all jobs and restart the cluster (or create a new one) to pick up the new runtime image.
To add a maintenance update to an existing cluster, restart the cluster.
For the maintenance updates on unsupported Databricks Runtime versions,
see Maintenance updates for Databricks Runtime (archived).
cc @sameerz @GaryShen2008
Going forward we need to consider supporting both the old and the new APIs for changes made in the Databricks environments. Otherwise users will face problems running a newer jar on an already existing, older cluster.
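One way to support both API shapes is to resolve the method reflectively instead of linking against one signature at compile time: reflection raises a catchable NoSuchMethodException, unlike the NoSuchMethodError we get from a hard-linked call. A minimal sketch in plain Java; OldRuntime/NewRuntime are hypothetical stand-ins for the two CatalogTable.copy shapes, and a real shim would probe the actual Spark classes on the cluster's classpath:

```java
import java.lang.reflect.Method;

public class ShimDispatch {
    // Hypothetical stand-in for a runtime exposing only the old signature.
    public static class OldRuntime {
        public String copy(String name) { return "old:" + name; }
    }

    // Hypothetical stand-in for a runtime that added an extra parameter.
    public static class NewRuntime {
        public String copy(String name, boolean flag) { return "new:" + name + ":" + flag; }
    }

    // Probe for the new signature first and fall back to the old one,
    // so the same jar works on runtimes either side of the upgrade.
    public static String callCopy(Object runtime, String name) {
        Class<?> cls = runtime.getClass();
        try {
            Method m = cls.getMethod("copy", String.class, boolean.class);
            return (String) m.invoke(runtime, name, false);
        } catch (NoSuchMethodException missingNewApi) {
            try {
                Method m = cls.getMethod("copy", String.class);
                return (String) m.invoke(runtime, name);
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callCopy(new OldRuntime(), "t")); // old:t
        System.out.println(callCopy(new NewRuntime(), "t")); // new:t:false
    }
}
```

The probe can be done once at plugin startup and cached, so the per-call cost of reflection is paid only at initialization.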
Moving this to 24.10 for further discussion if needed.
Describe the bug: starting on Jul 13, a lot of our IT cases began failing on the DB 13.3 runtime with:
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.catalog.CatalogTable.copy(Lorg/apache/spark/sql/catalyst/TableIdentifier;Lorg/apache/spark/sql/catalyst/catalog/CatalogTableType;Lorg/apache/spark/sql/catalyst/catalog/CatalogStorageFormat;Lorg/apache/spark/sql/types/StructType;Lscala/Option;Lscala/collection/Seq;Lscala/Option;Ljava/lang/String;JJLjava/lang/String;Lscala/collection/immutable/Map;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/Seq;ZZLscala/collection/immutable/Map;Lscala/Option;Lscala/Option;Lscala/collection/immutable/Set;Lorg/apache/spark/sql/catalyst/catalog/DeltaRuntimeProperties;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/Seq;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/immutable/Set;Lscala/Option;)Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;
e.g. cases
Steps/Code to reproduce bug: run the parquet cases on the Databricks 13.3 runtime.
Expected behavior: the tests pass.
Environment details: Databricks 13.3 LTS runtime (AWS and Azure).