NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Improve Databricks runtime shim detection #8587

Open gerashegalov opened 1 year ago

gerashegalov commented 1 year ago

We currently rely on the prefix of the version strings

https://github.com/NVIDIA/spark-rapids/blob/8b752454ca27557bcfebc5a98d459f039ee5bee0/sql-plugin/src/main/spark332db/scala/com/nvidia/spark/rapids/shims/spark332db/SparkShimServiceProvider.scala#L34-L36

whose values are documented in the spark-versions API. These versions are effectively wildcards for the latest patch of a given major.minor release, such as 11.3.x.

Thus, a user of an older rapids-4-spark artifact may hit a runtime bug, or worse, a silent defect, instead of a clear, actionable message as implemented in #8521.
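For illustration, a minimal sketch of what the prefix-based matching above amounts to (the environment variable name and prefix list are assumptions for the example, not the shim's actual code):

```scala
// Illustrative only: any 11.3.x patch release matches the same prefix, so
// patch-level runtime changes are invisible to this kind of check.
object PrefixShimMatchSketch {
  // Version string reported by the Databricks runtime, e.g. "11.3"
  private val dbrVersion: String =
    sys.env.getOrElse("DATABRICKS_RUNTIME_VERSION", "")

  // Hypothetical stand-in for the shim's list of supported version prefixes
  private val supportedPrefixes: Seq[String] = Seq("11.3")

  def matchesVersion(): Boolean =
    supportedPrefixes.exists(p => dbrVersion.startsWith(p))
}
```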

Spark UI on DBR displays "Build Properties" in the Environment tab:

| Name | Value |
| --- | --- |
| Runtime Build Hash | 383fa9ccdbf99891a97ff2c546d4330d923a6d82 |
| Universe Build Hash | e3f8b198b7c7c313f95719b7f41d3503780d4a4d |

These values correspond to

- `org.apache.spark.BuildInfo.gitHash`
- `com.databricks.BuildInfo.gitHash`

in a Scala notebook, which can be utilized in the patch version detection.
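For example, a sketch of a Scala notebook cell on DBR that reads these values (assuming both Databricks-provided BuildInfo objects are on the classpath, as described above):

```scala
// Databricks Scala notebook cell (sketch): read the same hashes the Spark UI
// shows under "Build Properties" in the Environment tab.
val runtimeBuildHash  = org.apache.spark.BuildInfo.gitHash   // Runtime Build Hash
val universeBuildHash = com.databricks.BuildInfo.gitHash     // Universe Build Hash
println(s"Runtime Build Hash:  $runtimeBuildHash")
println(s"Universe Build Hash: $universeBuildHash")
```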

gerashegalov commented 8 months ago

We can also improve the reliability of the released spark-rapids jars by using the nightly pipeline of the pending release.

We know that the semi-monthly/bi-weekly maintenance updates to Databricks Runtimes can break released spark-rapids plugin code. Usually the breakage is more subtle than an outright API break: https://github.com/NVIDIA/spark-rapids/pull/10070#issuecomment-1862013962.

Ideally we want to retest the jar version that customers already use after every maintenance update. However, testing is time consuming, so we do not want to retest the last N releases nightly. Say we do it on a weekly schedule instead; if, due to unfortunate sequencing, the test runs just before a DBR update is pushed, it may take almost another week for the next run to catch new issues.

However, we can utilize the fact that our pending release runs nightly tests on DBR to detect whether we need to kick off tests of the released artifacts.

We can maintain a table mapping DB buildver to the last tested build hashes:

| DB buildver | DBR hashes tested |
| --- | --- |
| spark321db | |
| spark330db | |
| spark332db | |
| spark341db | |

Somewhere in the source code we will have a test, or ./integration_tests/run_pyspark_from_build.sh, log the current values of org.apache.spark.BuildInfo.gitHash and com.databricks.BuildInfo.gitHash.

Then the CI can compare them to the last known values for the DB shim based on the table, kick off a pipeline for the released test jars automatically, and update the table. This should shorten the detection window to a couple of days.
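A hypothetical sketch of that comparison step (the object name, table contents, and trigger mechanism are placeholders, not existing CI code; the example hashes are the Build Properties values quoted above):

```scala
// Hypothetical CI-side check: compare the hashes logged by the nightly run
// against the hashes the released jars were last tested with, and decide
// whether to kick off the released-artifacts pipeline.
object DbrHashComparisonSketch {
  // buildver -> (Runtime Build Hash, Universe Build Hash) last tested
  val lastTested: Map[String, (String, String)] = Map(
    "spark332db" -> ("383fa9ccdbf99891a97ff2c546d4330d923a6d82",
                     "e3f8b198b7c7c313f95719b7f41d3503780d4a4d")
  )

  def needsReleasedJarRetest(buildver: String,
                             runtimeHash: String,
                             universeHash: String): Boolean =
    lastTested.get(buildver) match {
      case Some((r, u)) => r != runtimeHash || u != universeHash
      case None         => true // unknown buildver: retest to be safe
    }
}
```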

gerashegalov commented 1 month ago

Update: the P0 part of this issue is to log the details of

- `org.apache.spark.BuildInfo`
- `com.databricks.BuildInfo`

and potentially more details as documented for the SQL function `SELECT current_version()`: https://docs.databricks.com/en/sql/language-manual/functions/current_version.html#returns

This should be logged via the Databricks shim service providers (`com.nvidia.spark.rapids.shims.spark3XYdb.SparkShimServiceProvider`) and in the CI logs.
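A minimal sketch of what that logging could look like (the object name and log wording are illustrative, and it assumes the Databricks-only BuildInfo classes are present at runtime; this is not the shim's actual code):

```scala
import org.apache.spark.internal.Logging

// Illustrative only: log the Databricks build identifiers so they appear in
// the driver logs and, by extension, in the CI logs.
object DatabricksBuildInfoLogSketch extends Logging {
  def logBuildDetails(): Unit = {
    logInfo(s"Databricks Runtime Build Hash: ${org.apache.spark.BuildInfo.gitHash}")
    logInfo(s"Databricks Universe Build Hash: ${com.databricks.BuildInfo.gitHash}")
    // Additional details returned by SELECT current_version() could be logged
    // once a SparkSession is available, per the Databricks docs linked above.
  }
}
```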

pxLi commented 1 month ago

link to https://github.com/NVIDIA/spark-rapids/issues/11184