NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Verify the downloaded spark tarball on Databricks #11580

Open razajafri opened 1 month ago

razajafri commented 1 month ago

Describe the bug
Recently, I tried running the Databricks tests with jenkins/databricks/test.sh, which failed because the spark-3.2.0 tarball was corrupted during download.

Steps/Code to reproduce bug
1. Build for Databricks by running jenkins/databricks/build.sh
2. Run the tests by running jenkins/databricks/test.sh

Expected behavior
The tests should run, or we should skip the Spark 3.2.0 tests if the tarball is corrupt.

gerashegalov commented 1 month ago

There are a few places where we download archives manually without following the best practice of verifying the checksum. Verifying it would let the CI pipeline fail early with a meaningful error message.

In this case: https://github.com/NVIDIA/spark-rapids/blob/11964aee01d9e43aeddad585440bb8a79611e45e/jenkins/databricks/test.sh#L97

For this particular release, the sha512 sum is published in an unusual format:

https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512

spark-3.2.0-bin-hadoop3.2.tgz: EBE51A44 9EBD070B E7D35709 31044070 E53C2307
                               6ABAD233 B3C51D45 A7C99326 CF55805E E0D573E6
                               EB7D6A67 CFEF1963 CD77D6DC 07DD2FD7 0FD60DA9
                               D1F79E5E

so it cannot be passed directly to sha512sum -c,

but we can see it is the same checksum as the one computed locally:

$ sha512sum -b spark-3.2.0-bin-hadoop3.2.tgz
ebe51a449ebd070be7d3570931044070e53c23076abad233b3c51d45a7c99326cf55805ee0d573e6eb7d6a67cfef1963cd77d6dc07dd2fd70fd60da9d1f79e5e *spark-3.2.0-bin-hadoop3.2.tgz
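
One possible way to bridge the two formats is to normalize the published digest before comparing. The following is only a sketch, not what test.sh currently does. The helper names normalize_digest and verify_tarball are hypothetical, and it assumes the .sha512 file uses the "name: HEX HEX ..." layout shown above (a file in the standard "hash  filename" layout would need different handling).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: collapse a digest file of the form
#   "name: AAAA BBBB ...\n   CCCC DDDD ..."
# into a single lowercase hex string.
normalize_digest() {
  # Drop the "name:" prefix, remove all whitespace, lowercase the hex digits.
  sed 's/^[^:]*://' "$1" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: return non-zero with a clear message when the
# tarball's sha512 does not match the normalized published digest.
verify_tarball() {
  local tarball="$1" digest_file="$2"
  local expected actual
  expected="$(normalize_digest "$digest_file")"
  actual="$(sha512sum -b "$tarball" | cut -d' ' -f1)"
  if [[ "$expected" != "$actual" ]]; then
    echo "Checksum mismatch for $tarball" >&2
    return 1
  fi
}
```

With both files downloaded, a call such as `verify_tarball spark-3.2.0-bin-hadoop3.2.tgz spark-3.2.0-bin-hadoop3.2.tgz.sha512` placed before extraction would stop the script early on a corrupted download.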