There are a few places where we download archives manually without following the best practice of verifying the checksum. Verifying it would let the CI pipeline fail early with a meaningful error message;
in this case https://github.com/NVIDIA/spark-rapids/blob/11964aee01d9e43aeddad585440bb8a79611e45e/jenkins/databricks/test.sh#L97
For this particular release the sha512 sum is published in an unusual format:
https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512
spark-3.2.0-bin-hadoop3.2.tgz: EBE51A44 9EBD070B E7D35709 31044070 E53C2307
6ABAD233 B3C51D45 A7C99326 CF55805E E0D573E6
EB7D6A67 CFEF1963 CD77D6DC 07DD2FD7 0FD60DA9
D1F79E5E
so it cannot be passed directly to sha512sum -c,
but we can see it is the same checksum that sha512sum computes locally:
$ sha512sum -b spark-3.2.0-bin-hadoop3.2.tgz
ebe51a449ebd070be7d3570931044070e53c23076abad233b3c51d45a7c99326cf55805ee0d573e6eb7d6a67cfef1963cd77d6dc07dd2fd70fd60da9d1f79e5e *spark-3.2.0-bin-hadoop3.2.tgz
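A minimal sketch of how the download step in test.sh could verify the archive (assuming GNU coreutils sha512sum; the variable names below are illustrative, not taken from the script):

SPARK_TGZ=spark-3.2.0-bin-hadoop3.2.tgz
SHA512_URL=https://archive.apache.org/dist/spark/spark-3.2.0/${SPARK_TGZ}.sha512

wget -q "${SHA512_URL}" -O "${SPARK_TGZ}.sha512"
# Normalize the Apache layout ("file: AAAA BBBB ...") into "hash  file" for sha512sum -c:
# drop the "file:" prefix, strip all whitespace, and lowercase the hex digits.
EXPECTED=$(sed "s/^${SPARK_TGZ}://" "${SPARK_TGZ}.sha512" | tr -d '[:space:]' | tr 'A-F' 'a-f')
if ! echo "${EXPECTED}  ${SPARK_TGZ}" | sha512sum -c -; then
    echo "ERROR: checksum mismatch for ${SPARK_TGZ}, aborting" >&2
    exit 1
fi

With a check like this, a corrupted download fails the pipeline immediately with a clear message instead of surfacing later as an unrelated test failure.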
Describe the bug
Recently, I tried running the Databricks tests by running
jenkins/databricks/test.sh
which failed because the spark-3.2.0 tarball was corrupted during download.

Steps/Code to reproduce bug
Build Databricks by running
jenkins/databricks/build.sh
Run the tests by running
jenkins/databricks/test.sh
Expected behavior
The tests should run, or we should bypass the Spark 3.2.0 tests if the tarball is corrupt.
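A possible shape for the bypass (the verify_spark_tgz helper and SKIP_SPARK_320_TESTS flag below are hypothetical, only meant to illustrate the fallback):

if verify_spark_tgz "${SPARK_TGZ}"; then
    tar zxf "${SPARK_TGZ}"
else
    echo "WARNING: ${SPARK_TGZ} failed checksum verification; skipping Spark 3.2.0 tests" >&2
    SKIP_SPARK_320_TESTS=1
fi

That said, failing fast on a bad checksum (as in the sketch above) is probably preferable to silently skipping coverage.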