Closed bcheena closed 4 months ago
/gcbrun
upgrade_kernel method has been taken from https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474.
oh gosh. I forgot that we put that into production. So sketchy...
Thanks for your comments @cjac! I had assumed that the upgrade_kernel() function in https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474 was working as intended. I now see that it reruns all startup scripts and initialization actions and might leave the cluster in an unexpected state.
Well, I tried creating a 2.0-rocky8 cluster today (4th Dec) and somehow the running kernel version was already upgraded to 4.18.0-513.9.1.el8_9.x86_64. The current workaround can be to skip this upgrade_kernel method altogether for now, but we should definitely revisit this later in a proper way - by adding checks in the agent to skip if the startup script has already run once.
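For illustration, here is a minimal sketch of the "skip if it has already run once" idea. The comment above proposes doing this check in the agent; this sketch does it on the script side with a marker file instead, which is an assumption and not how the agent works today. The `run_once` helper and the marker path are hypothetical names.

```shell
# Hypothetical helper: run a command only if its marker file does not exist yet.
# The marker is created only when the command succeeds, so a failed run is retried.
run_once() {
  marker="$1"
  shift
  if [ -e "${marker}" ]; then
    echo "skipping, already ran: ${marker}"
    return 0
  fi
  "$@" && touch "${marker}"
}

# Usage sketch (the marker path is an assumption; pick a location that survives reboots):
# run_once /var/lib/init-action-kernel-upgrade.done upgrade_kernel
```

Because a marker on persistent disk survives the reboot, a rerun of the initialization action after a reboot would skip the guarded step rather than repeat it.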
/gcbrun
dataproc-initialization-actions-presubmit-pr seems to be failing with an unrelated error.
Looks like gcloud config get-value project is unable to fetch the project-id cloud-dataproc-ci? Not sure what changed - I can see one more PR failing with the same error.
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424567803Z ==================== Test output for //gpu:test_gpu (shard 3 of 15):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424574406Z Running tests under Python 3.8.10: /usr/bin/python3
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424581292Z [ FAILED ] setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424588473Z ======================================================================
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424595301Z ERROR: setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424602035Z ----------------------------------------------------------------------
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424608389Z Traceback (most recent call last):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424615416Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/__main__/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/__main__/integration_tests/dataproc_test_case.py", line 62, in setUpClass
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424622487Z assert cls.PROJECT
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424629333Z AssertionError
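To make this kind of failure easier to spot, a small shell check can fail early with a clear message when the project id comes back empty. This is only a debugging sketch; the `require_project` helper is a hypothetical name and is not part of the test harness.

```shell
# Hypothetical helper: print the project id, or fail with a readable error if it is empty.
require_project() {
  if [ -z "$1" ]; then
    echo "project id is empty; check the active gcloud configuration" >&2
    return 1
  fi
  echo "$1"
}

# Usage sketch:
# PROJECT="$(require_project "$(gcloud config get-value project 2>/dev/null)")" || exit 1
```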
Hey there Cheena,
I've been talking with Gregory from Rocky. I think I should set up a call with them, Nvidia, and some representatives from the Dataproc team to discuss the problem.
I hope this helps me to remember to set it up!
C.J.
This is looking good. I'll test it next.
These changes have been tested with
We should exclude systemd from dnf update.
4.18.0-513.9.1.el8_9.x86_64, which does not match the running kernel version 4.18.0-477.27.1.el8_8.x86_64. There should be a condition to check if the kernel needs to be upgraded.
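A minimal sketch of such a condition, assuming the target kernel version is already known to the script: compare it with the output of `uname -r` and upgrade only on a mismatch. The `kernel_needs_upgrade` helper and the `target_kernel` variable are hypothetical names, and how the target version is determined is left out here.

```shell
# Hypothetical helper: succeeds (exit 0) only when the running kernel
# version string differs from the target version string.
kernel_needs_upgrade() {
  running="$1"
  target="$2"
  [ "${running}" != "${target}" ]
}

# Usage sketch (target_kernel is assumed to be set elsewhere in the script):
# if kernel_needs_upgrade "$(uname -r)" "${target_kernel}"; then
#   upgrade_kernel
# fi
```

This is a plain string comparison, so it only detects a mismatch; it does not decide which of the two versions is newer.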
upgrade_kernel method has been taken from https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474.