GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

Fix install_gpu_driver.sh failures in rocky 2.0 and 2.1 images #1116

Closed bcheena closed 4 months ago

bcheena commented 12 months ago
  1. The install_gpu_driver.sh init script fails with the following error on the 2.0 and 2.1 Rocky images:
    ++ dnf -y -q update
    Error: 
    Problem: The operation would result in removing the following protected packages: systemd

We should exclude systemd from dnf update.
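
One way the exclusion could look is sketched below. `update_excluding` and the `DRY_RUN` switch are hypothetical names used only for illustration; they are not part of install_gpu_driver.sh. The sketch relies on dnf's `--exclude` option, which accepts a package name or glob.

```shell
# Sketch only: update_excluding is a hypothetical helper, not a function
# from install_gpu_driver.sh.
# Usage: update_excluding PATTERN...
# Set DRY_RUN=1 to print the dnf command instead of running it.
update_excluding() {
  # Rewrite each positional argument as --exclude=<argument>.
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" "--exclude=$1"
    shift
    n=$((n - 1))
  done
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "dnf -y -q update $*"
  else
    dnf -y -q update "$@"
  fi
}
```

Calling `update_excluding 'systemd*'` would then run the same update as before while leaving the protected systemd packages untouched.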

  2. In 2.0 Rocky, the available version of kernel-devel is 4.18.0-513.9.1.el8_9.x86_64, which does not match the running kernel version 4.18.0-477.27.1.el8_8.x86_64:
    ++ dnf -y -q install kernel-devel-4.18.0-477.27.1.el8_8.x86_64
    Error: Unable to find a match: kernel-devel-4.18.0-477.27.1.el8_8.x86_64

There should be a condition to check if the kernel needs to be upgraded.
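
A minimal sketch of such a condition follows, assuming the caller already knows which kernel-devel version the repositories offer. `kernel_devel_mismatch` is a hypothetical name, and the two version strings are the ones reported above.

```shell
# Sketch only: kernel_devel_mismatch is a hypothetical helper.
# Returns 0 (true) when the running kernel differs from the kernel-devel
# version on offer, i.e. when installing kernel-devel for the running
# kernel cannot find a match.
kernel_devel_mismatch() {
  running="$1"     # e.g. the output of `uname -r`
  available="$2"   # version-release.arch of the kernel-devel package on offer
  [ "$running" != "$available" ]
}

if kernel_devel_mismatch "4.18.0-477.27.1.el8_8.x86_64" "4.18.0-513.9.1.el8_9.x86_64"; then
  echo "kernel-devel does not match the running kernel"
fi
```

In the init script the first argument would come from `uname -r`; how the available version is looked up is left open here.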

upgrade_kernel method has been taken from https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474.

bcheena commented 12 months ago

/gcbrun

bcheena commented 12 months ago

/gcbrun

bcheena commented 12 months ago

/gcbrun

cjac commented 12 months ago

/gcbrun

cjac commented 12 months ago

> upgrade_kernel method has been taken from https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474.

Oh gosh. I forgot that we put that into production. So sketchy...

bcheena commented 11 months ago

Thanks for your comments @cjac! I kind of assumed that the upgrade_kernel() function in https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474 was working as intended. I now see that this reruns all startup scripts and initialization actions and might leave the cluster in an unexpected state.

Well, I tried creating a 2.0-rocky8 cluster today (4th Dec) and somehow the running kernel version was already upgraded to 4.18.0-513.9.1.el8_9.x86_64. The current workaround can be to skip the upgrade_kernel method altogether for now, but we should definitely revisit this later and do it properly, by adding checks in the agent to skip the startup script if it has already run once.
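
For illustration only, a run-once check of this kind is often built around a marker file. The sketch below is not the Dataproc agent's logic; `run_once` and the marker path in the usage note are hypothetical.

```shell
# Sketch only: run_once is a hypothetical guard, not agent code.
# Usage: run_once MARKER_FILE COMMAND [ARGS...]
# Runs COMMAND the first time and records success by creating MARKER_FILE;
# later calls see the marker and skip the command.
run_once() {
  marker="$1"
  shift
  if [ -e "$marker" ]; then
    echo "skipping: ${marker} already exists"
    return 0
  fi
  # Create the marker only if the command succeeded.
  "$@" && touch "$marker"
}
```

A call such as `run_once /var/tmp/startup-script.done my_startup_script` would run the script on first boot and skip it after a reboot, provided the marker lives on a disk that survives the reboot.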

bcheena commented 11 months ago

/gcbrun

bcheena commented 11 months ago

dataproc-initialization-actions-presubmit-pr seems to be failing with an unrelated error.

Looks like gcloud config get-value project is unable to fetch the project-id cloud-dataproc-ci? Not sure what changed - I can see one more PR failing with the same error.

Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424567803Z ==================== Test output for //gpu:test_gpu (shard 3 of 15):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424574406Z Running tests under Python 3.8.10: /usr/bin/python3
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424581292Z [  FAILED  ] setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424588473Z ======================================================================
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424595301Z ERROR: setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424602035Z ----------------------------------------------------------------------
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424608389Z Traceback (most recent call last):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424615416Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/__main__/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/__main__/integration_tests/dataproc_test_case.py", line 62, in setUpClass
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424622487Z     assert cls.PROJECT
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424629333Z AssertionError

cjac commented 11 months ago

Hey there Cheena,

I've been talking with Gregory from Rocky. I think I should set up a call with them, Nvidia, and some representatives from the Dataproc team to discuss the problem.

I hope this helps me to remember to set it up!

C.J.


cjac commented 4 months ago

This is looking good. I'll test it next.

cjac commented 4 months ago

These changes have been tested with