GoogleCloudDataproc / initialization-actions

Runs on all nodes of your cluster before the cluster starts; lets you customize your cluster.
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

[gpu] Driver installation breaking in Dataproc 2.1 image during initialization #1189

Closed: santhoshvly closed this 4 months ago

santhoshvly commented 5 months ago

Hi Team,

I was able to attach GPUs to a Dataproc 2.1 cluster, and it worked fine after disabling secure boot. I am using the latest install_gpu_driver.sh from this repository, but I am now getting the following error during cluster initialization:

++ lsb_release -is
++ tr '[:upper:]' '[:lower:]'

ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.

ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.

ERROR: The nvidia kernel module was not created.

ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

I can also see the following error in the file /var/log/nvidia-installer.log on one of the cluster machines.

ld -r -o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nv-modeset-interface.o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nvidia-modeset-linux.o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nv-kthread-q.o
  LD [M]  /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset.o
  LD [M]  /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-peermem.o
  MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2
-> Checking to see whether the nvidia kernel module was successfully built
   executing: 'cd ./kernel; /opt/conda/default/bin/make -k -j8 NV_KERNEL_MODULES="nvidia" NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/5.10.0-30-cloud-amd64/source" SYSOUT="/lib/modules/5.10.0-30-cloud-amd64/build"'...
make[1]: Entering directory '/usr/src/linux-headers-5.10.0-30-common'
make[2]: Entering directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
scripts/Makefile.lib:8: 'always' is deprecated. Please use 'always-y' instead
  MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2
-> Error.
ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.
-> The command 'cd ./kernel; /opt/conda/default/bin/make -k -j8 NV_KERNEL_MODULES="nvidia" NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/5.10.0-30-cloud-amd64/source" SYSOUT="/lib/modules/5.10.0-30-cloud-amd64/build"' failed with the following output:

make[1]: Entering directory '/usr/src/linux-headers-5.10.0-30-common'
make[2]: Entering directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
scripts/Makefile.lib:8: 'always' is deprecated. Please use 'always-y' instead
  MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Is anyone facing a similar issue with driver installation on Dataproc 2.1/2.2 clusters?

cjac commented 5 months ago

I could change the default driver and CUDA versions on 2.1 images to be more current.

santhoshvly commented 5 months ago

@cjac Thank you! Is there a specific CUDA and driver version to try as a workaround to get past this error in 2.1 images now?

cjac commented 5 months ago

I don't think I've tested the current code with CUDA 12, but I think that's what we should be targeting, with a recent 5xx-series driver.

I recently reworked the installer to use, on bookworm and later, the stock dkms from non-free packages and to sign drivers using the MOK. That requires that the MOK x509 cert be inserted into the EFI header of the block device. I'll be writing it up with some example code shortly.

I will try to set it up to do CUDA 12 on a 5xx-series kernel module, but I haven't tested it yet. On 2.2 we should be able to use the one from Debian stable non-free with dkms to install the current open module.
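
For readers following the MOK approach, here is a minimal sketch of the key generation and signing steps, assuming standard openssl, mokutil, and kernel sign-file tooling. The file names, certificate subject, and sign-file path are illustrative, and the Dataproc-specific step of embedding the cert in the boot disk's EFI header is not shown:

# generate a module-signing key pair and a self-signed x509 cert in DER form
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=dataproc-gpu-module-signing/" \
  -keyout MOK.priv -outform DER -out MOK.der

# enroll the cert in the Machine Owner Key database
# (on a stock machine this completes at the next boot via the shim)
mokutil --import MOK.der

# sign the built nvidia module with the enrolled key
# (the sign-file location varies by kernel and distro)
/usr/lib/linux-kbuild-5.10/scripts/sign-file sha256 ./MOK.priv ./MOK.der nvidia.ko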

santhoshvly commented 5 months ago

@cjac Okay, thank you! I tried the latest CUDA version (12.0) and the corresponding driver version using the latest install_gpu_driver.sh script from this repo, but got the same error. So it looks like we can't attach any GPUs to Dataproc 2.1/2.2 until this is fixed. Please let me know if there are any other workarounds.

cjac commented 5 months ago

I'm seeing a gcc error when trying to link GPL-incompatible code into kernel modules for all variants available on Debian 11; Debian 12 offers open driver support, so I will start there tomorrow.

cjac commented 5 months ago

The latest Dataproc image that works with the .run file is 2.1.46-debian11.

I am pushing a new change to the installer script. Please see #1190 for something that has been tested to work with images up to and including 2.1.46-debian11.

I am also working on bookworm (2.2-debian12) support for installation using apt-get.
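
For anyone pinning images in the meantime, cluster creation with the init action looks roughly like this. This is a sketch, not the repo's documented invocation: the cluster name, region, and accelerator type are placeholders, and secure boot is disabled as discussed above:

REGION=us-central1
gcloud dataproc clusters create my-gpu-cluster \
  --region "${REGION}" \
  --image-version 2.1.46-debian11 \
  --worker-accelerator type=nvidia-tesla-t4,count=1 \
  --no-shielded-secure-boot \
  --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"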

cjac commented 5 months ago

I'm also working on something in the custom-images repo. I've got an in-progress PR open over there:

https://github.com/GoogleCloudDataproc/custom-images/pull/83

santhoshvly commented 5 months ago

@cjac Okay, thank you for the update. We are unable to use GPUs with the latest 2.1/2.2 images until we get the fixed install_gpu_driver.sh. We always use the latest 2.1 image to launch the Dataproc cluster. Will this script change help attach the GPU to the latest 2.1 Debian 11 image (currently 2.1.53-debian11), or can we only use versions prior to and including 2.1.46-debian11?

santhoshvly commented 5 months ago

We have been running data pipelines using the latest Dataproc 2.1 images with GPUs attached, and they have been breaking for some time. However, the documentation does not mention this issue: https://cloud.google.com/dataproc/docs/concepts/compute/gpus. This makes Dataproc GPU clusters seem very unreliable if they can break at any time.

cjac commented 5 months ago

Yes, I agree. I'm doing some work internally to build and distribute the kernel drivers with the stock image. I hope to have the change reviewed and published this quarter.

You are correct that the initialization-actions script will presently only work with those versions mentioned. I will do some work today to see if I can build drivers from bullseye-backports.

cjac commented 5 months ago

I've had some luck building from the open source GitHub repo on the latest 2.1 images; I'm integrating these changes into the open PR now.
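
For context, building the open kernel modules from source follows roughly this shape. A sketch assuming NVIDIA's open-gpu-kernel-modules repository, with the tag matched to the userspace driver version in use:

# fetch the open kernel module source at a tag matching the userspace driver
git clone --branch 550.54.14 --depth 1 \
  https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules

# build against the running kernel's headers and install the resulting modules
make modules -j"$(nproc)"
sudo make modules_install
sudo depmod -a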

cjac commented 5 months ago

The update is working on the latest 2.1 image.

Thu Jun 20 18:47:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   76C    P0             37W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

santhoshvly commented 5 months ago

Okay, cool. Thank you so much! So we should be able to attach the GPU to the latest 2.1 once you merge this PR, https://github.com/GoogleCloudDataproc/initialization-actions/pull/1190. Is that correct?

cjac commented 5 months ago

The update is also working on 2.0 images:

Thu Jun 20 19:08:20 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   62C    P0             32W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

cjac commented 5 months ago

> Okay, cool. Thank you so much! So we should be able to attach the GPU to the latest 2.1 once you merge this PR, https://github.com/GoogleCloudDataproc/initialization-actions/pull/1190. Is that correct?

Yes, #1190 is correct.

santhoshvly commented 2 months ago

@cjac I am facing the following error while attaching a GPU to a Dataproc 2.2 cluster:

The following packages have unmet dependencies:
 systemd : Depends: libsystemd0 (= 252.26-1~deb12u2)
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.

cjac commented 2 months ago

TL;DR: Debian started enforcing the deprecation of apt-key add; the repo signing key must be moved to its own file and referenced by path in the sources.list file.

I am working on a fix. You can find a workaround at the end of install_gpu_driver.sh in my rapids work branch:

https://github.com/cjac/initialization-actions/blob/e43a1eaa402dc8a81aa8853cafb32e906f72f80f/gpu/install_gpu_driver.sh#L1077
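
The shape of that workaround, sketched here for a debian12-based image. The exact NVIDIA repo URL and key file name are assumptions; check the branch linked above for the authoritative version:

# fetch the repo signing key into its own keyring file
# instead of piping it through the deprecated `apt-key add`
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
  | gpg --dearmor -o /usr/share/keyrings/cuda-archive-keyring.gpg

# reference the keyring by path from the sources.list entry
echo "deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /" \
  > /etc/apt/sources.list.d/cuda.list

# rebuild the package cache so apt picks up the new trust configuration
apt-get update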

santhoshvly commented 2 months ago

@cjac Okay, thank you. I will try this workaround.

cjac commented 2 months ago

You can likely use that whole file if extracting the function is too complicated.
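
In practice that means staging the whole patched script in a bucket you control and pointing new clusters at that copy. A sketch in which the bucket name and paths are placeholders:

# stage the patched script from the work branch in your own bucket
gsutil cp gpu/install_gpu_driver.sh gs://my-bucket/actions/install_gpu_driver.sh

# create the cluster using the staged copy instead of the released script
gcloud dataproc clusters create my-gpu-cluster \
  --region us-central1 \
  --image-version 2.2-debian12 \
  --worker-accelerator type=nvidia-l4,count=1 \
  --initialization-actions gs://my-bucket/actions/install_gpu_driver.sh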

santhoshvly commented 2 months ago

@cjac Okay, thanks! We have disabled secure boot in Dataproc. Is that okay, or should we enable it?

santhoshvly commented 2 months ago

@cjac I tried the workaround script you mentioned, but it is still breaking with a similar error in Dataproc 2.2:

-----END PGP PUBLIC KEY BLOCK-----'

cjac commented 2 months ago

I didn't explicitly recommend that you run apt-get update after you fix the trust database. You'll still get the errors until you run apt-get update to rebuild the package cache. I'll encode that into the workaround.

cjac commented 2 months ago

The package cache update command is included in #1240 as commit 234515d.

santhoshvly commented 2 months ago

@cjac I tried with the package cache update, but I am getting the same error:

cAZUlaj3id3TxquAlud4lWDz =h5nH -----END PGP PUBLIC KEY BLOCK-----'