I could change the default driver and CUDA versions on the 2.1 images to be more current.
@cjac Thank you! Is there a specific CUDA and driver version to try as a workaround to get past this error in 2.1 images now?
I don't think I've tested the current code with CUDA 12, but I think that's what we should be targeting, together with a recent 5xx-series driver.
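As a rough illustration, a workaround attempt could look something like the cluster creation below. I haven't verified the exact metadata keys the current script honors, so treat the key names and the driver/CUDA pairing as placeholders rather than confirmed values:

# Illustrative only: metadata key names and driver/CUDA pairing are assumptions.
gcloud dataproc clusters create gpu-test-cluster \
  --region=us-central1 \
  --image-version=2.1-debian11 \
  --worker-machine-type=n1-standard-8 \
  --worker-accelerator=type=nvidia-tesla-t4,count=1 \
  --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/gpu/install_gpu_driver.sh \
  --metadata=cuda-version=12.0,gpu-driver-version=525.60.13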
I recently reimagined the installer to use, on bookworm and later, the stock dkms from non-free packages and to sign drivers using the MOK. That requires that the MOK x509 cert be inserted into the EFI header of the block device. I'll be writing it up with some example code shortly.
I will try to set it up to do CUDA 12 with a 5xx-series kernel module, but I haven't tested it yet. In 2.2 we should be able to use the one from Debian stable non-free with dkms to install the current open module.
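Roughly, the signing flow I have in mind looks something like the sketch below; the key paths, the linux-kbuild glob, and the module path are illustrative assumptions, not the installer's actual implementation:

# Sketch only: paths and details are assumptions, not the installer's final behaviour.

# Generate a Machine Owner Key pair once, at image-build time.
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=Dataproc GPU driver signing/" \
  -keyout /var/lib/dkms/mok.key \
  -out /var/lib/dkms/mok.der -outform DER

# Sign the dkms-built module so it can load under Secure Boot.
KVER="$(uname -r)"
/usr/lib/linux-kbuild-*/scripts/sign-file sha256 \
  /var/lib/dkms/mok.key /var/lib/dkms/mok.der \
  "/lib/modules/${KVER}/updates/dkms/nvidia.ko"

# Enroll the public cert. On a stock machine this is confirmed interactively
# at the next boot via mokutil; the approach described above instead
# pre-seeds the cert into the boot media.
mokutil --import /var/lib/dkms/mok.der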
@cjac Okay, thank you! I tried the latest CUDA version, 12.0, and the corresponding driver version using the latest install_gpu_driver.sh script from this repo, but got the same error. So it looks like we can't attach any GPUs to Dataproc 2.1/2.2 until this is fixed. Please let me know if there are any other workarounds.
I'm seeing a gcc error when trying to link GPL-incompatible code into kernel modules for all variants available on Debian 11; Debian 12 offers open driver support, so I will start there tomorrow.
The latest Dataproc image that works with the .run file is 2.1.46-debian11
I am pushing a new change to the installer script. Please see #1190 for something that has been tested to work with images prior to and including 2.1.46-debian11
I am also working on bookworm (2.2-debian12) support for installation using apt-get.
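The apt-get path on a 2.2/bookworm image will likely look roughly like the following; the package names are the stock Debian non-free ones and the sources.list edit assumes a simple one-line entry, so the final script may well differ:

# Sketch only: assumes a plain bookworm sources.list and the stock Debian packages.
sed -i -e 's/ main$/ main contrib non-free non-free-firmware/' /etc/apt/sources.list
apt-get update
apt-get install -y nvidia-driver nvidia-kernel-dkms firmware-misc-nonfree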
I'm also working on something in the custom-images repo. I've got an in-progress PR open over there:
https://github.com/GoogleCloudDataproc/custom-images/pull/83
@cjac Okay, thank you for the update. We are unable to use GPUs with the latest 2.1/2.2 images until we get the fixed install_gpu_driver.sh. We always use the latest 2.1 image to launch the Dataproc cluster. Will this script change help attach the GPU to the latest 2.1 Debian 11 image (currently 2.1.53-debian11), or can we only use versions prior to and including 2.1.46-debian11?
We have been running data pipelines using the latest Dataproc 2.1 images with GPU attached, and they have been breaking for some time. However, the documentation does not mention this issue: https://cloud.google.com/dataproc/docs/concepts/compute/gpus. This makes Dataproc GPU clusters seem very unreliable if they can break at any time.
Yes, I agree. I'm doing some work internally to build and distribute the kernel drivers with the stock image. I hope to have the change reviewed and published this quarter.
You are correct that the initialization-actions script will presently only work with those versions mentioned. I will do some work today to see if I can build drivers from bullseye-backports.
I've had some luck building from the open source GitHub repo on the latest 2.1 images; I'm integrating these changes into the open PR now.
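If you want to experiment before the PR lands, the build from NVIDIA's open-gpu-kernel-modules repo roughly follows the pattern below; the tag shown is just an example and must match the userspace driver version you install:

# Sketch of building the open kernel modules from source; tag is illustrative.
apt-get install -y build-essential linux-headers-"$(uname -r)" git
git clone --depth 1 --branch 550.54.14 \
  https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
make modules -j"$(nproc)"
make modules_install
depmod -a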
The update is working on the latest 2.1 image.
Thu Jun 20 18:47:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 76C P0 37W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Okay, cool. Thank you so much! So we should be able to attach the GPU to the latest 2.1 once you merge this PR, https://github.com/GoogleCloudDataproc/initialization-actions/pull/1190. Is that correct?
The update is also working on 2.0 images:
Thu Jun 20 19:08:20 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 62C P0 32W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
@cjac I am facing the following error while attaching a GPU to a Dataproc 2.2 cluster:
The following packages have unmet dependencies:
 systemd : Depends: libsystemd0 (= 252.26-1~deb12u2)
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
TL;DR: Debian started enforcing the deprecation of apt-key add; the repo signing key must be moved into its own file and referenced by path in the sources.list entry.
I am fixing it. You can find a workaround at the end of install_gpu_drivers.sh in my rapids work branch.
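The workaround boils down to the standard signed-by keyring pattern, roughly as below; the exact key URL and keyring filename in the branch may differ, so treat this as illustrative:

# Illustrative only: key URL and keyring filename are assumptions.
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
  | gpg --dearmor -o /usr/share/keyrings/cuda-archive-keyring.gpg
echo 'deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /' \
  > /etc/apt/sources.list.d/cuda.list
apt-get update   # rebuild the package cache so the new trust path takes effect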
@cjac Okay, thank you. I will try this workaround.
You can likely use that whole file if extracting the function is too complicated.
@cjac Okay, thanks! We have disabled Secure Boot in Dataproc. Is that okay, or should we enable it?
@cjac I tried the workaround script you mentioned, but it is still breaking with a similar error in Dataproc 2.2:
-----END PGP PUBLIC KEY BLOCK-----'
I didn't explicitly recommend that you run apt-get update after you fix the trust database. You'll still get the errors until you run apt-get update to rebuild the package cache. I'll encode that into the workaround.
The package cache update command is included in #1240 as commit 234515d.
@cjac I tried with the package cache update, but I am getting the same error:
cAZUlaj3id3TxquAlud4lWDz =h5nH -----END PGP PUBLIC KEY BLOCK-----'
Hi Team,
I was able to attach GPUs to a Dataproc 2.1 cluster, and it was working fine after disabling Secure Boot. I am using the latest install_gpu_driver.sh from this repository. But I am now getting the following error during cluster initialization:
++ lsb_release -is
++ tr '[:upper:]' '[:lower:]'
ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.
ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
I can also see the following error in /var/log/nvidia-installer.log on one of the cluster machines:
ld -r -o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nv-modeset-interface.o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nvidia-modeset-linux.o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nv-kthread-q.o
LD [M] /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset.o
LD [M] /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-peermem.o
MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2
-> Checking to see whether the nvidia kernel module was successfully built
executing: 'cd ./kernel; /opt/conda/default/bin/make -k -j8 NV_KERNEL_MODULES="nvidia" NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/5.10.0-30-cloud-amd64/source" SYSOUT="/lib/modules/5.10.0-30-cloud-amd64/build"'...
make[1]: Entering directory '/usr/src/linux-headers-5.10.0-30-common'
make[2]: Entering directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
scripts/Makefile.lib:8: 'always' is deprecated. Please use 'always-y' instead
MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2
-> Error.
ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.
-> The command
cd ./kernel; /opt/conda/default/bin/make -k -j8 NV_KERNEL_MODULES="nvidia" NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/5.10.0-30-cloud-amd64/source" SYSOUT="/lib/modules/5.10.0-30-cloud-amd64/build"
failed with the following output:
make[1]: Entering directory '/usr/src/linux-headers-5.10.0-30-common'
make[2]: Entering directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
scripts/Makefile.lib:8: 'always' is deprecated. Please use 'always-y' instead
MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Is anyone facing a similar issue with driver installation in Dataproc 2.1/2.2 clusters?