GoogleCloudDataproc / initialization-actions

Runs on all nodes of your cluster before the cluster starts, letting you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

[gpu] Driver installation breaking in Dataproc 2.1 image during initialization #1189

Open santhoshvly opened 1 month ago

santhoshvly commented 1 month ago

Hi Team,

I was able to attach GPUs to a Dataproc 2.1 cluster, and it worked fine after disabling Secure Boot. I am using the latest install_gpu_driver.sh from this repository, but now I am getting the following error during cluster initialization:

++ lsb_release -is
++ tr '[:upper:]' '[:lower:]'

ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.

ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.

ERROR: The nvidia kernel module was not created.

ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

I can also see the following error in /var/log/nvidia-installer.log on one of the cluster machines.

ld -r -o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nv-modeset-interface.o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nvidia-modeset-linux.o /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset/nv-kthread-q.o
  LD [M]  /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-modeset.o
  LD [M]  /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/nvidia-peermem.o
  MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2

-> Checking to see whether the nvidia kernel module was successfully built
   executing: 'cd ./kernel; /opt/conda/default/bin/make -k -j8 NV_KERNEL_MODULES="nvidia" NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/5.10.0-30-cloud-amd64/source" SYSOUT="/lib/modules/5.10.0-30-cloud-amd64/build"'...
make[1]: Entering directory '/usr/src/linux-headers-5.10.0-30-common'
make[2]: Entering directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
scripts/Makefile.lib:8: 'always' is deprecated. Please use 'always-y' instead
  MODPOST /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers
FATAL: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
make[3]: [/usr/src/linux-headers-5.10.0-30-common/scripts/Makefile.modpost:123: /tmp/selfgz13916/NVIDIA-Linux-x86_64-495.29.05/kernel/Module.symvers] Error 1
make[3]: Target '__modpost' not remade because of errors.
make[2]: [/usr/src/linux-headers-5.10.0-30-common/Makefile:1783: modules] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.10.0-30-cloud-amd64'
make[1]: [Makefile:192: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.10.0-30-common'
make: [Makefile:80: modules] Error 2
-> Error.
ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.
-> The command 'cd ./kernel; /opt/conda/default/bin/make -k -j8 NV_KERNEL_MODULES="nvidia" NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/5.10.0-30-cloud-amd64/source" SYSOUT="/lib/modules/5.10.0-30-cloud-amd64/build"' failed, with output identical to the run shown above.
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Is anyone facing a similar issue with driver installation on Dataproc 2.1/2.2 clusters?
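For anyone triaging this, a few checks worth running on a failing node before digging into the installer log. This is a hedged sketch assuming a Debian-based Dataproc node; mokutil may need to be installed first (apt-get install mokutil):

```shell
# Quick triage on a node where the NVIDIA .run installer failed.
mokutil --sb-state                          # Secure Boot should report "disabled"
uname -r                                    # kernel the module must be built against
dpkg -l "linux-headers-$(uname -r)"         # headers must match the running kernel
tail -n 50 /var/log/nvidia-installer.log    # the actual build failure
```

A mismatch between the running kernel and the installed headers, or Secure Boot still enabled, produces failures that look superficially like the one above.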

cjac commented 1 month ago

I could change the default driver and cuda versions on 2.1 images to be more current.

santhoshvly commented 1 month ago

@cjac Thank you! Is there a specific CUDA and driver version to try as a workaround to get past this error in 2.1 images now?

cjac commented 1 month ago

I don't think I've tested the current code with CUDA 12, but I think that's what we should be targeting, with a recent 5xx-series driver.

I recently reworked the installer to use, on bookworm and later, the stock dkms from non-free packages and to sign drivers using the MOK. That requires that the MOK x509 cert be inserted into the EFI header of the block device. I'll be writing it up with some example code shortly.
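The MOK signing flow described above might look roughly like this. Paths and the certificate CN are illustrative, and the dkms framework.conf variables are an assumption based on dkms 3.x behavior on recent Debian releases:

```shell
# Sketch of a MOK-based dkms signing setup (paths and CN are illustrative).
# 1. Generate a machine-owner-key pair.
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=dataproc-dkms-mok/" \
  -keyout /var/lib/dkms/mok.key -out /var/lib/dkms/mok.der -outform DER

# 2. Queue the public cert for enrollment; mokutil prompts for a one-time
#    password that must be re-entered in the firmware menu on next boot.
mokutil --import /var/lib/dkms/mok.der

# 3. Point dkms at the key so rebuilt modules are signed automatically
#    (mok_signing_key/mok_certificate are honored by dkms 3.x).
echo 'mok_signing_key="/var/lib/dkms/mok.key"' >> /etc/dkms/framework.conf
echo 'mok_certificate="/var/lib/dkms/mok.der"' >> /etc/dkms/framework.conf
```

On a Shielded VM the enrollment step is the awkward part, which is presumably why the cert has to be baked into the boot path rather than enrolled interactively.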

I will try to set it up to do cuda 12 on a 5xx series kernel module, but I haven't tested it yet. In 2.2 we should be able to use the one from Debian stable non-free with dkms to install the current open module.
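On a 2.2 (Debian 12/bookworm) image, the apt-based path would presumably look something like the following. Package names are from Debian's non-free and non-free-firmware sections and are an assumption here, not something confirmed in this thread:

```shell
# Hedged sketch: install the NVIDIA driver via dkms from Debian non-free
# on a bookworm-based image.
cat > /etc/apt/sources.list.d/nonfree.list <<'EOF'
deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware
EOF
apt-get update
# nvidia-driver pulls in nvidia-kernel-dkms, which builds the module
# against the installed headers for the running kernel.
apt-get install -y "linux-headers-$(uname -r)" nvidia-driver firmware-misc-nonfree
```

The appeal of this route is that dkms rebuilds (and, with the MOK setup above in place, re-signs) the module automatically on kernel upgrades.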

santhoshvly commented 4 weeks ago

@cjac Okay, thank you! I tried the latest CUDA version (12.0) and the corresponding driver version using the latest install_gpu_driver.sh script from this repo, but got the same error. So it looks like we can't attach any GPUs to Dataproc 2.1/2.2 until this is fixed. Please let me know if there are any other workarounds.

cjac commented 4 weeks ago

I'm seeing a gcc error when trying to link GPL-incompatible code into kernel modules for all variants available on Debian 11; Debian 12 offers open-driver support, so I will start there tomorrow.

cjac commented 3 weeks ago

The latest Dataproc image that works with the .run file is 2.1.46-debian11.

I am pushing a new change to the installer script. Please see #1190 for something that has been tested to work with images up to and including 2.1.46-debian11.
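Until that lands, pinning the image version instead of tracking the rolling 2.1 tag is one way to stay on a known-good combination. A hedged sketch; the region, cluster name, machine types, and accelerator are placeholders, and the regional init-actions bucket naming is the standard one for this repo:

```shell
# Stopgap: pin to the last image known to work with the .run installer.
REGION="us-central1"
gcloud dataproc clusters create gpu-pinned \
  --region "${REGION}" \
  --image-version 2.1.46-debian11 \
  --master-machine-type n1-standard-8 \
  --worker-machine-type n1-standard-8 \
  --worker-accelerator type=nvidia-tesla-t4,count=1 \
  --no-shielded-secure-boot \
  --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"
```

Pinning `--image-version` to the full three-component version avoids silently picking up a newer sub-minor image where the installer is known to break.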

I am also working on bookworm (2.2-debian12) support for installation using apt-get.

cjac commented 3 weeks ago

I'm also working on something in the custom-images repo. I've got an in-progress PR open over there:

https://github.com/GoogleCloudDataproc/custom-images/pull/83

santhoshvly commented 3 weeks ago

@cjac Okay, thank you for the update. We are unable to use GPUs with the latest 2.1/2.2 images until we get the fixed install_gpu_driver.sh. We always use the latest 2.1 image to launch the Dataproc cluster. Will this script change help attach GPUs to the latest 2.1 Debian 11 image (currently 2.1.53-debian11), or can we only use versions up to and including 2.1.46-debian11?

santhoshvly commented 3 weeks ago

We have been running data pipelines using the latest Dataproc 2.1 images with GPU attached, and they have been breaking for some time. However, the documentation does not mention this issue: https://cloud.google.com/dataproc/docs/concepts/compute/gpus. This makes Dataproc GPU clusters seem very unreliable if they can break at any time.

cjac commented 3 weeks ago

Yes, I agree. I'm doing some work internally to build and distribute the kernel drivers with the stock image. I hope to have the change reviewed and published this quarter.

You are correct that the initialization-actions script will presently only work with those versions mentioned. I will do some work today to see if I can build drivers from bullseye-backports.
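If the bullseye-backports route pans out, it would presumably look something like the following on a 2.1 node. This is a hypothetical sketch; the availability of a suitable nvidia-driver package in bullseye-backports non-free is an assumption, not something verified in this thread:

```shell
# Hypothetical: pull the NVIDIA driver stack from bullseye-backports.
echo 'deb http://deb.debian.org/debian bullseye-backports main contrib non-free' \
  > /etc/apt/sources.list.d/bullseye-backports.list
apt-get update
apt-get install -y -t bullseye-backports \
  nvidia-driver "linux-headers-$(uname -r)"
```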

cjac commented 3 weeks ago

I've had some luck building from the open-source GitHub repo on the latest 2.1 images; I'm integrating these changes into the open PR now.
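Building NVIDIA's open kernel modules from the GitHub repo (NVIDIA/open-gpu-kernel-modules) generally follows the repo's own README. A hedged sketch; the driver version is taken from the nvidia-smi output later in this thread, and the tag must match the userspace driver installed separately:

```shell
# Sketch: build and install the open NVIDIA kernel modules from source.
DRIVER_VERSION="550.54.14"
apt-get install -y git build-essential "linux-headers-$(uname -r)"
git clone --depth 1 --branch "${DRIVER_VERSION}" \
  https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
make modules -j"$(nproc)"
make modules_install
depmod -a
```

Note that the open modules only support Turing and newer GPUs, which is fine for the L4s shown below but rules this route out for older accelerators.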

cjac commented 3 weeks ago

The update is working on the latest 2.1 image.

Thu Jun 20 18:47:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   76C    P0             37W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

santhoshvly commented 3 weeks ago

Okay, cool. Thank you so much! So, we should be able to attach the GPU to the latest 2.1 image once you merge this PR, https://github.com/GoogleCloudDataproc/initialization-actions/pull/1190. Is that correct?

cjac commented 3 weeks ago

The update is also working on 2.0 images:

Thu Jun 20 19:08:20 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   62C    P0             32W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

cjac commented 3 weeks ago

> Okay, cool. Thank you so much! So, we should be able to attach the GPU to the latest 2.1 image once you merge this PR, https://github.com/GoogleCloudDataproc/initialization-actions/pull/1190. Is that correct?

#1190 is correct.