NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

nvidia-open-560.28.03 gives assertion error in dmesg with 10 RTX 4500 GPUs #694

Open QuesarVII opened 2 months ago

QuesarVII commented 2 months ago

NVIDIA Open GPU Kernel Modules Version

560.28.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 22.04

Kernel Release

6.8.0-40-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

RTX 4500 Ada - quantity 10

Describe the bug

[ 40.046929] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.046960] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.227147] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.227179] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.441196] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.441227] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.542925] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.542955] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.840865] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.840896] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.972728] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.972759] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.157421] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.157459] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.356296] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.356326] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.557332] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.557362] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.724534] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.724565] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533

To Reproduce

We are using a Supermicro 4125gs-tnrt with (10) RTX 4500 Ada GPUs. The errors above appear in dmesg during driver initialization at system boot.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

The Ubuntu cuda and cuda-toolkit-12-6 packages from the repo depend strictly on the nvidia-open packages rather than accepting either nvidia-driver-560 or nvidia-open-560. To test, I worked around this by creating dummy nvidia-open and nvidia-open-560 packages so I could install the closed driver package instead, and that driver works without error.

mtijanic commented 2 months ago

Hey there! Thanks for the report, great find! Here's a fix:

diff --git a/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c b/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
index 23c484c8..a98a7353 100644
--- a/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
+++ b/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
@@ -528,9 +528,9 @@ gpumgrGetSliLinks(NV0000_CTRL_GPU_GET_VIDEO_LINKS_PARAMS *pVideoLinksParams)
     while ((pGpu = gpumgrGetNextGpu(gpuAttachMask, &gpuIndex)) &&
            (i < NV0000_CTRL_GPU_MAX_ATTACHED_GPUS))
     {
-        if (pGpu->gpuInstance >= NV2080_MAX_SUBDEVICES)
+        if (pGpu->gpuInstance >= NV_MAX_DEVICES)
         {
-            NV_ASSERT(pGpu->gpuInstance < NV2080_MAX_SUBDEVICES);
+            NV_ASSERT(pGpu->gpuInstance < NV_MAX_DEVICES);
             continue;
         }

@@ -542,7 +542,7 @@ gpumgrGetSliLinks(NV0000_CTRL_GPU_GET_VIDEO_LINKS_PARAMS *pVideoLinksParams)
                (j < NV0000_CTRL_GPU_MAX_VIDEO_LINKS))
         {
             if ((peerGpuIndex == gpuIndex) ||
-                (pPeerGpu->gpuInstance >= NV2080_MAX_SUBDEVICES))
+                (pPeerGpu->gpuInstance >= NV_MAX_DEVICES))
             {
                 continue;
             }

NV2080_MAX_SUBDEVICES is the maximum number of subdevices in a single SLI group, which is not relevant here. NV_MAX_DEVICES is the maximum number of GPUs in the system, which is what gpuInstance represents.
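
For context, here is a minimal standalone C sketch (not driver code) of the old and new guards. The constant values are illustrative assumptions (8 for NV2080_MAX_SUBDEVICES, 32 for NV_MAX_DEVICES); with ten attached GPUs, instances 8 and 9 trip the old check even though they are valid system-wide, consistent with the repeated asserts in the dmesg output above.

/*
 * Standalone illustration only; the real limits come from the RM headers.
 * It mimics the guard in gpumgrGetSliLinks() with assumed constant values.
 */
#include <stdio.h>

#define NV2080_MAX_SUBDEVICES 8   /* assumed: max GPUs in one SLI group */
#define NV_MAX_DEVICES        32  /* assumed: max GPUs in the system    */

int main(void)
{
    const int attachedGpus = 10;  /* e.g. ten RTX 4500 Ada boards */

    for (int gpuInstance = 0; gpuInstance < attachedGpus; gpuInstance++)
    {
        /* Old guard: fires for instances 8 and 9 even though they are valid. */
        if (gpuInstance >= NV2080_MAX_SUBDEVICES)
            printf("instance %d: old check asserts (>= %d)\n",
                   gpuInstance, NV2080_MAX_SUBDEVICES);

        /* New guard: only rejects instances beyond the system-wide limit. */
        if (gpuInstance >= NV_MAX_DEVICES)
            printf("instance %d: new check asserts (>= %d)\n",
                   gpuInstance, NV_MAX_DEVICES);
    }
    return 0;
}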

We'll include this fix in a future release.

As far as I can tell, this only has a minor impact on OpenGL display output when multiple GPUs have displays connected to them. It should be entirely irrelevant for CUDA workloads. Except for the dmesg spam, obviously.

NV bug reference: 4817640

mtijanic commented 2 months ago

> I worked around that to test using dummy nvidia-open and nvidia-open-560 packages so I could install the closed driver package instead, and that driver works without error.

By the way, the error is present there as well; it just isn't routed to dmesg, so it isn't visible to the end user.

QuesarVII commented 2 months ago

This patch resolved those errors. Reading the code, it looked like it should have been testing against the max devices limit rather than max subdevices, but I wasn't sure. Thanks for the fix!