Open QuesarVII opened 2 months ago
Hey there! Thanks for the report, great find! Here's a fix:
diff --git a/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c b/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
index 23c484c8..a98a7353 100644
--- a/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
+++ b/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
@@ -528,9 +528,9 @@ gpumgrGetSliLinks(NV0000_CTRL_GPU_GET_VIDEO_LINKS_PARAMS *pVideoLinksParams)
while ((pGpu = gpumgrGetNextGpu(gpuAttachMask, &gpuIndex)) &&
(i < NV0000_CTRL_GPU_MAX_ATTACHED_GPUS))
{
- if (pGpu->gpuInstance >= NV2080_MAX_SUBDEVICES)
+ if (pGpu->gpuInstance >= NV_MAX_DEVICES)
{
- NV_ASSERT(pGpu->gpuInstance < NV2080_MAX_SUBDEVICES);
+ NV_ASSERT(pGpu->gpuInstance < NV_MAX_DEVICES);
continue;
}
@@ -542,7 +542,7 @@ gpumgrGetSliLinks(NV0000_CTRL_GPU_GET_VIDEO_LINKS_PARAMS *pVideoLinksParams)
(j < NV0000_CTRL_GPU_MAX_VIDEO_LINKS))
{
if ((peerGpuIndex == gpuIndex) ||
- (pPeerGpu->gpuInstance >= NV2080_MAX_SUBDEVICES))
+ (pPeerGpu->gpuInstance >= NV_MAX_DEVICES))
{
continue;
}
NV2080_MAX_SUBDEVICES is the maximum number of subdevices in a single SLI group, which is not relevant here. NV_MAX_DEVICES is the maximum number of GPUs in the system, which is what gpuInstance represents.
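To illustrate why the old bound trips on a 10-GPU machine, here is a minimal sketch. The numeric limits and the `SKETCH_*` names are assumptions made for illustration, not values read from the driver headers, and the sketch also assumes GPU instances are numbered consecutively from 0.

```c
#include <assert.h>

/* Illustrative stand-ins only: assumed values, not taken from the driver headers. */
#define SKETCH_MAX_SUBDEVICES 8u   /* hypothetical stand-in for NV2080_MAX_SUBDEVICES */
#define SKETCH_MAX_DEVICES    32u  /* hypothetical stand-in for NV_MAX_DEVICES */

/*
 * Count how many GPU instances in [0, gpu_count) would be skipped by a
 * check of the form `gpuInstance >= limit`, as in the loop patched above.
 * Assumes instances are numbered 0 .. gpu_count - 1.
 */
static unsigned count_rejected(unsigned gpu_count, unsigned limit)
{
    unsigned rejected = 0;

    assert(limit > 0);
    for (unsigned instance = 0; instance < gpu_count; ++instance) {
        if (instance >= limit) {
            ++rejected; /* this GPU would hit the assert and be skipped */
        }
    }
    return rejected;
}
```

Under these assumed limits, `count_rejected(10, SKETCH_MAX_SUBDEVICES)` reports two rejected instances, while `count_rejected(10, SKETCH_MAX_DEVICES)` reports none, which matches the assertions disappearing once the check uses the per-system limit.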
We'll include this fix in a future release.
As far as I can tell, this only has a minor impact on OpenGL display output when multiple GPUs have displays connected to them. It should be entirely irrelevant for CUDA workloads, apart from the dmesg spam.
NV bug reference: 4817640
I worked around that to test using dummy nvidia-open and nvidia-open-560 packages so I could install the closed driver package instead, and that driver works without error.
By the way, the error is present in the closed driver as well; it's just not routed to dmesg, so it isn't visible to end users.
This patch resolved those errors. From reading the code, it looked like the check should have been against max devices rather than max subdevices, but I wasn't sure. Thanks for the fix!
NVIDIA Open GPU Kernel Modules Version
560.28.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 22.04
Kernel Release
6.8.0-40-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
RTX 4500 Ada - quantity 10
Describe the bug
[ 40.046929] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.046960] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.227147] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.227179] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.441196] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.441227] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.542925] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.542955] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.840865] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.840896] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.972728] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.972759] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.157421] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.157459] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.356296] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.356326] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.557332] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.557362] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.724534] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.724565] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
To Reproduce
We are using a Supermicro 4125gs-tnrt with (10) RTX 4500 Ada GPUs. The provided errors from dmesg occur during initialization upon boot of the system.
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
The Ubuntu cuda and cuda-toolkit-12-6 packages from the repo require the nvidia-open packages, rather than accepting either nvidia-driver-560 or nvidia-open-560 as the dependency. I worked around that for testing by using dummy nvidia-open and nvidia-open-560 packages so I could install the closed driver package instead, and that driver works without error.