After enabling MIG mode and partitioning the GPU, every access via CUDA or nvidia-smi causes hundreds of error messages from the nvidia kernel module to appear in the journal.
Dec 12 03:11:53 localhost kernel: nvidia: loading out-of-tree module taints kernel.
Dec 12 03:11:53 localhost kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Dec 12 03:11:53 localhost kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 525.60.13 Release Build (builder@297510dd0cf4) Wed Dec 7 10:50:55 UTC 2022
Enable MIG mode and partition the GPU as 4g.24gb (the choice of profile or partition does not matter):
Dec 12 03:12:06 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257
Dec 12 03:12:08 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM _kmigmgrHandlePreSchedulingDisableCallback: Invalidating valid gpu instance with swizzId = 0
Dec 12 03:12:09 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257
Dec 12 03:12:11 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM _kmigmgrHandlePreSchedulingDisableCallback: Invalidating valid gpu instance with swizzId = 0
Dec 12 03:12:19 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257
Dec 12 03:12:19 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM _kmigmgrHandlePreSchedulingDisableCallback: Invalidating valid gpu instance with swizzId = 0
[...]
Interestingly, the GPU can still be used from CUDA applications (e.g. deviceQuery), but the errors continue to be written to the journal at a rate of dozens per second.
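To put a number on the message rate, the following is a minimal sketch that counts NV_ERR_OBJECT_NOT_FOUND occurrences per second in journal text read from stdin. The function name count_nvrm_errors is hypothetical, and the sketch assumes the default short journal format, where the first three fields are the timestamp. Because several NVRM messages can share one journal line, it counts occurrences rather than lines.

```shell
#!/bin/sh
# Hypothetical helper: count NV_ERR_OBJECT_NOT_FOUND occurrences per second.
# Reads journal text in the default "short" format from stdin.
count_nvrm_errors() {
    awk '
        {
            # Fields 1-3 form the timestamp, e.g. "Dec 12 03:12:06".
            ts = $1 " " $2 " " $3
            # gsub returns the number of matches; "&" puts each match back unchanged.
            n = gsub(/NV_ERR_OBJECT_NOT_FOUND/, "&")
            if (n > 0) {
                # Remember first-seen order so output follows the log order.
                if (!(ts in count)) order[++seen] = ts
                count[ts] += n
            }
        }
        END {
            for (i = 1; i <= seen; i++) print order[i], count[order[i]]
        }
    '
}
```

It could be used as `journalctl -k | count_nvrm_errors`, which prints one line per second that contained at least one occurrence.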
To Reproduce
Start the system with a non-partitioned GPU (A30 in our case).
Run /usr/bin/nvidia-smi mig -cgi 0.
Check the journal and see the error messages shown above.
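The steps above can be sketched as a small script. This is an illustration under assumptions, not the exact sequence used for this report: the helper names run and reproduce_mig_errors are hypothetical, GPU index 0 is assumed, and the `-mig 1` call for enabling MIG mode is not quoted in the report (it typically needs root and may need a GPU reset to take effect). Setting DRY_RUN prints the commands instead of executing them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the reproduction steps.
reproduce_mig_errors() {
    gpu="${1:-0}"   # GPU index; 0 is an assumption

    # Assumed step: enable MIG mode on the selected GPU.
    run /usr/bin/nvidia-smi -i "$gpu" -mig 1

    # Step from the report: create a GPU instance with profile ID 0.
    run /usr/bin/nvidia-smi mig -cgi 0

    # List the GPU instances to confirm the partition exists.
    run /usr/bin/nvidia-smi mig -lgi

    # Show kernel messages from the last minute to look for the NVRM errors.
    run journalctl -k --since=-1min --no-pager
}
```

Running `DRY_RUN=1 reproduce_mig_errors` first shows what would be executed without touching the GPU.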
Bug Incidence
Always
nvidia-bug-report.log.gz
Not possible. The log contains too much sensitive data. If required, we can share it privately.
NVIDIA Open GPU Kernel Modules Version
525.60.13
Does this happen with the proprietary driver (of the same version) as well?
No
Operating System and Version
Debian GNU/Linux bookworm/sid
Kernel Release
Linux gputest 6.0.0-5-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.10-1 (2022-11-26) x86_64 GNU/Linux
Hardware: GPU
GPU 0: NVIDIA A30 (UUID: GPU-<...>) MIG 4g.24gb Device 0: (UUID: MIG-<...>)
More Info
No response