NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Tons of kernel errors when enabling MIG mode on A30 #427

Open fmoessbauer opened 1 year ago

fmoessbauer commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

525.60.13

Does this happen with the proprietary driver (of the same version) as well?

No

Operating System and Version

Debian GNU/Linux bookworm/sid

Kernel Release

Linux gputest 6.0.0-5-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.10-1 (2022-11-26) x86_64 GNU/Linux

Hardware: GPU

GPU 0: NVIDIA A30 (UUID: GPU-<...>)
  MIG 4g.24gb Device 0: (UUID: MIG-<...>)

Describe the bug

After enabling MIG mode and partitioning the GPU, every access via CUDA or nvidia-smi causes the nvidia kernel module to write hundreds of error messages to the journal.

Dec 12 03:11:53 localhost kernel: nvidia: loading out-of-tree module taints kernel.
Dec 12 03:11:53 localhost kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Dec 12 03:11:53 localhost kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  525.60.13  Release Build  (builder@297510dd0cf4)  Wed Dec  7 10:50:55 UTC 2022

After enabling MIG mode and partitioning the GPU as 4g.24gb (the choice of profile or partition makes no difference):

Dec 12 03:12:06 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257
Dec 12 03:12:08 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM _kmigmgrHandlePreSchedulingDisableCallback: Invalidating valid gpu instance with swizzId = 0
Dec 12 03:12:09 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257
Dec 12 03:12:11 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM _kmigmgrHandlePreSchedulingDisableCallback: Invalidating valid gpu instance with swizzId = 0
Dec 12 03:12:19 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257
Dec 12 03:12:19 gputest kernel: NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from gisubscriptionGetGPUInstanceSubscription(pRsClient, RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) @ kernel_mig_manager.c:2257NVRM _kmigmgrHandlePreSchedulingDisableCallback: Invalidating valid gpu instance with swizzId = 0
[...]
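Several NVRM messages in the excerpt above are fused onto a single journal line with no newline between them, which makes them hard to count with plain line-based tools. The following is a minimal sketch for splitting and tallying them. It assumes every message starts with "NVRM <function name>: ", as in the excerpt; the helper names (split_nvrm_messages, count_by_source) are hypothetical and not part of any NVIDIA tool. The sample line is rebuilt from the 03:12:08 entry above.

```python
import re
from collections import Counter

# Assumption: each message begins with "NVRM <function name>: ", as in the excerpt.
MSG_START = re.compile(r"(?=NVRM \w+: )")   # zero-width split point before each message
SOURCE = re.compile(r"NVRM (\w+): ")        # captures the reporting function name


def split_nvrm_messages(journal_line):
    """Split one journal line into the individual NVRM messages fused onto it."""
    _, sep, payload = journal_line.partition("kernel: ")
    if not sep:
        payload = journal_line  # no journal prefix; treat the whole line as payload
    return [part for part in MSG_START.split(payload) if part]


def count_by_source(journal_lines):
    """Count messages per reporting function across several journal lines."""
    counts = Counter()
    for line in journal_lines:
        for message in split_nvrm_messages(line):
            match = SOURCE.match(message)
            if match:
                counts[match.group(1)] += 1
    return counts


CHECK_MSG = (
    "NVRM nvCheckOkFailedNoLog: Check failed: Requested object not found "
    "[NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from "
    "gisubscriptionGetGPUInstanceSubscription(pRsClient, "
    "RES_GET_HANDLE(pSubdevice), &pGPUInstanceSubscription) "
    "@ kernel_mig_manager.c:2257"
)
INVALIDATE_MSG = (
    "NVRM _kmigmgrHandlePreSchedulingDisableCallback: "
    "Invalidating valid gpu instance with swizzId = 0"
)
sample = "Dec 12 03:12:08 gputest kernel: " + CHECK_MSG + INVALIDATE_MSG

for name, count in count_by_source([sample]).items():
    print(name, count)
```

Feeding the full journal output through count_by_source would show how the messages divide between the failed subscription lookup and the instance invalidation callback.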

Interestingly, CUDA applications (e.g. deviceQuery) can still use the GPU, but the errors keep being written to the journal continuously (dozens per second).

To Reproduce

  1. Start the system with a non-partitioned GPU (an A30 in our case)
  2. Run /usr/bin/nvidia-smi mig -cgi 0
  3. Check the journal for the error messages shown above
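Steps 2 and 3 above could be scripted roughly as follows. This is only a sketch under assumptions: MIG mode is already enabled, the script runs with enough privileges to create GPU instances and read the kernel journal, and journalctl is available. The function names and the choice of NV_ERR_OBJECT_NOT_FOUND as the search marker are illustrative. Nothing is executed until reproduce() is called.

```python
import subprocess
from datetime import datetime

# Assumption: this error code, taken from the log excerpt, identifies the messages.
ERROR_MARKER = "NV_ERR_OBJECT_NOT_FOUND"


def count_marker(journal_text, marker=ERROR_MARKER):
    """Count occurrences of the marker, including several on one fused line."""
    return journal_text.count(marker)


def kernel_journal_since(start):
    """Return kernel messages logged since `start` (a datetime) as text."""
    cmd = [
        "journalctl", "-k", "--no-pager",
        "--since", start.strftime("%Y-%m-%d %H:%M:%S"),
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def reproduce():
    """Create the GPU instance from step 2, then count the resulting errors."""
    start = datetime.now()
    subprocess.run(["/usr/bin/nvidia-smi", "mig", "-cgi", "0"], check=True)
    return count_marker(kernel_journal_since(start))
```

Calling reproduce() on an affected system should return a non-zero count if the errors from this report appear; counting substring occurrences rather than lines accounts for several messages sharing one journal line.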

Bug Incidence

Always

nvidia-bug-report.log.gz

Not possible. The log contains too much sensitive data. If required, we can share it privately.

More Info

No response

SteavenGamerYT commented 3 months ago

same issue