NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock #583

Open FlyGoat opened 10 months ago

FlyGoat commented 10 months ago

NVIDIA Open GPU Kernel Modules Version

545.29.06

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Arch Linux

Kernel Release

6.6.7

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 4060 (10de:28e0)

Describe the bug

dmesg is spammed with:

 NVRM rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock

To Reproduce

Boot such a system and check dmesg.
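
To put a number on "spam", the warning lines in a captured kernel log can be counted. The following is only a minimal sketch: the function name `count_lock_warnings` is made up for illustration, and it assumes the log text has already been saved (for example from `dmesg`, which may need root depending on system settings).

```python
# Message text copied verbatim from the report above.
MESSAGE = "RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock"


def count_lock_warnings(dmesg_text: str) -> int:
    """Count the log lines that contain the RMAPI lock warning."""
    return sum(1 for line in dmesg_text.splitlines() if MESSAGE in line)


# Small illustrative sample; real input would be the saved dmesg output.
sample_log = (
    "[  854.764480] NVRM rmapiAllocWithSecInfo: " + MESSAGE + "\n"
    "[  854.764490] Call Trace:\n"
    "[  900.000001] NVRM rmapiAllocWithSecInfo: " + MESSAGE + "\n"
)
warning_count = count_lock_warnings(sample_log)
```

Comparing the count before and after a suspend/resume cycle would show whether the message is tied to power transitions on a given machine.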

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

I tried to debug the issue by adding an os_stack_trace() call right after the point where the message is printed, and got the following backtrace:

[  854.764480] NVRM rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock
[  854.764483] CPU: 15 PID: 5542 Comm: kworker/15:0 Tainted: G           OE      6.6.7-arch1-1 #1 4505c4baa0b3d7c4037b0e8f5402626fa360717f
[  854.764486] Hardware name: ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA402XV_GA402XV/GA402XV, BIOS GA402XV.313 08/10/2023
[  854.764487] Workqueue: pm pm_runtime_work
[  854.764490] Call Trace:
[  854.764491]  <TASK>
[  854.764492]  dump_stack_lvl+0x47/0x60
[  854.764496]  rmapiAllocWithSecInfo+0x306/0x410 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764574]  ? srso_alias_return_thunk+0x5/0x7f
[  854.764575]  ? __kmem_cache_alloc_node+0x1a6/0x340
[  854.764577]  ? os_alloc_mem+0xc8/0xe0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764631]  ? os_alloc_mem+0xc8/0xe0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764684]  ? srso_alias_return_thunk+0x5/0x7f
[  854.764686]  ? __kmalloc+0x50/0x150
[  854.764689]  rmapiAlloc+0x27/0x40 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764758]  memdescSendMemDescToGSP+0x171/0x2c0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764837]  ? memdescSendMemDescToGSP+0x120/0x2c0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764912]  fbsrCopyMemoryMemDesc_GM107+0x46a/0xe80 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.764988]  ? _issueRpcAndWait+0x3c/0x210 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765074]  _memmgrWalkHeap+0x156/0x680 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765146]  memmgrSavePowerMgmtState_KERNEL+0x18b/0x320 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765218]  gpuPowerManagementEnter.constprop.0+0x6a/0x2e0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765295]  ? srso_alias_return_thunk+0x5/0x7f
[  854.765298]  gpuEnterStandby_IMPL+0x109/0x280 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765369]  ? srso_alias_return_thunk+0x5/0x7f
[  854.765372]  RmPowerManagementInternal+0x113/0x1a0 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765451]  RmGcxPowerManagement+0x2fc/0x360 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765522]  ? rmGpuLocksAcquire+0xbb/0x130 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765601]  rm_transition_dynamic_power+0x83/0x122 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765671]  ? srso_alias_return_thunk+0x5/0x7f
[  854.765676]  nv_pmops_runtime_suspend+0x6f/0x100 [nvidia 17e5ed799529ff7ce72e190f3462b0a04291b9d9]
[  854.765726]  pci_pm_runtime_suspend+0x67/0x1e0
[  854.765728]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  854.765730]  __rpm_callback+0x41/0x170
[  854.765732]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  854.765734]  rpm_callback+0x5d/0x70
[  854.765736]  ? __pfx_pci_pm_runtime_suspend+0x10/0x10
[  854.765738]  rpm_suspend+0x120/0x6a0
[  854.765740]  ? __pfx_pci_pm_runtime_idle+0x10/0x10
[  854.765742]  pm_runtime_work+0x84/0xb0
[  854.765745]  process_one_work+0x171/0x340
[  854.765747]  worker_thread+0x27b/0x3a0
[  854.765749]  ? __pfx_worker_thread+0x10/0x10
[  854.765750]  kthread+0xe5/0x120
[  854.765752]  ? __pfx_kthread+0x10/0x10
[  854.765754]  ret_from_fork+0x31/0x50
[  854.765756]  ? __pfx_kthread+0x10/0x10
[  854.765758]  ret_from_fork_asm+0x1b/0x30
[  854.765762]  </TASK>

The backtrace is almost the same each time the message is printed.
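
For comparing backtraces between occurrences, it can help to reduce each one to its confirmed frames. In kernel stack dumps, entries prefixed with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain. The sketch below drops those and keeps only the function names; the helper name `reliable_frames` and the regular expression are my own assumptions about the log layout shown above, not anything provided by the driver.

```python
import re

# Matches a frame such as "[  854.764689]  rmapiAlloc+0x27/0x40 [nvidia ...]".
# Group 1 captures the optional "? " marker, group 2 the symbol name.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace: str) -> list[str]:
    """Return symbol names of frames that are not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


# A few lines copied from the backtrace above (module hash shortened).
sample_trace = """\
[  854.764490] Call Trace:
[  854.764491]  <TASK>
[  854.764492]  dump_stack_lvl+0x47/0x60
[  854.764574]  ? srso_alias_return_thunk+0x5/0x7f
[  854.764689]  rmapiAlloc+0x27/0x40 [nvidia]
[  854.765218]  gpuPowerManagementEnter.constprop.0+0x6a/0x2e0 [nvidia]
"""
frames = reliable_frames(sample_trace)
```

Two traces whose reduced frame lists are equal took the same path, which for the trace above runs from pm_runtime_work down to rmapiAllocWithSecInfo.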

relief-melone commented 10 months ago

I can confirm the issue on NixOS as well. It has been happening with earlier kernel versions on my system too. My current setup is:

Hardware: NVIDIA GeForce 3080Ti (mobile)
Version: 545.29.06
Kernel: 6.6.8

In my case, it mostly happens when waking from suspend or after the machine has sat unused for some time.

TZECHIN6 commented 9 months ago

Same here: the system just hangs after updating to 545 today.

Possible Fix

After the above error message was shown, I found this line at the very end: hdaudio hdaudioCOD2:unable to configure disabling. After searching online, I tried the fix below and was able to log back into Ubuntu normally.

# use recovery mode and enter a root shell
# run `nano /etc/default/grub` and make the change below

# original
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"

# to
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

# then run `update-grub` and `reboot`

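
After rebooting, one way to confirm that the change took effect is to check the running kernel's command line. This is a minimal sketch, not part of the fix itself: the helper names are hypothetical, and it assumes the standard Linux `/proc/cmdline` file is readable.

```python
def has_nomodeset(cmdline: str) -> bool:
    """Return True if 'nomodeset' appears as a separate kernel parameter."""
    return "nomodeset" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line of the running system."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Illustrative strings mirroring the two GRUB settings above.
before = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet splash nomodeset"
after = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet splash"
```

On a real system, `has_nomodeset(read_cmdline())` should return False once the new GRUB configuration has been applied.
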
FlyGoat commented 9 months ago

Nvidia folks, any chance we can get this fixed in the next release?

apolopena commented 8 months ago

This bug is not fun! At a minimum it's an indefinite delay, and at worst it's a crash on shutdown with no ability to open a tty and save the day.

This gets spammed in dmesg:

NVRM rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock

I also get this spam in dmesg:

[  122.596895] NVRM nbsiReadRegistryDword: osReadRegistryDword called in Sleep path can cause excessive delays!
[  122.596903] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ nbsi_osrg.c:107

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 545.23.08 Release Build (dvs-builder@U16-I3-A16-1-1)
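
Since driver versions matter when comparing reports in this thread, here is a small sketch for pulling the version number out of that line. The function name `parse_nvrm_version` and the regular expression are assumptions based only on the version strings quoted in this issue; other driver builds may format the line differently.

```python
import re
from typing import Optional

# Looks for "for <arch> <version>", e.g. "for x86_64 545.23.08".
VERSION_RE = re.compile(r"for\s+\S+\s+(\d+(?:\.\d+)+)")


def parse_nvrm_version(text: str) -> Optional[str]:
    """Extract the dotted driver version, or None if the pattern is absent."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


# Line copied from the output above.
version_line = (
    "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 545.23.08 "
    "Release Build (dvs-builder@U16-I3-A16-1-1)"
)
version = parse_nvrm_version(version_line)
```

The parsed value can then be compared against the version given in the original report (545.29.06) to check whether two reports come from the same driver build.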

mtijanic commented 8 months ago

Hi, thanks for the report! This is tracked internally as bug 4074148. Hard to say which release will get the fix due to schedules and release branching.

For whatever it's worth, the root cause of the print from that particular call stack (rm_power_management()) was found to be "harmless", except for the print spam. Any other issues you are seeing are likely to be independent and deserve a separate bug report.

scaronni commented 6 months ago

Still happening with 550.78.

[   15.007542] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.78  Release Build  (dvs-builder@U16-I1-N08-06-4)  Sun Apr 14 06:38:24 UTC 2024
[   15.302560] NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
[   36.803975] NVRM: rmapiAllocWithSecInfo: RMAPI_GPU_LOCK_INTERNAL alloc requested without holding the RMAPI lock