NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Suspend sometimes causes a crash when using the open 555.52.04 drivers #662

Open urbenlegend opened 5 months ago

urbenlegend commented 5 months ago

NVIDIA Open GPU Kernel Modules Version

555.52.04

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Arch Linux

Kernel Release

Linux arch-desktop 6.9.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 12 Jun 2024 20:17:17 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 3090

Describe the bug

Sometimes when I attempt to suspend-to-RAM, the machine fails to suspend and instead gets stuck on a black screen. I have to hard reset the machine to get it back. In the system logs, there is a crash call trace for the NVIDIA driver: suspend_hang.txt

To Reproduce

It happens rarely and randomly. I don't know exactly what causes it. Most of the time it suspends fine, but sometimes it crashes.

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

I've uploaded the generated bug report, but I am not sure if it includes the crash since I had to reboot before I could run the nvidia-bug-report.sh command. Doing a quick search in the log indicates that the crash trace is not in it. That is why I uploaded a separate suspend_hang.txt which does include the crash logs from the previous boot.
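
For anyone else who has to reboot before collecting logs, here is a minimal sketch of how the previous boot's kernel log could be recovered and checked for driver messages. It assumes a systemd journal with persistent storage; the function name `nvidia_log_lines` and the output file name are hypothetical.

```shell
#!/bin/sh
# Print only NVIDIA driver lines (NVRM, nvidia-modeset, nvidia-drm)
# from a kernel log read on stdin. The function name is illustrative.
nvidia_log_lines() {
    grep -E 'NVRM:|nvidia-modeset:|\[nvidia-drm\]'
}

# Usage (assumes the journal kept the previous boot):
#   journalctl -k -b -1 > previous-boot-kernel.log
#   nvidia_log_lines < previous-boot-kernel.log
```

The full `previous-boot-kernel.log` is the file worth attaching, since the call trace lines do not carry an NVRM prefix; the filter only confirms that the crash was captured.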

More Info

No response

aritger commented 5 months ago

Thank you for the report. I've filed NVIDIA internal bug 4706166 to track this.

If you're willing to rebuild the open kernel modules, could you please apply this patch, and then upload the system log after the problem reproduces again? Thanks!

$ cat 0001-instrumentation-for-suspend-crash.patch 
From 44afc9067af6df0671724e37b8f2c2cde7386590 Mon Sep 17 00:00:00 2001
From: Andy Ritger <aritger@nvidia.com>
Date: Mon, 17 Jun 2024 15:03:14 -0700
Subject: [PATCH] instrumentation for suspend crash
X-NVConfidentiality: public

---
 kernel-open/nvidia/nv.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel-open/nvidia/nv.c b/kernel-open/nvidia/nv.c
index 99792de96307..bc003399cd83 100644
--- a/kernel-open/nvidia/nv.c
+++ b/kernel-open/nvidia/nv.c
@@ -3111,7 +3111,8 @@ nv_map_guest_pages(nv_alloc_t *at,
     if (pages == NULL)
     {
         nv_printf(NV_DBG_ERRORS,
-                  "NVRM: failed to allocate vmap() page descriptor table!\n");
+                  "NVRM: failed to allocate vmap() page descriptor table! (page_count: %d)\n", page_count);
+        dump_stack();
         return 0;
     }

@@ -3604,7 +3605,8 @@ void* NV_API_CALL nv_alloc_kernel_mapping(
             if (pages == NULL)
             {
                 nv_printf(NV_DBG_ERRORS,
-                          "NVRM: failed to allocate vmap() page descriptor table!\n");
+                          "NVRM: failed to allocate vmap() page descriptor table! (page_count:%d)\n", page_count);
+                dump_stack();
                 return NULL;
             }

-- 
2.44.0

urbenlegend commented 4 months ago

Thanks for the patch. I am currently on the proprietary 555.58.02 module because I need to avoid the slowdowns in KDE caused by the GSP firmware, so I have not run into this sleep issue again. Once the GSP bug is resolved, I will switch to the open module again and apply the patch to see what's going on.
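
A rough sketch of applying the instrumentation patch and rebuilding could look like the following. It assumes that the release tag in the repository equals the driver version string, that the patch file from the earlier comment is saved in the current directory, and that the installed user-space driver has the same version as the modules being built. The `run` and `rebuild_with_patch` helpers are hypothetical names, not part of any NVIDIA tooling.

```shell
#!/bin/sh
# Assumed defaults: tag name equals the driver version, patch saved locally.
VERSION="${VERSION:-555.52.04}"
PATCH="${PATCH:-0001-instrumentation-for-suspend-crash.patch}"

# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clone the tagged release, apply the patch, then build and install the modules.
rebuild_with_patch() {
    run git clone --branch "$VERSION" --depth 1 \
        https://github.com/NVIDIA/open-gpu-kernel-modules.git || return 1
    # git -C changes directory first, so the patch path must be absolute.
    run git -C open-gpu-kernel-modules apply "$PWD/$PATCH" || return 1
    run make -C open-gpu-kernel-modules modules -j"$(nproc)" || return 1
    run sudo make -C open-gpu-kernel-modules modules_install -j"$(nproc)"
}
```

Running `DRY_RUN=1 rebuild_with_patch` first prints the four commands without touching the system. On distributions that package the driver through DKMS, the install step may need to be adapted.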

abfipes12 commented 4 months ago

I am on the proprietary NVIDIA 555.58.02-1 driver and I have the same problems as described here. I use Arch Linux with an NVIDIA GeForce RTX™ 3050 Laptop GPU.

I have tried:
linux 6.9.7 (or) linux-lts 6.6.37
NVreg_EnableS0ixPowerManagement (or) NVreg_PreserveVideoMemoryAllocations on /var/tmp (over 250GB space left)
nvidia_drm.modeset 0 (or) 1 as boot parameter
nvidia_drm.fbdev 0 (or) 1 as boot parameter
X11 (or) Xwayland

Nothing helped (except module_blacklist=nvidia).

Jul 06 03:35:17 archlinux kernel: NVRM: failed to allocate vmap() page descriptor table!
Jul 06 03:35:17 archlinux kernel: NVRM: GPU at PCI:0000:01:00: GPU-887a46df-29b2-be1c-8c55-e637117338ba
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) and data 0x000000002080205b 0x0000000000000004.
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Jul 06 03:35:17 archlinux kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Jul 06 03:35:17 archlinux kernel: NVRM:      0    76   GSP_RM_CONTROL        0x000000002080205b 0x0000000000000004 0x00061c8a2d4a3f2a 0x0000000000000000          y
Jul 06 03:35:17 archlinux kernel: NVRM:     -1    47   UNLOADING_GUEST_DRIVE 0x0000000000000000 0x0000000000000000 0x00061c8a2d32031f 0x00061c8a2d34fd27 195080us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -2    10   FREE                  0x00000000c1e016c0 0x0000000000000000 0x00061c8a2d320088 0x00061c8a2d3202fc    628us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -3    10   FREE                  0x000000000000000a 0x0000000000000000 0x00061c8a2d31fa32 0x00061c8a2d320087   1621us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -4    10   FREE                  0x000000000000000b 0x0000000000000000 0x00061c8a2d31f763 0x00061c8a2d31f943    480us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -5    10   FREE                  0x0000000000000006 0x0000000000000000 0x00061c8a2d31f52a 0x00061c8a2d31f75e    564us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -6    10   FREE                  0x0000000000000002 0x0000000000000000 0x00061c8a2d31e4fe 0x00061c8a2d31f4fd   4095us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -7    10   FREE                  0x0000000000000005 0x0000000000000000 0x00061c8a2d31da4b 0x00061c8a2d31e4fb   2736us  
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Jul 06 03:35:17 archlinux kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Jul 06 03:35:17 archlinux kernel: NVRM:      0    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2d32b15b 0x00061c8a2d32b15c      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -1    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000028 0x00061c8a2d324d79 0x00061c8a2d324d7b      2us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -2    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x00061c8a2d2fb48c 0x00061c8a2d2fb48c           
Jul 06 03:35:17 archlinux kernel: NVRM:     -3    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x00061c8a2d24eef8 0x00061c8a2d24eef9      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -4    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2ce35a01 0x00061c8a2ce35a01           
Jul 06 03:35:17 archlinux kernel: NVRM:     -5    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2ce357ff 0x00061c8a2ce35800      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -6    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000027 0x00061c8a2ce33db1 0x00061c8a2ce33db3      2us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -7    4098 GSP_RUN_CPU_SEQUENCER 0x000000000000060a 0x0000000000003fe2 0x00061c8a2ce27b5b 0x00061c8a2ce28c8b   4400us  
Jul 06 03:35:17 archlinux kernel: CPU: 4 PID: 8874 Comm: kworker/u48:11 Tainted: P           OE      6.9.7-arch1-1 #1 44783200744f92500e6484c6d93590bc19db4a83
Jul 06 03:35:17 archlinux kernel: Hardware name: Micro-Star International Co., Ltd. Thin GF63 12UC/MS-16R8, BIOS E16R8IMS.111 03/21/2024
Jul 06 03:35:17 archlinux kernel: Workqueue: async async_run_entry_fn
Jul 06 03:35:17 archlinux kernel: Call Trace:
Jul 06 03:35:17 archlinux kernel:  <TASK>
Jul 06 03:35:17 archlinux kernel:  dump_stack_lvl+0x5d/0x80
Jul 06 03:35:17 archlinux kernel:  _nv012672rm+0x437/0x4b0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv012592rm+0x74/0x330 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv046348rm+0x49f/0x7f0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv049583rm+0xa1/0x150 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv045638rm+0x19e/0x1b0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv047612rm+0x3fc/0x500 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv014430rm+0x42e/0x690 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv045777rm+0x26/0x30 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000751rm+0x55/0x70 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000750rm+0x21b/0x220 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000701rm+0x2ad/0x300 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  rm_power_management+0x22c/0x260 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  ? wait_for_completion+0x91/0x170
Jul 06 03:35:17 archlinux kernel:  nv_power_management+0x92/0x170 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  nvidia_suspend+0x6c/0x100 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  nv_pmops_suspend+0x15/0x30 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  pci_pm_suspend+0x7c/0x170
Jul 06 03:35:17 archlinux kernel:  ? __pfx_pci_pm_suspend+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  dpm_run_callback+0x47/0x150
Jul 06 03:35:17 archlinux kernel:  device_suspend+0x141/0x510
Jul 06 03:35:17 archlinux kernel:  ? try_to_wake_up+0x76/0x660
Jul 06 03:35:17 archlinux kernel:  async_suspend+0x1d/0x30
Jul 06 03:35:17 archlinux kernel:  async_run_entry_fn+0x31/0x140
Jul 06 03:35:17 archlinux kernel:  process_one_work+0x18b/0x350
Jul 06 03:35:17 archlinux kernel:  worker_thread+0x2eb/0x410
Jul 06 03:35:17 archlinux kernel:  ? __pfx_worker_thread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  kthread+0xcf/0x100
Jul 06 03:35:17 archlinux kernel:  ? __pfx_kthread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  ret_from_fork+0x31/0x50
Jul 06 03:35:17 archlinux kernel:  ? __pfx_kthread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  ret_from_fork_asm+0x1a/0x30
Jul 06 03:35:17 archlinux kernel:  </TASK>
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a81 0x4).
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a76 0x2).
Jul 06 03:35:17 archlinux kernel: NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:00 (printing 1 of every 30).  The GPU likely needs to be reset.
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to determine display capabilities
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to tear down Disp
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to determine display capabilities
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to tear down Disp
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: pci_pm_suspend(): nv_pmops_suspend+0x0/0x30 [nvidia] returns -5
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -5
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: failed to suspend async: error -5
Jul 06 03:35:17 archlinux kernel: PM: Some devices failed to suspend, or early wake event detected
Jul 06 03:35:17 archlinux kernel: iwlwifi 0000:00:14.3: WRT: Invalid buffer destination
Jul 06 03:35:17 archlinux kernel: done.

belegdol commented 4 months ago

Also seeing this on RTX 2070 with the proprietary 555.58.02 driver.

anandadfoxx commented 4 months ago

I also hit this issue when the GPU is suspended by PCI-E power management. It can be reproduced by activating NVIDIA drain mode (for hybrid notebooks). I am also using NVIDIA Open Kernel Modules 555.58.02.

sudo nvidia-smi drain -p 0000:01:00.0 -m 1 (my PCI ID for the GPU is 0000:01:00.0)

Distribution: Arch Linux x86_64
GPU: NVIDIA GeForce RTX 3050 Mobile
CPU: Intel Core i5-12500H

❯ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
error
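
The check above can be generalized so that the PCI address does not need to be known in advance. The sketch below lists the runtime PM status of every PCI function whose vendor ID is 0x10de (NVIDIA). The function name `nvidia_runtime_status` and the `SYSFS_PCI` override are hypothetical, added only to make the snippet testable.

```shell
#!/bin/sh
# Print "<pci-address> <runtime_status>" for every NVIDIA PCI function.
# SYSFS_PCI defaults to the real sysfs tree; the override is for testing only.
nvidia_runtime_status() {
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        [ -r "$dev/vendor" ] || continue
        # 0x10de is NVIDIA's PCI vendor ID.
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        status="unknown"
        [ -r "$dev/power/runtime_status" ] && status=$(cat "$dev/power/runtime_status")
        echo "$(basename "$dev") $status"
    done
}
```

On the system described here, a call to `nvidia_runtime_status` would be expected to report `error` for 0000:01:00.0 after the failed suspend.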

This error happened both with and without GSP offload (the latter via the nvidia.NVreg_EnableGpuFirmware=0 kernel parameter).

nvidia-gspoff.log nvidia-gspon.log

cubusXD commented 3 months ago

A similar thing is happening to me on the proprietary drivers 555.58.02. I've tried it on the latest LTS kernel but that doesn't seem to resolve the issue.

nvidia-bug-report.log.gz hang.txt

Gert-dev commented 3 months ago

Same for me, and it almost always reproduces. On occasion it manages to get into suspend, but usually the laptop's power LED just stays on. Pressing something turns the fans on again, but the screen never comes back.

nvidia-bug-report.log.gz - captured after forcibly rebooting after the issue occurred.

Here are some kernel logs as well, since they don't appear to be part of the bug report generated above:

KernelLogs.txt

As can be seen (note that the excerpt is in reverse chronological order, newest entries first), there are a bunch of warnings and kernel backtraces around nv_set_system_power_state before the suspend fails:

aug 02 18:57:31 hephaestus kernel: ---[ end trace 0000000000000000 ]---
aug 02 18:57:31 hephaestus kernel:  </TASK>
aug 02 18:57:31 hephaestus kernel: R13: 00005b578e6140d0 R14: 00007eed2a6085c0 R15: 00007eed2a605ea0
aug 02 18:57:31 hephaestus kernel: R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000008
aug 02 18:57:31 hephaestus kernel: RBP: 00007fff7a951ad0 R08: 0000000000000410 R09: 0000000000000001
aug 02 18:57:31 hephaestus kernel: RDX: 0000000000000008 RSI: 00005b578e6140d0 RDI: 0000000000000001
aug 02 18:57:31 hephaestus kernel: RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007eed2a52c7a4
aug 02 18:57:31 hephaestus kernel: RSP: 002b:00007fff7a951aa8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
aug 02 18:57:31 hephaestus kernel: Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 28 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
aug 02 18:57:31 hephaestus kernel: RIP: 0033:0x7eed2a52c7a4
aug 02 18:57:31 hephaestus kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  ? do_syscall_64+0x8e/0x190
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  ? syscall_exit_to_user_mode+0x73/0x1f0
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  do_syscall_64+0x82/0x190
aug 02 18:57:31 hephaestus kernel:  __x64_sys_write+0x72/0xf0
aug 02 18:57:31 hephaestus kernel:  ? __do_sys_newfstat+0xc7/0x100
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  vfs_write+0xe6/0x4a0
aug 02 18:57:31 hephaestus kernel:  proc_reg_write+0x5a/0xa0
aug 02 18:57:31 hephaestus kernel:  nv_procfs_write_suspend+0xef/0x170 [nvidia dcead3f0be4643c87dfa729fd3a69234fec29f3f]
aug 02 18:57:31 hephaestus kernel:  nv_set_system_power_state+0x1cd/0x470 [nvidia dcead3f0be4643c87dfa729fd3a69234fec29f3f]
aug 02 18:57:31 hephaestus kernel:  nv_revoke_gpu_mappings_locked+0x47/0x70 [nvidia dcead3f0be4643c87dfa729fd3a69234fec29f3f]
aug 02 18:57:31 hephaestus kernel:  unmap_mapping_range+0x116/0x140
aug 02 18:57:31 hephaestus kernel:  zap_page_range_single+0x222/0x260
aug 02 18:57:31 hephaestus kernel:  untrack_pfn+0x59/0x160
aug 02 18:57:31 hephaestus kernel:  follow_phys+0x49/0x110
aug 02 18:57:31 hephaestus kernel:  ? follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel:  ? asm_exc_invalid_op+0x1a/0x20
aug 02 18:57:31 hephaestus kernel:  ? exc_invalid_op+0x19/0xc0
aug 02 18:57:31 hephaestus kernel:  ? handle_bug+0x3c/0x80
aug 02 18:57:31 hephaestus kernel:  ? report_bug+0xe7/0x210
aug 02 18:57:31 hephaestus kernel:  ? follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel:  ? __warn.cold+0x8e/0xf3
aug 02 18:57:31 hephaestus kernel:  ? follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel:  <TASK>
aug 02 18:57:31 hephaestus kernel: Call Trace:
aug 02 18:57:31 hephaestus kernel: PKRU: 55555554
aug 02 18:57:31 hephaestus kernel: CR2: 00007eed2a608650 CR3: 000000010677a000 CR4: 0000000000f50ef0
aug 02 18:57:31 hephaestus kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
aug 02 18:57:31 hephaestus kernel: FS:  00007eed2a3a3b80(0000) GS:ffff9e95cdb00000(0000) knlGS:0000000000000000
aug 02 18:57:31 hephaestus kernel: R13: ffff9e8ed0721080 R14: ffffbdef1247fce0 R15: ffffffffffffffff
aug 02 18:57:31 hephaestus kernel: R10: 00007709f8e52fff R11: ffff9e8f0e5ad600 R12: ffffbdef1247fb70
aug 02 18:57:31 hephaestus kernel: RBP: ffffbdef1247fb78 R08: 0000000000000020 R09: ffffffffffffffff
aug 02 18:57:31 hephaestus kernel: RDX: ffffbdef1247fb70 RSI: 00007709f34b6000 RDI: ffff9e8efb21a8a0
aug 02 18:57:31 hephaestus kernel: RAX: 0000000000000000 RBX: 00007709f34b6000 RCX: ffffbdef1247fb78
aug 02 18:57:31 hephaestus kernel: RSP: 0018:ffffbdef1247fb38 EFLAGS: 00010246
aug 02 18:57:31 hephaestus kernel: Code: e9 ee 8a f1 00 48 25 00 00 00 c0 48 09 d0 c4 e2 f8 f2 c7 75 20 e8 5e e3 ff ff 48 8b 15 57 fa 72 01 48 81 e2 00 00 00 c0 eb 8c <0f> 0b 48 3b 1f 0f 83 6c fe ff ff 41 be ea ff ff ff eb b6 48 8b 7d
aug 02 18:57:31 hephaestus kernel: RIP: 0010:follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel: Hardware name: LENOVO 82WS/LNVNB161216, BIOS LPCN51WW 04/22/2024
aug 02 18:57:31 hephaestus kernel: CPU: 26 PID: 3496 Comm: nvidia-sleep.sh Tainted: G        W  OE      6.10.2-arch1-1.1 #1 856328b22fcd0da354f276ff67275d0fcc220438
aug 02 18:57:31 hephaestus kernel:  ucsi_acpi btintel snd_rn_pci_acp3x libarc4 snd_pcm kvm_amd vboxdrv(OE) typec_ucsi ideapad_laptop snd_acp_config btbcm realtek nvidia_modeset(OE) cfg80211 typec videobuf2_common r8152 snd_timer btmtk sp5100_tco pkcs8_key_parser snd_soc_acpi mdio_devres sparse_keymap hid_multitouch kvm bluetooth crc16 mii mc rapl wdat_wdt pcspkr wmi_bmof k10temp i2c_piix4 snd snd_pci_acp3x libphy rfkill roles mousedev joydev apple_mfi_fastcharge nvidia_uvm(OE) soundcore legion_laptop(OE) i2c_hid_acpi platform_profile crc8 i2c_hid mac_hid nvidia(OE) i2c_dev crypto_user loop nfnetlink ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee hid_generic usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel serio_raw sha512_ssse3 atkbd sha256_ssse3 libps2 sha1_ssse3 vivaldi_fmap aesni_intel nvme crypto_simd cryptd nvme_core xhci_pci ccp i8042 xhci_pci_renesas nvme_auth video serio wmi
aug 02 18:57:31 hephaestus kernel: Modules linked in: overlay snd_seq_dummy snd_hrtimer snd_seq vfat fat r8153_ecm cdc_ether usbnet mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76 snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir amd_atl intel_rapl_msr snd_sof_amd_acp intel_rapl_common snd_sof_pci snd_hda_codec_realtek snd_sof_xtensa_dsp snd_hda_codec_generic snd_sof snd_hda_scodec_component snd_sof_utils snd_hda_codec_hdmi snd_pci_ps snd_amd_sdw_acpi soundwire_amd snd_hda_scodec_tas2781_i2c soundwire_generic_allocation snd_soc_tas2781_fmwlib snd_hda_intel soundwire_bus uvcvideo snd_soc_tas2781_comlib snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi videobuf2_vmalloc snd_soc_core snd_rpl_pci_acp6x uvc snd_hda_codec snd_usbmidi_lib snd_acp_pci vboxnetflt(OE) videobuf2_memops snd_ump vboxnetadp(OE) snd_acp_legacy_common snd_compress snd_hda_core btusb videobuf2_v4l2 snd_rawmidi snd_pci_acp6x ac97_bus mac80211 btrtl snd_hwdep snd_seq_device snd_pci_acp5x snd_pcm_dmaengine nvidia_drm(OE) videodev r8169
aug 02 18:57:31 hephaestus kernel: WARNING: CPU: 26 PID: 3496 at include/linux/rwsem.h:80 follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel: ------------[ cut here ]------------

Followed by a bunch of NVRM errors:

aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pKernelBus->pReadToFlush != NULL || pKernelBus->virtualBar2[GPU_GFID_PF].pCpuMapping != NULL @ kern_bus_gv100.c:388
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_unmap.c:72
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkUnmap: Failed to unmap VA Range 0x19a0000 to 0x19dffff. Status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:489
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1303
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:852
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == unmapStatus @ mmu_walk_sparse.c:95
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkSparsify: Unmap failed with status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_unmap.c:72
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkUnmap: Failed to unmap VA Range 0x19a0000 to 0x19dffff. Status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:489
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1303
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:852
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_sparse.c:84
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkSparsify: Failed to sparsify VA Range 0x19a0000 to 0x19dffff. Status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:489
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1303
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:852
aug 02 18:57:32 hephaestus kernel: [drm:__nv_drm_semsurf_wait_fence_work_cb [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register auto-value-update on pre-wait value for sync FD semaphore surface

I can switch between s2idle and deep sleep modes, and both exhibit the same problems. I also tested with S0ix enabled and without, but to no avail.
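
For readers who want to repeat the s2idle versus deep comparison, the kernel exposes the available suspend modes in /sys/power/mem_sleep, with the active one in square brackets. The sketch below extracts the active mode; the function name `active_sleep_mode` is hypothetical.

```shell
#!/bin/sh
# Extract the active mode (the bracketed entry) from mem_sleep content on stdin.
active_sleep_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage:
#   active_sleep_mode < /sys/power/mem_sleep
# Switching modes for the current boot (needs root):
#   echo s2idle | sudo tee /sys/power/mem_sleep
```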

Gunther-Schulz commented 3 months ago

I get the same problem "NVRM: failed to allocate vmap() page descriptor table!". I tested the proprietary 555 and the open 560 beta driver.

It seems like the issue happens if I have less free system RAM than I have VRAM.
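
The observation above is a hypothesis, not a confirmed cause, but it can be checked before suspending. The sketch below compares available system RAM with total VRAM. It assumes `nvidia-smi` is installed; the function names are hypothetical.

```shell
#!/bin/sh
# MemAvailable from a meminfo-style file, converted from kB to MiB.
mem_available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Total VRAM in MiB, summed across all GPUs reported by nvidia-smi.
vram_total_mib() {
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits |
        awk '{ sum += $1 } END { print sum + 0 }'
}

# Succeeds when the first value (RAM, MiB) is at least the second (VRAM, MiB).
ram_covers_vram() {
    [ "$1" -ge "$2" ]
}

# Usage:
#   ram_covers_vram "$(mem_available_mib)" "$(vram_total_mib)" \
#       || echo "less available RAM than VRAM"
```

If suspend failures line up with the warning, that would support the hypothesis; if they do not, the cause is likely elsewhere.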

My system:

Aug 03 23:32:58 lillypod systemd[1]: Reached target Sleep.
Aug 03 23:32:58 lillypod systemd[1]: Starting Suspend gnome-shell...
Aug 03 23:32:58 lillypod systemd[1]: gnome-shell-suspend.service: Deactivated successfully.
Aug 03 23:32:58 lillypod systemd[1]: Finished Suspend gnome-shell.
Aug 03 23:32:58 lillypod systemd[1]: Starting NVIDIA system suspend actions...
Aug 03 23:32:58 lillypod suspend[113863]: nvidia-suspend.service
Aug 03 23:32:58 lillypod logger[113863]: <13>Aug  3 23:32:58 suspend: nvidia-suspend.service
Aug 03 23:32:58 lillypod wireplumber[2282]: wplua: [string "alsa.lua"]:182: attempt to concatenate a nil value (local 'node_name')
                                            stack traceback:
                                                    [string "alsa.lua"]:182: in function <[string "alsa.lua"]:175>
Aug 03 23:32:58 lillypod wireplumber[2282]: wplua: [string "alsa.lua"]:182: attempt to concatenate a nil value (local 'node_name')
                                            stack traceback:
                                                    [string "alsa.lua"]:182: in function <[string "alsa.lua"]:175>
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/ldac
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/aptx_hd
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_hd
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/aptx
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/aac
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aac
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/opus_g
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/opus_g
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/sbc
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/sbc
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_1
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_0
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_duplex_1
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_duplex_0
Aug 03 23:32:58 lillypod kernel: rfkill: input handler enabled
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/faststream
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/faststream_duplex
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/opus_05
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/opus_05
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/opus_05_duplex
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/opus_05_duplex
Aug 03 23:33:08 lillypod systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Aug 03 23:33:09 lillypod kernel: NVRM: failed to allocate vmap() page descriptor table!
Aug 03 23:33:09 lillypod kernel: ------------[ cut here ]------------
Aug 03 23:33:09 lillypod kernel: WARNING: CPU: 2 PID: 113865 at /var/lib/dkms/nvidia/560.28.03/build/nvidia/nv.c:4598 nv_set_system_power_state+0x40d/0x470 [nvidia]
Aug 03 23:33:09 lillypod kernel: Modules linked in: dm_crypt cbc encrypted_keys trusted asn1_encoder tee rfcomm cmac algif_hash algif_skcipher af_alg hid_logitech_hidpp bnep intel_rapl_msr intel_rapl_common btusb btrtl btintel btbcm btmtk bluetooth snd_seq_dummy ip6table_filter ip>
Aug 03 23:33:09 lillypod kernel:  i2c_piix4 k10temp libcrc32c snd libphy soundcore gpio_amdpt gpio_generic mac_hid vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) i2c_dev crypto_user fuse dm_mod loop nfnetlink bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid >
Aug 03 23:33:09 lillypod kernel: CPU: 2 PID: 113865 Comm: nvidia-sleep.sh Tainted: P           OE      6.6.41-1-MANJARO #1 3ef3dc680c6ec404f036b5609d7802e8bb7ca22a
Aug 03 23:33:09 lillypod kernel: Hardware name: ASRock B650M Pro RS WiFi/B650M Pro RS WiFi, BIOS 3.01 05/13/2024
Aug 03 23:33:09 lillypod kernel: RIP: 0010:nv_set_system_power_state+0x40d/0x470 [nvidia]
Aug 03 23:33:09 lillypod kernel: Code: 0f eb 40 4d 8b a4 24 f8 05 00 00 4d 85 e4 74 33 49 8b bc 24 d0 02 00 00 ba 01 00 00 00 89 de e8 59 c9 ff ff 89 c5 85 c0 74 d9 <0f> 0b 48 c7 c7 80 05 fc c0 41 bd 01 00 00 00 e8 cf 55 9d ce e9 f9
Aug 03 23:33:09 lillypod kernel: RSP: 0018:ffffc9000149ba80 EFLAGS: 00010206
Aug 03 23:33:09 lillypod kernel: RAX: 000000000000ffff RBX: 0000000000000001 RCX: 0000000080020000
Aug 03 23:33:09 lillypod kernel: RDX: ffff888100d705d8 RSI: 0000000000000286 RDI: ffff888100d705d0
Aug 03 23:33:09 lillypod kernel: RBP: 000000000000ffff R08: 0000000000000000 R09: 0000000080020000
Aug 03 23:33:09 lillypod kernel: R10: ffff8881e723b000 R11: 0000000000000000 R12: ffff888100d70000
Aug 03 23:33:09 lillypod kernel: R13: ffff888100d705d0 R14: ffff8881e7238000 R15: ffff8881e7238000
Aug 03 23:33:09 lillypod kernel: FS:  00007f999991bb80(0000) GS:ffff888ffe680000(0000) knlGS:0000000000000000
Aug 03 23:33:09 lillypod kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 03 23:33:09 lillypod kernel: CR2: 000000c000be2000 CR3: 00000001102c0000 CR4: 0000000000f50ee0
Aug 03 23:33:09 lillypod kernel: PKRU: 55555554
Aug 03 23:33:09 lillypod kernel: Call Trace:
Aug 03 23:33:09 lillypod kernel:  <TASK>
Aug 03 23:33:09 lillypod kernel:  ? nv_set_system_power_state+0x40d/0x470 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  ? __warn+0x81/0x130
Aug 03 23:33:09 lillypod kernel:  ? nv_set_system_power_state+0x40d/0x470 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  ? report_bug+0x16f/0x1a0
Aug 03 23:33:09 lillypod kernel:  ? handle_bug+0x3c/0x80
Aug 03 23:33:09 lillypod kernel:  ? exc_invalid_op+0x17/0x70
Aug 03 23:33:09 lillypod kernel:  ? asm_exc_invalid_op+0x1a/0x20
Aug 03 23:33:09 lillypod kernel:  ? nv_set_system_power_state+0x40d/0x470 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  nv_procfs_write_suspend+0xe1/0x160 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  proc_reg_write+0x5a/0xa0
-------------------------------- SNIP ------------------------------------------

josefwells commented 3 months ago

I seem to have similar sleep issues. I found this and wonder whether anyone has tried it: https://gist.github.com/bmcbm/375f14eaa17f88756b4bdbbebbcfd029

If I keep GPU usage low when I sleep it seems to be OK as well, but this other sleep stuff might be getting in the way.

birdie-github commented 2 months ago

Count me in.

Happens every time on suspend when using 4070S + Linux 6.10 and NVIDIA driver 560.35.03:

kernel-backtrace.txt

At first I thought it was specific to the proprietary driver but nope, affects the open source driver as well.

And trying to disable nvidia-sleep.sh results in a system unable to resume (the screen doesn't turn on).

josefwells commented 2 months ago

Suggestions in the other bug helped me. (Arch BTW) Enabling nvidia-* services (suspend, resume, hibernate) and adding the options to the modprobe.d file.

I'm on nvidia-open, which I read NVIDIA recommends for 555.* and beyond.

4080 super, desktop.

birdie-github commented 2 months ago

> Suggestions in the other bug helped me.

Which ones?

> (Arch BTW) Enabling nvidia-* services (suspend, resume, hibernate)

Already enabled.

> and adding the options to the modprobe.d file.

Which ones?

Brensom commented 2 months ago

I have the same error.

josefwells commented 2 months ago

>> Suggestions in the other bug helped me.

> Which ones?

These.

>> (Arch BTW) Enabling nvidia-* services (suspend, resume, hibernate)

> Already enabled.

>> and adding the options to the modprobe.d file.

> Which ones?

options nvidia-drm fbdev=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp

Now I am pretty sure that nvidia-drm is both wrong (nvidia_drm) and not needed, but also that it "works".

The others may have helped or it may just have been enabling the nvidia-* services. Sounds like you are seeing issues, so trying the additional options might help out.
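
To tell whether the modprobe.d options actually reached the loaded module, the driver's parameter listing can be inspected. The sketch below assumes that /proc/driver/nvidia/params uses a one-per-line `Name: value` layout with names lacking the `NVreg_` prefix; the function name `nv_param` is hypothetical.

```shell
#!/bin/sh
# Print the value of one driver parameter from a "Name: value" listing.
# The second argument overrides the file path (used for testing).
nv_param() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }' \
        "${2:-/proc/driver/nvidia/params}"
}

# Usage:
#   nv_param PreserveVideoMemoryAllocations
#   nv_param TemporaryFilePath
#   systemctl is-enabled nvidia-suspend.service nvidia-resume.service nvidia-hibernate.service
```

If the values still show the defaults after a reboot, the options file was probably not picked up, for example because the initramfs was not regenerated.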

birdie-github commented 2 months ago

fbdev=1 used to have a ton of bugs in version 555 (for instance switching back to Xorg from Linux console resulted in a dead system as the screen just turned black), but I may give it a try.

I also did not like options nvidia NVreg_PreserveVideoMemoryAllocations=1, maybe it's time to try it again.

I'm using only this:

options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia-drm modeset=1

birdie-github commented 2 months ago

I've enabled options nvidia-drm modeset=1 fbdev=1 but that didn't help at all.

Let's try without modeset.

birdie-github commented 2 months ago

Nothing works. Still getting a ton of kernel oopses.

birdie-github commented 1 month ago

@aritger

Any updates? It's been quite a while.

aritger commented 1 month ago

This is supposed to be addressed in our upcoming 565.xx release, but I don't know when that is scheduled to be released. Thanks for your patience, and sorry for the delays.

tekstryder commented 1 month ago

There's no specific mention of this one in the list of fixes for the 565.57.01 beta release.

But, here's hoping!

birdie-github commented 1 month ago

This is not fixed for me in 565.57.01. The bug was filed four months ago :(

#719

@aritger any idea why the fix hasn't found its way into this beta?

aritger commented 1 month ago

The fix I mentioned previously is included in 565.57.01. I suspect #719 is a different bug with a similar symptom; I'll follow up there. I apologize for the continued problems. As always, the best thing that will help is to capture a full nvidia-bug-report.log.gz, so that we have all the relevant information about the system configuration.

avoiceofreason commented 3 weeks ago

For info: I'm on NVIDIA driver version 555.58.02.

After a lot of trial, error and luck, I commented out the fbdev=1 line in the /etc/modprobe.d/nvidia-graphics-drivers-kms.conf file and resume started to work again. Oh, and don't forget to run sudo update-initramfs -u and reboot first, of course.

options nvidia-drm modeset=1

#options nvidia-drm fbdev=1

options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp