NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Nvidia 560.28.03-1 throwing kernel stack trace with linux kernels from 6.10.3 up to 6.10.9 or newer #705

Open mashu opened 1 month ago

mashu commented 1 month ago

NVIDIA Open GPU Kernel Modules Version

560.35.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Debian GNU/Linux trixie/sid

Kernel Release

6.10.9

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 4090 Laptop GPU

Describe the bug

I am getting many errors and a tainted kernel with stack traces in dmesg with the latest NVIDIA driver 560.28.03-1 and Linux kernel 6.10.3 on a Debian GNU/Linux setup (for the full log, see the nvidia-bug-report.log.gz attached to this report).

Short summary:

  1. The error messages are consistently related to the function follow_pte+0x1de/0x200.
  2. In the call traces, we can see NVIDIA-related functions being called: nv_revoke_gpu_mappings+0x67/0xb0 [nvidia], RmHandleIdleSustained+0x39/0x130 [nvidia], rm_execute_work_item+0xe0/0x150 [nvidia].
  3. The module list shows NVIDIA modules loaded: nvidia_uvm(OE), nvidia_drm(OE), nvidia_modeset(OE), nvidia(OE). The (OE) suffix likely indicates these are out-of-tree (externally built) modules, and the NVIDIA modules are the only OE modules I have.
  4. The error is occurring in a kernel thread named "nv_queue", which is likely an NVIDIA driver thread.
  5. The warnings are being triggered at include/linux/rwsem.h:80, which suggests there might be an issue with how the NVIDIA driver is handling read-write semaphores in the kernel.
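
The points above can be checked mechanically against a pasted trace. The following is a minimal Python sketch, not part of the original report, that pulls the reliable frames belonging to one module out of a call trace. The function name `module_frames` and the regular expression are illustrative assumptions; the sketch only relies on the `symbol+0xOFF/0xLEN [module]` frame layout seen in the traces in this thread, where a leading `?` marks a frame the unwinder is unsure about.

```python
import re

# One call-trace frame, e.g.
#   " ? follow_pte+0x20b/0x220"
#   " nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]"
FRAME_RE = re.compile(
    r"(?P<unreliable>\?\s+)?"            # optional "?" marker
    r"(?P<symbol>[A-Za-z_][\w.]*)"       # function name
    r"\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"   # offset/size
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"   # optional "[module]" suffix
)


def module_frames(log_text, module="nvidia"):
    """Return symbols of reliable frames that belong to `module`, in trace order."""
    symbols = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a frame line (registers, headers, ...)
        if match.group("unreliable"):
            continue  # skip "?" frames
        if match.group("module") == module:
            symbols.append(match.group("symbol"))
    return symbols


# A few lines copied from the trace in this report.
SAMPLE = """\
[   50.485524]  ? follow_pte+0x20b/0x220
[   50.485525]  follow_phys+0x4b/0x110
[   50.485532]  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]
[   50.485678]  ? gpumgrGetGpu+0x69/0xa0 [nvidia]
[   50.485781]  rm_execute_work_item+0xe0/0x150 [nvidia]
"""

if __name__ == "__main__":
    print(module_frames(SAMPLE))
```

Feeding it the full dmesg output instead of `SAMPLE` lists every driver function that sits on the path to the warning.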

To Reproduce

Boot 6.10.9 kernel with latest official nvidia driver and check dmesg logs.

Bug Incidence

Always

nvidia-bug-report.log.gz

The nvidia-bug-report.log.gz above includes this trace, but I am also pasting it here for convenience:

[   50.485511] CPU: 14 PID: 1229 Comm: nv_queue Tainted: G        W  OE      6.10.9-amd64 #1  Debian 6.10.9-1
[   50.485511] Hardware name: LENOVO 83AG/LNVNB161216, BIOS MHCN42WW 03/25/2024
[   50.485511] RIP: 0010:follow_pte+0x20b/0x220
[   50.485512] Code: 00 00 00 c0 eb 8b 49 8b 3c 24 e8 00 bf 91 00 e8 bb 5e e1 ff bd ea ff ff ff 5b 89 e8 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc <0f> 0b e9 1e fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90
[   50.485513] RSP: 0018:ffffab47c10afb60 EFLAGS: 00010246
[   50.485513] RAX: 0000000000000000 RBX: 00007fcbe8b8e000 RCX: ffffab47c10afba0
[   50.485514] RDX: ffffab47c10afb98 RSI: 00007fcbe8b8e000 RDI: ffff9c8870728e70
[   50.485514] RBP: ffffab47c10afbe0 R08: ffffab47c10afd38 R09: 0000000000000000
[   50.485515] R10: 000000008040003c R11: 0000000000000000 R12: ffffab47c10afba0
[   50.485515] R13: ffffab47c10afb98 R14: ffff9c8874afb180 R15: 0000000000000000
[   50.485516] FS:  0000000000000000(0000) GS:ffff9c97b3300000(0000) knlGS:0000000000000000
[   50.485516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.485517] CR2: 00007f80c9c7b6b4 CR3: 00000001a2ce6000 CR4: 0000000000f50ef0
[   50.485517] PKRU: 55555554
[   50.485517] Call Trace:
[   50.485518]  <TASK>
[   50.485518]  ? __warn+0x80/0x120
[   50.485519]  ? follow_pte+0x20b/0x220
[   50.485520]  ? report_bug+0x164/0x190
[   50.485521]  ? handle_bug+0x3c/0x80
[   50.485522]  ? exc_invalid_op+0x17/0x70
[   50.485523]  ? asm_exc_invalid_op+0x1a/0x20
[   50.485524]  ? follow_pte+0x20b/0x220
[   50.485525]  follow_phys+0x4b/0x110
[   50.485526]  untrack_pfn+0x57/0x120
[   50.485528]  unmap_single_vma+0xa6/0xe0
[   50.485529]  zap_page_range_single+0x122/0x1d0
[   50.485530]  unmap_mapping_range+0x111/0x140
[   50.485532]  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]
[   50.485584]  RmHandleIdleSustained+0x39/0x130 [nvidia]
[   50.485678]  ? gpumgrGetGpu+0x69/0xa0 [nvidia]
[   50.485781]  rm_execute_work_item+0xe0/0x150 [nvidia]
[   50.485882]  ? os_execute_work_item+0x19/0x80 [nvidia]
[   50.485934]  _main_loop+0x8f/0x150 [nvidia]
[   50.485991]  ? __pfx__main_loop+0x10/0x10 [nvidia]
[   50.486046]  kthread+0xcf/0x100
[   50.486048]  ? __pfx_kthread+0x10/0x10
[   50.486049]  ret_from_fork+0x31/0x50
[   50.486049]  ? __pfx_kthread+0x10/0x10
[   50.486050]  ret_from_fork_asm+0x1a/0x30
[   50.486051]  </TASK>
[   50.486052] ---[ end trace 0000000000000000 ]---
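
The `Tainted: G W OE` field in the first line of the trace above can be decoded letter by letter. A minimal Python sketch follows; the flag meanings are taken from the kernel's tainted-kernels documentation, while the function name `decode_taint` and the parsing regex are assumptions made for illustration. Only the flags that appear in this thread are covered.

```python
import re

# Meanings per the kernel's "Tainted kernels" documentation.
# Only the flags seen in this thread are listed.
TAINT_FLAGS = {
    "G": "no proprietary module loaded",
    "W": "the kernel issued a warning earlier",
    "O": "an out-of-tree module was loaded",
    "E": "an unsigned module was loaded",
}


def decode_taint(header_line):
    """Return (flag, meaning) pairs from a 'CPU: ... Tainted: ...' header line."""
    match = re.search(r"Tainted: ((?:[A-Z] *)+)", header_line)
    if match is None:
        return []  # e.g. "Not tainted"
    flags = [ch for ch in match.group(1) if ch != " "]
    return [(f, TAINT_FLAGS.get(f, "not covered by this sketch")) for f in flags]


HEADER = (
    "[   50.485511] CPU: 14 PID: 1229 Comm: nv_queue "
    "Tainted: G        W  OE      6.10.9-amd64 #1  Debian 6.10.9-1"
)

if __name__ == "__main__":
    for flag, meaning in decode_taint(HEADER):
        print(flag, "=", meaning)
```

Here `W` only records that a warning was already raised, so the `O` and `E` flags are the ones that point at the externally built NVIDIA modules.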

More Info

No response

Tarballwalf commented 1 month ago

Due to this, it seems that X11/Xwayland has stopped working on my end. I cannot even launch any Proton games on Xwayland.

veldenb commented 1 month ago

Same here on 6.11 kernel when I try to enter sleep mode on wayland, Ubuntu 24.10 beta:

2024-09-20T22:59:50.819111+02:00 bernard-desktop kernel: CPU: 27 UID: 0 PID: 15484 Comm: nvidia-sleep.sh Kdump: loaded Tainted: G           OE      6.11.0-7-generic #7-Ubuntu
2024-09-20T22:59:50.819112+02:00 bernard-desktop kernel: Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
2024-09-20T22:59:50.819112+02:00 bernard-desktop kernel: Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII DARK HERO, BIOS 3801 07/30/2021
2024-09-20T22:59:50.819113+02:00 bernard-desktop kernel: RIP: 0010:follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819113+02:00 bernard-desktop kernel: Code: 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 30 e8 1c e4 ff ff 48 8b 15 d5 28 92 01 48 81 e2 00 00 00 c0 e9 7b ff ff ff <0f> 0b e9 56 fe ff ff 48 8b 45 d0 48 8b 38 e8 46 03 e9 00 e8 31 be
2024-09-20T22:59:50.819113+02:00 bernard-desktop kernel: RSP: 0018:ffffb0bb0708f770 EFLAGS: 00010246
2024-09-20T22:59:50.819114+02:00 bernard-desktop kernel: RAX: 0000000000000000 RBX: 0000713de4a06000 RCX: ffffb0bb0708f7c0
2024-09-20T22:59:50.819114+02:00 bernard-desktop kernel: RDX: ffffb0bb0708f7b8 RSI: 0000713de4a06000 RDI: ffff9077da98a398
2024-09-20T22:59:50.819115+02:00 bernard-desktop kernel: RBP: ffffb0bb0708f7a8 R08: ffffb0bb0708f978 R09: 0000000000000000
2024-09-20T22:59:50.819115+02:00 bernard-desktop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffb0bb0708f808
2024-09-20T22:59:50.819115+02:00 bernard-desktop kernel: R13: 0000000000000000 R14: ffffb0bb0708f7b8 R15: ffff9077d24c9080
2024-09-20T22:59:50.819116+02:00 bernard-desktop kernel: FS:  00007d42cce13740(0000) GS:ffff907ecef80000(0000) knlGS:0000000000000000
2024-09-20T22:59:50.819116+02:00 bernard-desktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-09-20T22:59:50.819116+02:00 bernard-desktop kernel: CR2: 0000000086d79000 CR3: 0000000110136000 CR4: 0000000000f50ef0
2024-09-20T22:59:50.819117+02:00 bernard-desktop kernel: PKRU: 55555554
2024-09-20T22:59:50.819117+02:00 bernard-desktop kernel: Call Trace:
2024-09-20T22:59:50.819125+02:00 bernard-desktop kernel:  <TASK>
2024-09-20T22:59:50.819125+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819126+02:00 bernard-desktop kernel:  ? show_trace_log_lvl+0x273/0x310
2024-09-20T22:59:50.819126+02:00 bernard-desktop kernel:  ? show_trace_log_lvl+0x273/0x310
2024-09-20T22:59:50.819128+02:00 bernard-desktop kernel:  ? follow_phys+0x4c/0x110
2024-09-20T22:59:50.819129+02:00 bernard-desktop kernel:  ? show_regs.part.0+0x22/0x30
2024-09-20T22:59:50.819129+02:00 bernard-desktop kernel:  ? show_regs.cold+0x8/0x10
2024-09-20T22:59:50.819129+02:00 bernard-desktop kernel:  ? follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819130+02:00 bernard-desktop kernel:  ? __warn.cold+0xa7/0x101
2024-09-20T22:59:50.819130+02:00 bernard-desktop kernel:  ? follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819130+02:00 bernard-desktop kernel:  ? report_bug+0x114/0x160
2024-09-20T22:59:50.819131+02:00 bernard-desktop kernel:  ? handle_bug+0x51/0xa0
2024-09-20T22:59:50.819131+02:00 bernard-desktop kernel:  ? exc_invalid_op+0x18/0x80
2024-09-20T22:59:50.819131+02:00 bernard-desktop kernel:  ? asm_exc_invalid_op+0x1b/0x20
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  ? follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  follow_phys+0x4c/0x110
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  untrack_pfn+0x55/0x130
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  unmap_single_vma+0xbc/0xf0
2024-09-20T22:59:50.819133+02:00 bernard-desktop kernel:  zap_page_range_single+0x138/0x210
2024-09-20T22:59:50.819133+02:00 bernard-desktop kernel:  unmap_mapping_range+0x119/0x140
2024-09-20T22:59:50.819133+02:00 bernard-desktop kernel:  nv_revoke_gpu_mappings_locked+0x46/0x80 [nvidia]
2024-09-20T22:59:50.819134+02:00 bernard-desktop kernel:  nv_set_system_power_state+0x1d6/0x480 [nvidia]
2024-09-20T22:59:50.819134+02:00 bernard-desktop kernel:  nv_procfs_write_suspend+0x102/0x1b0 [nvidia]
2024-09-20T22:59:50.819134+02:00 bernard-desktop kernel:  proc_reg_write+0x6c/0xb0
2024-09-20T22:59:50.819135+02:00 bernard-desktop kernel:  vfs_write+0x107/0x490
2024-09-20T22:59:50.819135+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819135+02:00 bernard-desktop kernel:  ksys_write+0x71/0x100
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  __x64_sys_write+0x19/0x30
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  x64_sys_call+0x7e/0x22b0
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  do_syscall_64+0x7e/0x170
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819137+02:00 bernard-desktop kernel:  ? __do_sys_newfstat+0x76/0x80
2024-09-20T22:59:50.819159+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? syscall_exit_to_user_mode+0x4e/0x250
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? do_syscall_64+0x8a/0x170
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819161+02:00 bernard-desktop kernel:  ? filp_flush+0x57/0x90
2024-09-20T22:59:50.819161+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819162+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819162+02:00 bernard-desktop kernel:  ? syscall_exit_to_user_mode+0x4e/0x250
2024-09-20T22:59:50.819163+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819163+02:00 bernard-desktop kernel:  ? do_syscall_64+0x8a/0x170
2024-09-20T22:59:50.819163+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819169+02:00 bernard-desktop kernel:  ? irqentry_exit_to_user_mode+0x43/0x250
2024-09-20T22:59:50.819170+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819170+02:00 bernard-desktop kernel:  ? irqentry_exit+0x43/0x50
2024-09-20T22:59:50.819170+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819171+02:00 bernard-desktop kernel:  ? exc_page_fault+0x96/0x1c0
2024-09-20T22:59:50.819171+02:00 bernard-desktop kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-09-20T22:59:50.819171+02:00 bernard-desktop kernel: RIP: 0033:0x7d42ccb26274
2024-09-20T22:59:50.819172+02:00 bernard-desktop kernel: Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d f5 2d 0f 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
2024-09-20T22:59:50.819172+02:00 bernard-desktop kernel: RSP: 002b:00007ffef22725d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
2024-09-20T22:59:50.819172+02:00 bernard-desktop kernel: RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007d42ccb26274
2024-09-20T22:59:50.819173+02:00 bernard-desktop kernel: RDX: 0000000000000008 RSI: 00005f6c7f01e520 RDI: 0000000000000001
2024-09-20T22:59:50.819173+02:00 bernard-desktop kernel: RBP: 00007ffef2272600 R08: 0000000000000000 R09: 0000000000000001
2024-09-20T22:59:50.819174+02:00 bernard-desktop kernel: R10: 00005f6c7f01e510 R11: 0000000000000202 R12: 0000000000000008
2024-09-20T22:59:50.819174+02:00 bernard-desktop kernel: R13: 00005f6c7f01e520 R14: 00007d42ccc125c0 R15: 00007d42ccc0fea0
2024-09-20T22:59:50.819179+02:00 bernard-desktop kernel:  </TASK>
2024-09-20T22:59:50.819179+02:00 bernard-desktop kernel: ---[ end trace 0000000000000000 ]---
2024-09-20T22:59:50.819179+02:00 bernard-desktop kernel: ------------[ cut here ]------------

nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:0B:00.0  On |                  N/A |
|  0%   37C    P8             18W /  320W |     535MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

birdie-github commented 1 month ago

This has been known for months: see #662

No need to create dupes.

Though actually it may help NVIDIA prioritize fixing this bug because it's annoying as hell.

I got 40KB worth of back traces on every suspend, and now I simply power off the PC entirely, since I got fed up with this.

MaxKh commented 1 month ago

> now I simply power off the PC entirely, since I got fed up with this.

Same here. 560.35.03 and earlier. Archlinux, GeForce RTX 3050 Ti Laptop

veldenb commented 1 month ago

> This has been known for months: see #662
>
> No need to create dupes.
>
> Though actually it may help NVIDIA prioritize fixing this bug because it's annoying as hell.
>
> I got 40KB worth of back traces on every suspend, and now I simply power off the PC entirely, since I got fed up with this.

Good to know, following #662 :)

mashu commented 1 month ago

> This has been known for months: see #662
>
> No need to create dupes.
>
> Though actually it may help NVIDIA prioritize fixing this bug because it's annoying as hell.
>
> I got 40KB worth of back traces on every suspend, and now I simply power off the PC entirely, since I got fed up with this.

It's important to keep issues distinct and avoid mislabeling them as duplicates without clear evidence. If you're experiencing suspend-related problems, it would be best to discuss those in a thread specifically addressing that issue, rather than here.

This bug report has a totally different stack trace signature than #662, and the original report didn't mention any suspend-related issues.

Theluga commented 1 month ago

I don't know if #662 is related, but on the closed-source side this error is already known to Nvidia (Nvidia forum).

I have the same problem on Arch Linux, and my workaround was to temporarily use the linux-lts kernel, 6.6.52-1-lts.

The problem has been reported on the Arch forums since July (Arch Forum).

I hope they fix this soon, because 6.10 basically cannot run the Nvidia drivers, open or not, without errors.
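
The LTS workaround above amounts to staying on a kernel older than the releases reported in this thread. As a rough Python sketch, a release string can be compared against that range. The threshold of 6.10.3 is an assumption taken from the issue title, not a confirmed first-bad version, and the names `parse_release` and `in_reported_range` are hypothetical.

```python
import platform
import re

# Assumption: first affected release as stated in the issue title.
# Earlier 6.10 releases are not confirmed either way in this thread.
FIRST_REPORTED_AFFECTED = (6, 10, 3)


def parse_release(release):
    """Turn '6.10.9-amd64' into (6, 10, 9); a missing patch level counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def in_reported_range(release, first_affected=FIRST_REPORTED_AFFECTED):
    """True if `release` is at or above the first release reported as affected."""
    return parse_release(release) >= first_affected


def running_kernel_in_reported_range():
    """Same check for the currently running kernel (as given by `uname -r`)."""
    return in_reported_range(platform.release())
```

For the kernels mentioned in this thread, `in_reported_range("6.6.52-1-lts")` is false, while `"6.10.9-amd64"` and `"6.11.0-7-generic"` both fall in the reported range.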