NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

6.12: drm_open_helper RIP #712

Open ptr1337 opened 1 month ago

ptr1337 commented 1 month ago

NVIDIA Open GPU Kernel Modules Version

ed4be649623435ebb04f5e93f859bf46d977daa4

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

CachyOS (ArchLinux)

Kernel Release

6.12.0rc1

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 4070 SUPER (UUID: GPU-8c5baf85-cb1f-fe26-95d5-ff3fd51249bb)

Describe the bug

Since the 6.12.0-rc1 release, the kernel DRM helper crashes with the 560.35.03 driver.

The following patches were pulled in to make the driver compatible with 6.12. They were extracted from the 550.120 release:

drm_fbdev fixup for 6.11+: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0004-6.11-Add-fix-for-fbdev.patch

drm_output_poll_changed check for 6.12: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0005-6.12-drm_outpull_pill-changed-check.patch

An additional patch is needed for the module to compile (the incompatibility was introduced in commit https://github.com/torvalds/linux/commit/32f51ead3d7771cdec29f75e08d50a76d2c6253d ):

diff --git a/kernel-open/nvidia-uvm/uvm_hmm.c b/kernel-open/nvidia-uvm/uvm_hmm.c
index 93e64424..dc64184e 100644
--- a/kernel-open/nvidia-uvm/uvm_hmm.c
+++ b/kernel-open/nvidia-uvm/uvm_hmm.c
@@ -2694,7 +2694,7 @@ static NV_STATUS dmamap_src_sysmem_pages(uvm_va_block_t *va_block,
                 continue;
             }

-            if (PageSwapCache(src_page)) {
+            if (folio_test_swapcache(page_folio(src_page))) {
                 // TODO: Bug 4050579: Remove this when swap cached pages can be
                 // migrated.
                 status = NV_WARN_MISMATCHED_TARGET;

With these patches, the DKMS compilation succeeds and the driver works fine with the 6.11.x kernel.

When booting into 6.12.0-rc1, the driver crashes at drm_open_helper and no graphical interface is available anymore. The TTY still works fine. The following is visible in the dmesg log:

[    5.090174] Console: switching to colour frame buffer device 240x67
[    5.090176] nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
[    5.096243] ------------[ cut here ]------------
[    5.096244] WARNING: CPU: 0 PID: 453 at drivers/gpu/drm/drm_file.c:312 drm_open_helper+0x135/0x150
[    5.096249] Modules linked in: nvidia_uvm(OE) nvidia_drm(OE) drm_ttm_helper btrfs ttm blake2b_generic nvidia_modeset(OE) libcrc32c crc32c_generic xor hid_generic raid6_pq nvme nvme_core crc32c_intel video sha256_ssse3 usbhid nvme_auth wmi nvidia(OE)
[    5.096255] CPU: 0 UID: 0 PID: 453 Comm: plymouthd Tainted: G           OE      6.12.0-rc1-1-cachyos-rc #1 12df37afa12b373ced2670803975698fbda2ce5d
[    5.096257] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[    5.096257] Hardware name: ASRock X670E Pro RS/X670E Pro RS, BIOS 3.08 09/18/2024
[    5.096258] RIP: 0010:drm_open_helper+0x135/0x150
[    5.096259] Code: 5d 41 5c c3 cc cc cc cc 48 89 df e8 c5 82 fe ff 85 c0 0f 84 7a ff ff ff 48 89 df 89 44 24 0c e8 c1 f9 ff ff 8b 44 24 0c eb d1 <0f> 0b b8 ea ff ff ff eb c8 b8 ea ff ff ff eb c1 b8 f0 ff ff ff eb
[    5.096260] RSP: 0018:ffffa643409ffb20 EFLAGS: 00010246
[    5.096261] RAX: ffffffffc15df380 RBX: ffff89f744740f28 RCX: 0000000000000000
[    5.096262] RDX: ffff89f755ee0000 RSI: ffff89f744740f28 RDI: ffff89f74df1cd80
[    5.096262] RBP: ffff89f74df1cd80 R08: 0000000000000006 R09: ffff89f740213cd0
[    5.096263] R10: 00000000000000e2 R11: 0000000000000002 R12: ffff89f75735a000
[    5.096263] R13: ffffffffc15df380 R14: 00000000ffffffed R15: ffffa643409ffe1c
[    5.096264] FS:  00007f6b595ce480(0000) GS:ffff8a065ce00000(0000) knlGS:0000000000000000
[    5.096264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.096265] CR2: 000055da04c46558 CR3: 000000010d18c000 CR4: 0000000000f50ef0
[    5.096265] PKRU: 55555554
[    5.096266] Call Trace:
[    5.096267]  <TASK>
[    5.096267]  ? drm_open_helper+0x135/0x150
[    5.096268]  ? __warn.cold+0xad/0x116
[    5.096270]  ? drm_open_helper+0x135/0x150
[    5.096272]  ? report_bug+0x127/0x170
[    5.096273]  ? handle_bug+0x58/0x90
[    5.096275]  ? exc_invalid_op+0x1b/0x80
[    5.096276]  ? asm_exc_invalid_op+0x1a/0x20
[    5.096279]  ? drm_open_helper+0x135/0x150
[    5.096279]  drm_open+0x81/0x110
[    5.096280]  drm_stub_open+0xaf/0x100
[    5.096282]  chrdev_open+0xc5/0x260
[    5.096285]  ? __pfx_chrdev_open+0x10/0x10
[    5.096286]  do_dentry_open+0x14b/0x490
[    5.096287]  vfs_open+0x30/0xe0
[    5.096289]  path_openat+0x84d/0x1320
[    5.096290]  ? __alloc_pages_noprof+0x183/0x350
[    5.096292]  do_filp_open+0xd2/0x180
[    5.096293]  do_sys_openat2+0xca/0x100
[    5.096294]  __x64_sys_openat+0x55/0xa0
[    5.096295]  do_syscall_64+0x82/0x190
[    5.096296]  ? handle_mm_fault+0x1d9/0x2e0
[    5.096297]  ? do_user_addr_fault+0x38d/0x6c0
[    5.096299]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    5.096300] RIP: 0033:0x7f6b59899ae5
[    5.096301] Code: 75 53 89 f0 f7 d0 a9 00 00 41 00 74 48 80 3d d1 b5 0d 00 00 74 6c 45 89 e2 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 8f 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[    5.096302] RSP: 002b:00007fffbdc08760 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[    5.096303] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f6b59899ae5
[    5.096303] RDX: 0000000000000002 RSI: 000055da04c42a40 RDI: 00000000ffffff9c
[    5.096303] RBP: 000055da04c42a40 R08: 0000000000000000 R09: 0000000000000007
[    5.096304] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[    5.096304] R13: 00007f6b599a1a50 R14: 000000000000000b R15: 000055da04c43e30
[    5.096305]  </TASK>
[    5.096305] ---[ end trace 0000000000000000 ]---
[    5.173332] systemd-journald[355]: Received SIGTERM from PID 1 (systemd).
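
A note on reading the trace above: the `Code:` line marks the trapping instruction with `<0f>`. The bytes `0f 0b` are the x86 `ud2` instruction the kernel uses to raise a WARN, and the `b8 ea ff ff ff` that follows is a `mov` of a 32-bit immediate into `%eax`. The minimal C sketch below (helper names are hypothetical, not part of any kernel or driver API) decodes that immediate, which suggests that drm_open_helper returns -EINVAL right after the warning:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bytes copied from the "Code:" line of the trace, starting at the
 * instruction marked with <0f>. */
static const uint8_t trapped_bytes[] = { 0x0f, 0x0b, 0xb8, 0xea, 0xff, 0xff, 0xff };

/* Returns 1 if the two bytes at p encode the x86 ud2 instruction (0f 0b). */
static int is_ud2(const uint8_t *p)
{
    return p[0] == 0x0f && p[1] == 0x0b;
}

/* Decodes a little-endian 32-bit immediate, as found after the x86
 * opcode b8 (mov imm32 into %eax). */
static int32_t decode_imm32_le(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}

/* Value loaded into %eax after the warning: skips ud2 (2 bytes) and the
 * b8 opcode byte, then decodes the immediate. */
static int32_t value_loaded_after_warn(void)
{
    return decode_imm32_le(&trapped_bytes[3]);
}
```

This only interprets the seven bytes quoted from the log; it is not a general disassembler, and a real disassembly of the module would be the authoritative source.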

To Reproduce

  1. Compile the 6.12.0-rc1 kernel
  2. Apply the above-mentioned patches to 560.35.03
  3. Compile the module and boot into the new kernel

Bug Incidence

Always

nvidia-bug-report.log.gz


More Info

No response

philmmanjaro commented 1 month ago

What happens if you revert that upstream kernel change? That is what I did before, and it made the drivers compile without additional patches: https://gitlab.manjaro.org/packages/core/linux612/-/blob/ec1f53f77fd3f92f7cd4eeed444a341d8ded3291/revert-nvidia-446d0f48.patch

mtijanic commented 1 month ago

Thanks! Tracked internally as NV bug 4888621.

joanbm commented 1 month ago

This may be related to commit 641bb4394f40 ("fs: move FMODE_UNSIGNED_OFFSET to fop_flags"). At least for nvidia-470xx, it's fixed by adding the .fop_flags = FOP_UNSIGNED_OFFSET line from this patch. Though for me the kernel didn't fully crash; it just failed to detect the adapters correctly.
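
To illustrate why a missing flag would produce this failure, here is a minimal userspace sketch. It assumes that, after that commit, drm_open_helper warns and returns -EINVAL when the driver's file_operations lack FOP_UNSIGNED_OFFSET. The struct, the function, and the flag value below are mock stand-ins for illustration only, not the kernel's definitions or the actual driver patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Mock flag value for illustration; the real flag is a fop_flags_t bit
 * defined in the kernel's include/linux/fs.h. */
#define FOP_UNSIGNED_OFFSET (1u << 5)

/* Mock of the one field that matters here; the real struct
 * file_operations has many more members. */
struct mock_file_operations {
    unsigned int fop_flags;
};

/* Models only the assumed flag check; the real drm_open_helper() does
 * much more work after it. */
static int mock_drm_open_helper(const struct mock_file_operations *fops)
{
    if (!(fops->fop_flags & FOP_UNSIGNED_OFFSET)) {
        fprintf(stderr, "WARNING: fops lack FOP_UNSIGNED_OFFSET\n");
        return -EINVAL;
    }
    return 0;
}

/* A driver built before the change: the flag is never set. */
static const struct mock_file_operations fops_without_flag = {
    .fop_flags = 0,
};

/* The same driver with the one-line fix described above. */
static const struct mock_file_operations fops_with_flag = {
    .fop_flags = FOP_UNSIGNED_OFFSET,
};
```

Under this model, opening the device with `fops_without_flag` fails with -EINVAL on every attempt, which would match a display stack that cannot open the DRM node, while `fops_with_flag` succeeds.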

ptr1337 commented 1 month ago

@joanbm It seems this patch works, and I was able to boot properly into the 6.12 kernel. One more patch was required for a successful DKMS compilation, due to upstream changes:

diff --git a/kernel-open/nvidia-uvm/uvm_hmm.c b/kernel-open/nvidia-uvm/uvm_hmm.c
index 93e64424..dc64184e 100644
--- a/kernel-open/nvidia-uvm/uvm_hmm.c
+++ b/kernel-open/nvidia-uvm/uvm_hmm.c
@@ -2694,7 +2694,7 @@ static NV_STATUS dmamap_src_sysmem_pages(uvm_va_block_t *va_block,
                 continue;
             }

-            if (PageSwapCache(src_page)) {
+            if (folio_test_swapcache(page_folio(src_page))) {
                 // TODO: Bug 4050579: Remove this when swap cached pages can be
                 // migrated.
                 status = NV_WARN_MISMATCHED_TARGET;

Commit: https://github.com/CachyOS/CachyOS-PKGBUILDS/commit/3352d048906d755e6b49d0eee5bb86766db99bd2

Binary-Eater commented 1 week ago