NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

DRM cause core dump & NVRM dmesg ERROR #489

Open paorie opened 1 year ago

paorie commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

nvidia-open-dkms 530.41.03-3

Does this happen with the proprietary driver (of the same version) as well?

I cannot test this

Operating System and Version

Arch Linux

Kernel Release

6.2.10-zen

Hardware: GPU

NVIDIA GeForce RTX 3060 Laptop GPU

Describe the bug

Machine: Alienware M15R7

Launching a Wayland session causes a core dump. I've tried both Plasma and Hyprland with nvidia-drm as the backend. There is also a strange error message in dmesg saying "NVRM objClInitPcieChipset: *** Chipset Setup Function Error!" and one in journalctl saying "nvidia: module verification failed: signature and/or required key missing - tainting kernel". According to the boot log the module seems to be loaded, but starting a session causes a core dump. If I disable the DRM backend, the Hyprland Wayland session starts with the nvidia driver. KWin does not start at all.

dmesg:

[    0.000000] BIOS-e820: [mem 0x0000000060d11000-0x0000000061571fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x0000000060d11000-0x0000000061571fff] ACPI NVS
[    0.136418] ACPI: PM: Registering ACPI NVS region [mem 0x60d11000-0x61571fff] (8785920 bytes)
[    0.255264] ACPI: \_SB_.PC00.CNVW.WRST: New power resource
[    2.757687] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  530.41.03  Release Build  (archlinux-builder@archalien)  
[    2.783117] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  530.41.03  Release Build  (archlinux-builder@archalien)  
[    2.870841] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[    6.933352] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input13
[    6.933604] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input14
[    6.968959] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input15
[    6.968996] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input16
[    7.111300] iwlwifi 0000:00:14.3: CNVI_SCU_SEQ_DATA_DW9: 0x10
[   67.112494] Asynchronous wait on fence NVIDIA:nvidia.prime:0 timed out (hint:submit_notify [i915])

journalctl --grep "nvidia":

Apr 11 15:28:37 archalien kernel: Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=PARTUUID=f2b1acd4-dfa1-4a46-9f19-31b6c2489e5d zswap.e>
Apr 11 15:28:37 archalien kernel: Kernel command line: initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=PARTUUID=f2b1acd4-dfa1-4a46-9f19-31b6c2489e5d >
Apr 11 15:28:37 archalien kernel: nvidia: loading out-of-tree module taints kernel.
Apr 11 15:28:37 archalien kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Apr 11 15:28:37 archalien kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Apr 11 15:28:37 archalien kernel: nvidia 0000:01:00.0: enabling device (0006 -> 0007)
Apr 11 15:28:37 archalien kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Apr 11 15:28:37 archalien kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  530.41.03  Release Build  (archlinux-builder@archalien)  
Apr 11 15:28:37 archalien kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  530.41.03  Release Build  (archlinux-builder@arc>
Apr 11 15:28:37 archalien kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
Apr 11 15:28:37 archalien kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Apr 11 15:28:37 archalien kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Apr 11 15:28:37 archalien systemd[1]: Starting Load/Save Screen Backlight Brightness of backlight:nvidia_wmi_ec_backlight...
Apr 11 15:28:37 archalien systemd[1]: Finished Load/Save Screen Backlight Brightness of backlight:nvidia_wmi_ec_backlight.
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input13
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input14
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input15
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input16
Apr 11 15:28:38 archalien systemd[1]: Starting NVIDIA Persistence Daemon...
Apr 11 15:28:38 archalien systemd[1]: Starting nvidia-powerd service...
Apr 11 15:28:38 archalien /usr/bin/nvidia-powerd[894]: nvidia-powerd version:1.0(build 1)
Apr 11 15:28:38 archalien systemd[1]: Started NVIDIA Persistence Daemon.
Apr 11 15:28:38 archalien systemd[1]: nvidia-powerd.service: Main process exited, code=exited, status=1/FAILURE
Apr 11 15:28:38 archalien systemd[1]: nvidia-powerd.service: Failed with result 'exit-code'.
Apr 11 15:28:38 archalien systemd[1]: Failed to start nvidia-powerd service.
Apr 11 15:28:47 archalien systemd-coredump[2185]: [🡕] Process 2119 (Hyprland) of user 1000 dumped core.

                                                  Stack trace of thread 2119:
                                                  #0  0x000055ddb5ac99e8 _ZN13CPluginSystem13getAllPluginsEv (Hyprland + 0x1939e8)
                                                  #1  0x000055ddb5a31aee _ZN13CrashReporter18createAndSaveCrashEi (Hyprland + 0xfbaee)

journalctl --grep "kwin":

Apr 11 12:45:51 archalien systemd[2214]: plasma-kwin_wayland.service: Consumed 1.174s CPU time.
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_xkbcommon: XKB: inet:323:58: unrecognized keysym "XF86EmojiPicker"
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_xkbcommon: XKB: inet:324:58: unrecognized keysym "XF86Dictate"
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_core: Parse error in tiles configuration for monitor "ada15eeb-9ed6-5738-a180-bd9fe2361632" : "illegal value" Cre>
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_core: Parse error in tiles configuration for monitor "0d3998b5-12fb-5e5d-9844-298a9a2f96a3" : "illegal value" Cre>
Apr 11 12:46:18 archalien kwin_x11[4178]: kwin_platform_x11_standalone: QOpenGLContext::globalShareContext() is required
Apr 11 12:46:18 archalien kwin_x11[4178]: kwin_scene_opengl: Creating the OpenGL rendering failed:  "Could not initialize rendering context"
Apr 11 12:51:06 archalien systemd[3951]: plasma-kwin_x11.service: Consumed 3.876s CPU time.
Apr 11 12:51:15 archalien kernel: kwin_wayland[9195]: segfault at 0 ip 00007fd0eb81556b sp 00007ffdb36d6ec0 error 4 in libnvidia-allocator.so.530.41.03[7fd0eb80>
Apr 11 12:51:16 archalien systemd-coredump[9350]: [🡕] Process 9195 (kwin_wayland) of user 1000 dumped core.

                                                  Stack trace of thread 9195:
                                                  #0  0x00007fd0eb81556b n/a (nvidia-drm_gbm.so + 0x1556b)
                                                  #1  0x00007fd0eb815838 n/a (nvidia-drm_gbm.so + 0x15838)
                                                  #2  0x00007fd0f805ce59 n/a (libgbm.so.1 + 0x4e59)
                                                  #3  0x00007fd0f805eab1 gbm_create_device (libgbm.so.1 + 0x6ab1)
                                                  #4  0x00007fd0fb161e74 _ZN4KWin10DrmBackend6addGpuERK7QString (libkwin.so.5 + 0x361e74)
                                                  #5  0x00007fd0fb15ef1b _ZN4KWin10DrmBackend10initializeEv (libkwin.so.5 + 0x35ef1b)
                                                  #6  0x0000561b851f1315 n/a (kwin_wayland + 0x5a315)
                                                  #7  0x0000561b851e723c n/a (kwin_wayland + 0x5023c)
                                                  #8  0x00007fd0f863c790 n/a (libc.so.6 + 0x23790)
                                                  #9  0x00007fd0f863c84a __libc_start_main (libc.so.6 + 0x2384a)
                                                  #10 0x0000561b851e8e95 n/a (kwin_wayland + 0x51e95)
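The mangled C++ symbols in the trace above are easier to follow as qualified names. The following is a minimal sketch, not a full demangler: it only handles plain nested names of the form `_ZN<len><name>...E` and ignores argument types, templates, and substitutions. The function name `nested_name` is made up for this example; a real tool such as `c++filt` covers the full scheme.

```python
import re
from typing import Optional


def nested_name(mangled: str) -> Optional[str]:
    """Return 'A::B::c' for a simple Itanium-ABI nested name, else None."""
    if not mangled.startswith("_ZN"):
        return None
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while pos < len(mangled) and mangled[pos].isdigit():
        m = re.match(r"\d+", mangled[pos:])
        length = int(m.group())
        start = pos + m.end()
        parts.append(mangled[start:start + length])
        pos = start + length
    # A plain nested name is closed by 'E'; anything else is out of scope
    # for this sketch (qualifiers, templates, substitutions).
    if not parts or pos >= len(mangled) or mangled[pos] != "E":
        return None
    return "::".join(parts)


if __name__ == "__main__":
    for sym in ("_ZN4KWin10DrmBackend6addGpuERK7QString",
                "_ZN4KWin10DrmBackend10initializeEv"):
        print(sym, "->", nested_name(sym))
```

Applied to frames #4 and #5, this gives `KWin::DrmBackend::addGpu` and `KWin::DrmBackend::initialize`, so the crash happens while KWin's DRM backend is creating a GBM device for the GPU.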

journalctl --grep "hyprland":
Apr 08 10:14:22 archalien sddm-helper[2192]: Starting Wayland user session: "/usr/share/sddm/scripts/wayland-session" "Hyprland"
Apr 08 10:14:23 archalien kernel: Hyprland[2208]: segfault at 10 ip 0000557959c00b28 sp 00007fff9c518740 error 4 in Hyprland[557959ad1000+15b000] likely on CPU 8 >
Apr 08 10:14:23 archalien systemd-coredump[2249]: [🡕] Process 2208 (Hyprland) of user 1000 dumped core.

                                                  Stack trace of thread 2208:
                                                  #0  0x0000557959c00b28 _ZN13CPluginSystem13getAllPluginsEv (Hyprland + 0x18cb28)
                                                  #1  0x0000557959b6b13e _ZN13CrashReporter18createAndSaveCrashEi (Hyprland + 0xf713e)
                                                  #2  0x0000557959b07f3c _Z25handleUnrecoverableSignali (Hyprland + 0x93f3c)
                                                  #3  0x00007f9c6fb69f50 n/a (libc.so.6 + 0x38f50)
                                                  #4  0x00007f9c6fbb88ec n/a (libc.so.6 + 0x878ec)
                                                  #5  0x00007f9c6fb69ea8 raise (libc.so.6 + 0x38ea8)
                                                  #6  0x00007f9c6fb5353d abort (libc.so.6 + 0x2253d)
                                                  #7  0x00007f9c6fe9a833 _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6 + 0x9a833)
                                                  #8  0x00007f9c6fea6d0c _ZN10__cxxabiv111__terminateEPFvvE (libstdc++.so.6 + 0xa6d0c)
                                                  #9  0x00007f9c6fea6d79 _ZSt9terminatev (libstdc++.so.6 + 0xa6d79)
                                                  #10 0x00007f9c6fea6fdd __cxa_throw (libstdc++.so.6 + 0xa6fdd)
                                                  #11 0x0000557959ad5a74 _ZN11CCompositor10initServerEv.cold (Hyprland + 0x61a74)
                                                  #12 0x0000557959afaa2b main (Hyprland + 0x86a2b)
                                                  #13 0x00007f9c6fb54790 n/a (libc.so.6 + 0x23790)
                                                  #14 0x00007f9c6fb5484a __libc_start_main (libc.so.6 + 0x2384a)
                                                  #15 0x0000557959b07e05 _start (Hyprland + 0x93e05)
                                                  ELF object binary architecture: AMD x86-64
Apr 08 10:14:24 archalien sddm-greeter[2278]: Reading from "/usr/local/share/wayland-sessions/hyprland.desktop"
Apr 08 10:14:24 archalien sddm-greeter[2278]: Reading from "/usr/share/wayland-sessions/hyprland.desktop"
Apr 08 10:14:50 archalien kernel: Hyprland[2970]: segfault at 10 ip 0000562e52fe1b28 sp 00007ffe87d89100 error 4 in Hyprland[562e52eb2000+15b000] likely on CPU 6 >
Apr 08 10:14:50 archalien systemd-coredump[2987]: [🡕] Process 2970 (Hyprland) of user 1000 dumped core.

                                                  Stack trace of thread 2970:
                                                  #0  0x0000562e52fe1b28 _ZN13CPluginSystem13getAllPluginsEv (Hyprland + 0x18cb28)
                                                  #1  0x0000562e52f4c13e _ZN13CrashReporter18createAndSaveCrashEi (Hyprland + 0xf713e)
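The kernel "segfault at ..." lines in this log can be decoded mechanically. The sketch below assumes the usual x86 format (`segfault at <addr> ip <ip> sp <sp> error <code> in <object>[<base>+<size>]`) and the x86 page-fault error-code bits (bit 0: page was present, bit 1: write access, bit 2: user mode). The helper name `decode_segfault` and the dictionary keys are made up for this example. Lines that journalctl has truncated (ending in `>`) lose the mapping part, and the sketch then leaves the offset unset.

```python
import re

# Hypothetical helper for x86 "segfault at" kernel messages.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    info = {
        "fault_addr": int(m["addr"], 16),
        "ip": int(m["ip"], 16),
        "present": bool(err & 1),    # False: page was not mapped
        "write": bool(err & 2),      # False: the access was a read
        "user_mode": bool(err & 4),  # True: fault came from user space
        "object": m["obj"],
        "offset_in_mapping": None,
    }
    if m["base"] is not None:
        # Offset relative to the mapping printed in brackets.
        info["offset_in_mapping"] = info["ip"] - int(m["base"], 16)
    return info


if __name__ == "__main__":
    line = ("Hyprland[2208]: segfault at 10 ip 0000557959c00b28 "
            "sp 00007fff9c518740 error 4 in Hyprland[557959ad1000+15b000]")
    info = decode_segfault(line)
    print(info["object"], hex(info["fault_addr"]), hex(info["offset_in_mapping"]))
```

For the first Hyprland line this reports a user-mode read of an unmapped page at address 0x10, which is consistent with a near-null pointer dereference. The computed offset is relative to the mapping shown in brackets, which need not be the object's load base, so it can differ from the `+ 0x...` offsets in the systemd-coredump stack trace.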

nvidia-smi -q:

==============NVSMI LOG==============

Timestamp                                 : Tue Apr 11 16:10:42 2023
Driver Version                            : 530.41.03
CUDA Version                              : 12.1

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3060 Laptop GPU
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-3c621dcd-20d4-109f-7874-ee23c382942e
    Minor Number                          : 0
    VBIOS Version                         : 94.06.29.00.35
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 2560-775-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 530.41.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x256010DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x0B541028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 1000 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 6144 MiB
        Reserved                          : 366 MiB
        Used                              : 195 MiB
        Free                              : 5582 MiB
    BAR1 Memory Usage
        Total                             : 8192 MiB
        Used                              : 8 MiB
        Free                              : 8184 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 4 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 52 C
        GPU Shutdown Temp                 : 105 C
        GPU Slowdown Temp                 : 102 C
        GPU Max Operating Temp            : 87 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : N/A
        Power Draw                        : 17.97 W
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 643.750 mV
    Fabric
        State                             : N/A
        Status                            : N/A

To Reproduce

Enable DRM and try to start a Wayland session. The errors in dmesg appear on every boot.
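For anyone trying to reproduce this, it may help to confirm that DRM kernel modesetting is actually on before starting the session. The sketch below assumes that "enable DRM" here means loading nvidia-drm with `modeset=1`, which the report does not state explicitly. It reads the module parameter from sysfs; the function names are made up for this example, and the file is typically readable only by root, in which case the reader returns None.

```python
from pathlib import Path

# Standard sysfs location for the nvidia-drm "modeset" module parameter.
MODESET_PATH = "/sys/module/nvidia_drm/parameters/modeset"


def parse_modeset(text):
    """Interpret the sysfs value: boolean parameters read back as 'Y' or 'N'."""
    return text.strip() in ("Y", "1")


def nvidia_drm_modeset(param_path=MODESET_PATH):
    """Return True/False for the parameter, or None if it cannot be read
    (module not loaded, or insufficient permissions)."""
    try:
        return parse_modeset(Path(param_path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    state = nvidia_drm_modeset()
    if state is None:
        print("nvidia_drm modeset: unknown (module not loaded or not readable)")
    else:
        print("nvidia_drm modeset:", "enabled" if state else "disabled")
```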

Bug Incidence

Always

nvidia-bug-report.log.gz


More Info

No response

gilvbp commented 1 year ago

Use the 525.105.17 version!

Kaoticz commented 1 year ago

Use the 525.105.17 version!

The same issue occurs in that version. Apparently it happens when you log in with a monitor refresh rate higher than 60Hz. After you log in, changing from 60Hz to a higher frequency is fine. The problem only occurs when you try to log in with a frequency higher than 60Hz.

I'm using a GTX 1070. I'd downgrade to version 525.89.02 as that version didn't cause me any issues, but I'm having trouble compiling it with the current kernel version 6.3.1.

gilvbp commented 1 year ago

@Kaoticz try the new driver, 525.116.04. I'm using it, and so far so good.