NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

8K display crashing nvidia GPU, blank screen #484

Open dllu opened 1 year ago

dllu commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

530.41.03

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Arch Linux

Kernel Release

6.2.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 22 Mar 2023 22:52:35 +0000 x86_64 GNU/Linux

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-80b59546-cdef-7dda-b99f-e4e532d3726d)

Describe the bug

Upon attempting to start an x server with startx, the display is blank (it says "Check Device Power")

When I SSH into the machine, dmesg contains the following lines:

[   23.138228] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[   39.156353] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 12
[  158.666001] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1051
[  166.672768] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1060
[  174.679521] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1060
[  198.699702] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1060
[  206.706414] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:7:0:1060
[  214.713127] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1051
[  222.719824] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1060
[  230.726524] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1060
[  254.746614] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1060
[  262.753310] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:7:0:1060
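The timestamps on these errors are fairly regular. As a minimal sketch (the helper names are made up for illustration, and only the first five quoted lines are used), the gaps between consecutive timeouts can be computed like this:

```python
import re

# First five "Idling display engine timed out" lines quoted above.
DMESG = """\
[  158.666001] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1051
[  166.672768] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1060
[  174.679521] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1060
[  198.699702] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1060
[  206.706414] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:7:0:1060
"""

# Matches a dmesg timestamp followed by the idle-timeout error text.
LINE_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+nvidia-modeset: ERROR: .*Idling display engine timed out"
)


def timeout_timestamps(text):
    """Return the kernel timestamp (seconds) of every idle-timeout error."""
    stamps = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps


def gaps(stamps):
    """Return the time between consecutive timestamps, rounded to milliseconds."""
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


stamps = timeout_timestamps(DMESG)
print(gaps(stamps))
```

The gaps come out at roughly 8 s, with one gap of roughly 24 s (about three times as long). That would be consistent with a fixed timeout per idle attempt, but the log alone does not confirm it.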

When running nvidia-smi, it hangs for a long time, during which the following lines are printed in dmesg:

[  479.022059] nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
[  479.022626] nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
[  491.488862] INFO: task nvidia-modeset/:285 blocked for more than 122 seconds.
[  491.488866]       Tainted: P           OE      6.2.8-arch1-1 #1
[  491.488868] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  491.488869] task:nvidia-modeset/ state:D stack:0     pid:285   ppid:2      flags:0x00004000
[  491.488874] Call Trace:
[  491.488875]  <TASK>
[  491.488877]  __schedule+0x3c8/0x12e0
[  491.488883]  ? __schedule+0x3d0/0x12e0
[  491.488888]  schedule+0x5e/0xd0
[  491.488890]  schedule_timeout+0x151/0x160
[  491.488895]  __down_common+0x11a/0x230
[  491.488899]  down+0x47/0x60
[  491.488902]  nvkms_kthread_q_callback+0x9b/0x130 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[  491.488943]  _main_loop+0x93/0x150 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[  491.488982]  ? __pfx__main_loop+0x10/0x10 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[  491.489023]  kthread+0xde/0x110
[  491.489026]  ? __pfx_kthread+0x10/0x10
[  491.489029]  ret_from_fork+0x2c/0x50
[  491.489035]  </TASK>

However, after a few minutes, nvidia-smi successfully runs.

Running nvidia-bug-report.sh also hangs, during which the following are printed out in dmesg:

[ 1010.499964] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device HDMI-0
[ 1010.499995] BUG: kernel NULL pointer dereference, address: 000000000000022c
[ 1010.500015] #PF: supervisor read access in kernel mode
[ 1010.500027] #PF: error_code(0x0000) - not-present page
[ 1010.500038] PGD 0 P4D 0
[ 1010.500045] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1010.500055] CPU: 17 PID: 285 Comm: nvidia-modeset/ Tainted: P           OE      6.2.8-arch1-1 #1 97507a85a20085e4c7bd7722b8899840c7d0bffd
[ 1010.500084] Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 4002 06/15/2021
[ 1010.500104] RIP: 0010:_nv002252kms+0xa9/0x1c0 [nvidia_modeset]
[ 1010.500137] Code: c6 43 0c 00 89 43 08 49 8b 85 58 0a 00 00 48 8b 80 98 01 00 00 48 85 c0 74 a7 4c 89 ef e8 af d5 fe f0 b9 04 00 00 00 48 63 f1 <8b> 7c f0 08 85 ff 74 5f 48 8d 04 f0 0f b7 70 08 66 89 73 04 0f b7
[ 1010.500177] RSP: 0018:ffffaf9ec12fb5a0 EFLAGS: 00010202
[ 1010.500189] RAX: 0000000000000204 RBX: ffff90f1d6074188 RCX: 0000000000000004
[ 1010.500206] RDX: 0000000001e00280 RSI: 0000000000000004 RDI: ffff90f1d632e008
[ 1010.500222] RBP: ffffaf9ec12fb5e0 R08: 0000000000000400 R09: 0000000000000400
[ 1010.500237] R10: 0000000000000000 R11: 000000000000005f R12: ffff90f1d632e698
[ 1010.500253] R13: ffff90f1d632e008 R14: ffff90f1d6074170 R15: ffff90f1d632e7d8
[ 1010.500268] FS:  0000000000000000(0000) GS:ffff9100af040000(0000) knlGS:0000000000000000
[ 1010.500286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1010.500299] CR2: 000000000000022c CR3: 000000010f872000 CR4: 0000000000350ee0
[ 1010.500315] Call Trace:
[ 1010.500321]  <TASK>
[ 1010.500326]  _nv000077kms+0x13a/0x180 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500363]  _nv002272kms+0x28b/0x660 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500398]  ? vsnprintf+0x2cb/0x550
[ 1010.500410]  ? _nv002412kms+0xa4/0xc0 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500449]  _nv000733kms+0x13f/0x380 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500486]  ? _nv000733kms+0xe8/0x380 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500523]  ? _nv002261kms+0x38/0x40 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500561]  ? __update_load_avg_se+0x2b8/0x320
[ 1010.500573]  _nv002766kms+0x4ab/0x610 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500610]  ? _nv002766kms+0x46d/0x610 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500647]  ? __switch_to_asm+0x3e/0x80
[ 1010.500658]  ? finish_task_switch.isra.0+0x90/0x2d0
[ 1010.500671]  ? _nv000409kms+0x80/0x80 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500706]  _nv000735kms+0x67/0x90 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500741]  nvKmsIoctl+0xf9/0x270 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500776]  nvkms_ioctl_from_kapi+0x6b/0xc0 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500814]  _nv000393kms+0x7d/0x200 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.500855]  ? check_preempt_curr+0x61/0x70
[ 1010.500865]  nv_drm_connector_get_modes+0x9e/0x150 [nvidia_drm f3482743a3e3b1b5774ce7e844d4f7a1db749f85]
[ 1010.500891]  drm_helper_probe_single_connector_modes+0x1c8/0x520
[ 1010.500907]  nv_drm_output_poll_changed+0x89/0xd0 [nvidia_drm f3482743a3e3b1b5774ce7e844d4f7a1db749f85]
[ 1010.500932]  drm_kms_helper_hotplug_event+0x2a/0x40
[ 1010.500943]  nv_drm_event_callback+0x51/0x90 [nvidia_drm f3482743a3e3b1b5774ce7e844d4f7a1db749f85]
[ 1010.501533]  nvKmsKapiHandleEventQueueChange+0xa3/0xd0 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.502153]  _main_loop+0x93/0x150 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.502779]  ? __pfx__main_loop+0x10/0x10 [nvidia_modeset 9bc24184cede2bad48d296aa366c1be1adaa2d72]
[ 1010.503407]  kthread+0xde/0x110
[ 1010.504007]  ? __pfx_kthread+0x10/0x10
[ 1010.504606]  ret_from_fork+0x2c/0x50
[ 1010.505203]  </TASK>
[ 1010.505793] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter bridge cmac algif_hash algif_skcipher af_alg overlay bnep 8021q garp mrp stp nct6775 llc nct6775_core hwmon_vid joydev btusb btrtl btbcm btintel btmtk vboxnetflt(OE) vboxnetadp(OE) bluetooth hid_apple mousedev apple_mfi_fastcharge ecdh_generic vboxdrv(OE) nls_iso8859_1 vfat fat iwlmvm mac80211 libarc4 cdc_acm iwlwifi uas usb_storage r8169 cfg80211 realtek mdio_devres libphy intel_rapl_msr intel_rapl_common snd_hda_codec_realtek edac_mce_amd snd_hda_codec_hdmi snd_hda_codec_generic kvm_amd snd_hda_intel uvcvideo snd_intel_dspcfg kvm videobuf2_vmalloc snd_intel_sdw_acpi snd_usb_audio videobuf2_memops snd_hda_codec irqbypass videobuf2_v4l2 crct10dif_pclmul snd_usbmidi_lib snd_hda_core crc32_pclmul snd_rawmidi polyval_clmulni polyval_generic snd_hwdep snd_seq_device videodev gf128mul eeepc_wmi
[ 1010.505831]  snd_pcm videobuf2_common ghash_clmulni_intel sha512_ssse3 asus_wmi snd_timer usbhid mc aesni_intel sp5100_tco ledtrig_audio sparse_keymap crypto_simd cryptd snd platform_profile rapl ccp soundcore rfkill pcspkr k10temp i2c_piix4 wmi_bmof acpi_cpufreq mac_hid usbip_host usbip_core dm_multipath crypto_user fuse dm_mod loop bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 nvme crc32c_intel nvme_core xhci_pci nvme_common xhci_pci_renesas nvidia_drm(POE) nvidia_uvm(POE) nvidia_modeset(POE) video wmi nvidia(POE)
[ 1010.514320] CR2: 000000000000022c
[ 1010.515019] ---[ end trace 0000000000000000 ]---
[ 1010.515715] RIP: 0010:_nv002252kms+0xa9/0x1c0 [nvidia_modeset]
[ 1010.516434] Code: c6 43 0c 00 89 43 08 49 8b 85 58 0a 00 00 48 8b 80 98 01 00 00 48 85 c0 74 a7 4c 89 ef e8 af d5 fe f0 b9 04 00 00 00 48 63 f1 <8b> 7c f0 08 85 ff 74 5f 48 8d 04 f0 0f b7 70 08 66 89 73 04 0f b7
[ 1010.517873] RSP: 0018:ffffaf9ec12fb5a0 EFLAGS: 00010202
[ 1010.518597] RAX: 0000000000000204 RBX: ffff90f1d6074188 RCX: 0000000000000004
[ 1010.519321] RDX: 0000000001e00280 RSI: 0000000000000004 RDI: ffff90f1d632e008
[ 1010.520038] RBP: ffffaf9ec12fb5e0 R08: 0000000000000400 R09: 0000000000000400
[ 1010.520759] R10: 0000000000000000 R11: 000000000000005f R12: ffff90f1d632e698
[ 1010.521472] R13: ffff90f1d632e008 R14: ffff90f1d6074170 R15: ffff90f1d632e7d8
[ 1010.522181] FS:  0000000000000000(0000) GS:ffff9100af040000(0000) knlGS:0000000000000000
[ 1010.522891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1010.523592] CR2: 000000000000022c CR3: 000000010f872000 CR4: 0000000000350ee0
[ 1010.524292] note: nvidia-modeset/[285] exited with irqs disabled
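The faulting address can be cross-checked against the register dump. The sketch below decodes only the four instruction bytes starting at the `<8b>` marker in the Code: line; it is a hand-rolled decoder for this single encoding (not a general x86 disassembler), and the function name is hypothetical.

```python
# Register values copied from the oops above.
RAX = 0x204
RSI = 0x4
CR2 = 0x22C

# Bytes at the faulting RIP: the <8b> marker and the three bytes after it.
INSN = bytes.fromhex("8b7cf008")

# 64-bit register names by encoding number (no REX prefix precedes 0x8b here).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_sib_disp8(insn):
    """Decode 'mov r32, [base + index*scale + disp8]' for this one encoding."""
    opcode, modrm, sib, disp = insn
    if opcode != 0x8B:
        raise ValueError("expected MOV r32, r/m32 (opcode 0x8b)")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm != 0b100:
        raise ValueError("expected ModRM with disp8 and a SIB byte")
    scale = 1 << (sib >> 6)
    index = (sib >> 3) & 7
    base = sib & 7
    if disp >= 0x80:  # disp8 is a signed byte
        disp -= 0x100
    return reg, base, index, scale, disp


reg, base, index, scale, disp = decode_mov_sib_disp8(INSN)
print(f"mov r32#{reg}, [{REGS[base]} + {REGS[index]}*{scale} + {disp:#x}]")

# Effective address using the register values from the dump.
effective = RAX + RSI * scale + disp
print(hex(effective))
assert effective == CR2
```

The decoded load is a 32-bit read from `rax + rsi*8 + 8`, and with RAX = 0x204 and RSI = 4 that is exactly the reported CR2 of 0x22c. So RAX holds a small value rather than a valid kernel pointer at that point. Why it does is not visible from the log; the only hint is the "Unable to read EDID" warning logged immediately before.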

Please also find attached the Xorg.0.log, which has a lot of useful information: Xorg.0.log

NVIDIA drivers installed from the Arch Linux repositories:

xorg.conf.d > pacman -Qi nvidia
Name            : nvidia-dkms
Version         : 530.41.03-1
Description     : NVIDIA drivers - module sources
Architecture    : x86_64
URL             : http://www.nvidia.com/
Licenses        : custom
Groups          : None
Provides        : NVIDIA-MODULE  nvidia
Depends On      : dkms  nvidia-utils=530.41.03  libglvnd
Optional Deps   : None
Required By     : None
Optional For    : None
Conflicts With  : NVIDIA-MODULE  nvidia
Replaces        : None
Installed Size  : 67.94 MiB
Packager        : Sven-Hendrik Haase <svenstaro@gmail.com>
Build Date      : Thu 23 Mar 2023 11:50:05 AM PDT
Install Date    : Sat 01 Apr 2023 10:03:48 AM PDT
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

Incidentally, 8K 60 Hz and 4K 120 Hz are both working fine in Windows (it is a dual boot machine).

Prior to upgrading nvidia drivers, 8K 30 Hz and 4K 120 Hz were working fine in Linux with the previous version (525.89.02).

To Reproduce

  1. Connect a PC with an NVIDIA GeForce RTX 3090 to a Samsung QN800A display using a high-quality certified HDMI 2.1 cable.
  2. Install Arch Linux with NVIDIA driver version 530.41.03.
  3. Set the Samsung QN800A to "Gaming" mode and verify that 8K 60 Hz and 4K 120 Hz both work in Windows or on another machine.
  4. Place 20-nvidia-settings.conf (attached as 20-nvidia-settings.txt) in /usr/share/X11/xorg.conf.d.
  5. Start the X server with startx. The screen stays blank.

Bug Incidence

Always

nvidia-bug-report.log.gz

Since nvidia-bug-report.sh hangs, I had to run nvidia-bug-report.sh --safe-mode

nvidia-bug-report.log.gz

More Info

Downgrading back to 525.89.02 fixes the issue.

dfadev commented 1 year ago

QN800C and QN700B also have this issue. Turning off "game mode" and making sure "Input Signal Plus" is on should allow 8k@60 and 4k@119.88 to sync, although they will then fail to recover from screen saver blanking.

dllu commented 1 year ago

That is a good hint. After turning off Game Mode, it worked, but after rebooting my computer, it stopped working again. I will debug a little further...

dfadev commented 11 months ago

535.54.03 seems to be better at syncing 8k@60 and 4k@120 on Samsung TVs but still has issues with DPMS.

This reproduces every time for me on a dual TV setup:

  1. startx
  2. TVs sync (8K@60)
  3. hold power button on samsung remote to reboot TVs
  4. TVs reboot, fail to sync (No device connected)
  5. ALT-CTRL-F2 to switch to text console
  6. TVs sync (4k@60)
  7. ALT-F1 to switch back to X11
  8. screen blanks, kernel hard locks, hard reset cycle required (No device connected displayed on TV)

Also, forcing DPMS off (xset dpms force off) and waiting some time before waking will sometimes lock the system, requiring a hard reset.

mcsygit commented 9 months ago

Might have a +1, might be different. RTX 4090, LG C2 TV connected via HDMI 2.1. The 4k120Hz signal shows up only randomly. The problem occurs only on Linux, but on both Wayland and X11. I have tried unplugging the eARC soundbar and turning off VRR; no difference. 4k60Hz always works, but nothing above it works consistently. I also tried changing the HDMI 2.1 cable.

A Samsung LS28AG700N as secondary monitor works regardless using DisplayPort (4k144Hz IPS), but I also tried using just the C2. The C2 would lose VRR support if an active DP adapter were used.

I would blame TV itself if it didn't work just fine on Windows.

Tried drivers from 525 to latest 535. No difference.

I know this is the open source driver repo, but something like this goes hand in hand with the proprietary driver, right? I only tried the proprietary driver.

EDIT: I think turning on the new nvidia-modeset.hdmi_deepcolor=1 on the 545 beta driver fixed my issue for good
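To check whether that option was actually passed at boot, one approach is to look for it on the kernel command line. The sketch below is a minimal illustration: the function name and the sample command line are made up, and it only covers the command-line route (an option set through modprobe configuration would not appear there). The kernel treats dashes and underscores in parameter names as equivalent, so the lookup normalizes both.

```python
def module_param_from_cmdline(cmdline, module, param):
    """Return the value of module.param on a kernel command line, or None.

    Dashes and underscores are treated as equivalent, as the kernel does
    when matching parameter names. If the parameter appears more than once,
    the last occurrence wins.
    """
    def norm(name):
        return name.replace("-", "_")

    wanted = norm(f"{module}.{param}")
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if norm(key) == wanted:
            value = val if sep else ""
    return value


# Hypothetical command line for illustration; on a real system, read it
# from /proc/cmdline instead.
sample = "root=/dev/nvme0n1p2 rw quiet nvidia-modeset.hdmi_deepcolor=1"
print(module_param_from_cmdline(sample, "nvidia_modeset", "hdmi_deepcolor"))
```

On a running system, the same function can be applied to the contents of /proc/cmdline.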

ZombieLurker commented 3 days ago

This issue still happens with driver 550.90.07. All monitors and the Samsung 4K 120Hz TV lose signal after logging into Plasma 6.1 Wayland on kernel 6.9.6-arch1-1. Unplugging the HDMI cable from the TV immediately brings the signal back to the other two monitors. What is weird is that I didn't have this issue for a couple of weeks, but after updating to KDE Plasma 6.1 and rebooting, it happened again.