NixOS / nixpkgs

Nix Packages collection & NixOS

amdgpu: drm:amdgpu_device_ip_resume_phase2 fails on suspend/resume #287586

Open pfzetto opened 8 months ago

pfzetto commented 8 months ago

Describe the bug

Hello, when resuming my PC from suspend, I get a black screen or the last content of the screen, and nothing else (no response to user input, no network, nothing logged to rsyslog). I concluded that the kernel might be crashing on resume and started investigating. First I tried the steps described in https://wiki.ubuntu.com/DebuggingKernelSuspend and https://www.kernel.org/doc/html/latest/power/basic-pm-debugging.html, but couldn't compile the kernel with PM_TRACE. Looking for other options, I tried suspend-to-idle, which produced the same error (black screen) but this time logged to rsyslog:

[ 9. Feb 22:12] PM: suspend entry (s2idle)
[  +0,017653] Filesystems sync: 0.017 seconds
[  +0,007187] Freezing user space processes
[  +0,000922] Freezing user space processes completed (elapsed 0.000 seconds)
[  +0,000003] OOM killer disabled.
[  +0,000001] Freezing remaining freezable tasks
[  +0,000982] Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
[  +0,000001] printk: Suspending console(s) (use no_console_suspend to debug)
[  +0,091369] serial 00:04: disabled
[ +22,132849] xhci_hcd 0000:02:00.0: xHC error in resume, USBSTS 0x401, Reinit
[  +0,000006] usb usb1: root hub lost power or was reset
[  +0,000002] usb usb2: root hub lost power or was reset
[  +0,000494] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[  +0,000028] [drm] PSP is resuming...
[  +0,000234] serial 00:04: activated
[  +0,033313] [drm] reserve 0xa00000 from 0x81fd000000 for PSP TMR
[  +0,025821] nvme nvme0: Shutdown timeout set to 8 seconds
[  +0,027229] nvme nvme0: 32/0/0 default/read/poll queues
[  +0,069785] amdgpu 0000:2d:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  +0,021269] amdgpu 0000:2d:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  +0,000002] amdgpu 0000:2d:00.0: amdgpu: SMU is resuming...
[  +0,000004] amdgpu 0000:2d:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b2f00 (59.47.0)
[  +0,000003] amdgpu 0000:2d:00.0: amdgpu: SMU driver if version not matched
[  +0,000049] amdgpu 0000:2d:00.0: amdgpu: use vbios provided pptable
[  +0,129134] ata2: SATA link down (SStatus 0 SControl 330)
[  +0,000357] ata1: SATA link down (SStatus 0 SControl 330)
[  +0,000026] ata5: SATA link down (SStatus 0 SControl 330)
[  +0,000021] ata4: SATA link down (SStatus 0 SControl 330)
[  +0,000071] ata6: SATA link down (SStatus 0 SControl 330)
[  +0,000021] ata3: SATA link down (SStatus 0 SControl 330)
[  +0,070327] usb 1-2: reset high-speed USB device number 2 using xhci_hcd
[  +0,138429] bnx2x 0000:04:00.1 enp4s0f1: using MSI-X  IRQs: sp 105  fp[0] 107 ... fp[3] 110
[  +0,233551] usb 1-5: reset full-speed USB device number 4 using xhci_hcd
[  +0,479041] usb 1-10: reset full-speed USB device number 7 using xhci_hcd
[  +0,415393] bnx2x 0000:04:00.0 enp4s0f0: using MSI-X  IRQs: sp 99  fp[0] 101 ... fp[3] 104
[  +0,031583] usb 1-7: reset full-speed USB device number 6 using xhci_hcd
[  +0,447019] usb 1-6: reset full-speed USB device number 5 using xhci_hcd
[  +0,447999] usb 1-3: reset full-speed USB device number 3 using xhci_hcd
[  +0,520898] logitech-hidpp-device 0003:046D:408A.0009: Disconnected
[  +0,826008] logitech-hidpp-device 0003:046D:408A.0009: HID++ 4.5 device connected.
[  +1,283936] amdgpu 0000:2d:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000036 SMN_C2PMSG_82:0x00000000
[  +0,000002] amdgpu 0000:2d:00.0: amdgpu: RunDcBtc failed!
[  +0,000001] amdgpu 0000:2d:00.0: amdgpu: Failed to setup smc hw!
[  +0,000001] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
[  +0,000140] amdgpu 0000:2d:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
[  +0,000001] amdgpu 0000:2d:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -62
[  +0,000005] amdgpu 0000:2d:00.0: PM: failed to resume async: error -62
[  +0,002740] OOM killer enabled.
[  +0,000002] Restarting tasks ... done.
[  +0,000665] random: crng reseeded on system resumption
[  +0,000005] PM: suspend exit
[  +0,001472] Bluetooth: hci0: CSR: Setting up dongle with HCI ver=6 rev=22bb
[  +0,000002] Bluetooth: hci0: LMP ver=6 subver=22bb; manufacturer=10
[  +0,009963] snd_hda_intel 0000:2d:00.1: Refused to change power state from D0 to D3hot
[  +0,221051] Bluetooth: MGMT ver 1.22
[  +1,490063] bnx2x 0000:04:00.1 enp4s0f1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[  +8,734235] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=1324, emitted seq=1327
[  +0,000252] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[  +0,000160] amdgpu 0000:2d:00.0: amdgpu: GPU reset begin!
[  +0,000283] amdgpu 0000:2d:00.0: amdgpu: Failed to disallow df cstate
[  +0,263935] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  +0,000002] #PF: supervisor read access in kernel mode
[  +0,000002] #PF: error_code(0x0000) - not-present page
[  +0,000001] PGD 0 P4D 0 
[  +0,000003] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  +0,000002] CPU: 6 PID: 5776 Comm: kworker/u64:61 Not tainted 6.7.1 #1-NixOS
[  +0,000002] Hardware name: Micro-Star International Co., Ltd. MS-7C56/B550-A PRO (MS-7C56), BIOS A.50 01/15/2021
[  +0,000002] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[  +0,000007] RIP: 0010:dc_resource_state_copy_construct+0x27/0x180 [amdgpu]
[  +0,000147] Code: 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 56 41 55 41 54 49 89 f4 55 31 ed 53 48 8b 87 08 5b 00 00 48 89 fb 44 8b b6 48 b5 03 00 <48> 8b 00 48 8b 00 80 b8 7f 01 00 00 00 74 07 48 8b ae c0 aa 03 00
[  +0,000001] RSP: 0018:ffffb944839ffc08 EFLAGS: 00010246
[  +0,000002] RAX: 0000000000000000 RBX: ffffa3aef4580000 RCX: 0000000000000000
[  +0,000002] RDX: 0000000000034e10 RSI: ffffa3aeb2200000 RDI: ffffa3aef4580000
[  +0,000001] RBP: 0000000000000000 R08: 000000000003ae40 R09: 0000000000000006
[  +0,000001] R10: 0000000000000000 R11: ffffa3b5af37b310 R12: ffffa3aeb2200000
[  +0,000001] R13: ffffa3aebeb40000 R14: 0000000000000001 R15: 0000000000000000
[  +0,000002] FS:  0000000000000000(0000) GS:ffffa3b58e500000(0000) knlGS:0000000000000000
[  +0,000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0,000001] CR2: 0000000000000000 CR3: 000000012ee82000 CR4: 0000000000f50ef0
[  +0,000002] PKRU: 55555554
[  +0,000001] Call Trace:
[  +0,000003]  <TASK>
[  +0,000003]  ? __die+0x23/0x70
[  +0,000004]  ? page_fault_oops+0x17d/0x4b0
[  +0,000005]  ? exc_page_fault+0x6e/0x160
[  +0,000004]  ? asm_exc_page_fault+0x26/0x30
[  +0,000006]  ? dc_resource_state_copy_construct+0x27/0x180 [amdgpu]
[  +0,000140]  dm_suspend+0x131/0x1e0 [amdgpu]
[  +0,000166]  amdgpu_device_ip_suspend_phase1+0x71/0xe0 [amdgpu]
[  +0,000114]  amdgpu_device_ip_suspend+0x29/0x70 [amdgpu]
[  +0,000110]  amdgpu_device_pre_asic_reset+0xd3/0x2a0 [amdgpu]
[  +0,000111]  amdgpu_device_gpu_recover+0x438/0xda0 [amdgpu]
[  +0,000112]  amdgpu_job_timedout+0x186/0x270 [amdgpu]
[  +0,000147]  drm_sched_job_timedout+0x7a/0x110 [gpu_sched]
[  +0,000006]  process_one_work+0x176/0x340
[  +0,000003]  worker_thread+0x27b/0x3a0
[  +0,000003]  ? __pfx_worker_thread+0x10/0x10
[  +0,000002]  kthread+0xd7/0x100
[  +0,000002]  ? __pfx_kthread+0x10/0x10
[  +0,000003]  ret_from_fork+0x34/0x50
[  +0,000003]  ? __pfx_kthread+0x10/0x10
[  +0,000002]  ret_from_fork_asm+0x1b/0x30
[  +0,000005]  </TASK>
[  +0,000001] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_seq af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat cmac algif_hash algif_skcipher af_alg bnep cfg80211 8021q amdgpu nls_iso8859_1 nls_cp437 vfat fat snd_hda_codec_hdmi drm_exec amdxcp drm_buddy snd_usb_audio gpu_sched snd_hda_intel drm_suballoc_helper btusb snd_intel_dspcfg drm_ttm_helper snd_intel_sdw_acpi ttm btrtl snd_hda_codec snd_usbmidi_lib btintel snd_rawmidi drm_display_helper edac_mce_amd btbcm btmtk snd_seq_device edac_core intel_rapl_msr wmi_bmof snd_hda_core mc battery intel_rapl_common crc32_pclmul snd_hwdep polyval_clmulni bluetooth bnx2x drm_kms_helper polyval_generic snd_pcm sp5100_tco gf128mul ghash_clmulni_intel snd_timer agpgart watchdog ptp ecdh_generic i2c_algo_bit snd hid_logitech pps_core rapl ff_memless video rfkill soundcore acpi_cpufreq i2c_piix4 k10temp mdio ecc evdev mousedev joydev mac_hid
[  +0,000057]  hid_multitouch wmi tpm_crb tiny_power_button gpio_amdpt tpm_tis gpio_generic tpm_tis_core button xt_conntrack ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat nf_tables sch_fq_codel nfnetlink iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel atkbd libps2 serio vivaldi_fmap loop xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter veth tun tap macvlan bridge stp llc kvm_amd ccp kvm drm irqbypass fuse backlight efi_pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt aes_generic cbc encrypted_keys trusted asn1_encoder tee tpm rng_core hid_logitech_hidpp input_leds hid_logitech_dj led_class hid_generic usbhid hid ahci xhci_pci xhci_pci_renesas libahci firmware_class xhci_hcd libata nvme crc32c_intel sha512_ssse3 usbcore sha512_generic nvme_core sha256_ssse3
[  +0,000063]  sha1_ssse3 scsi_mod aesni_intel t10_pi libaes crypto_simd cryptd crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_pclmul usb_common scsi_common crct10dif_common rtc_cmos dm_mod dax
[  +0,000013] CR2: 0000000000000000
[  +0,000002] ---[ end trace 0000000000000000 ]---
[  +0,175387] RIP: 0010:dc_resource_state_copy_construct+0x27/0x180 [amdgpu]
[  +0,000150] Code: 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 56 41 55 41 54 49 89 f4 55 31 ed 53 48 8b 87 08 5b 00 00 48 89 fb 44 8b b6 48 b5 03 00 <48> 8b 00 48 8b 00 80 b8 7f 01 00 00 00 74 07 48 8b ae c0 aa 03 00
[  +0,000001] RSP: 0018:ffffb944839ffc08 EFLAGS: 00010246
[  +0,000002] RAX: 0000000000000000 RBX: ffffa3aef4580000 RCX: 0000000000000000
[  +0,000002] RDX: 0000000000034e10 RSI: ffffa3aeb2200000 RDI: ffffa3aef4580000
[  +0,000001] RBP: 0000000000000000 R08: 000000000003ae40 R09: 0000000000000006
[  +0,000001] R10: 0000000000000000 R11: ffffa3b5af37b310 R12: ffffa3aeb2200000
[  +0,000001] R13: ffffa3aebeb40000 R14: 0000000000000001 R15: 0000000000000000
[  +0,000002] FS:  0000000000000000(0000) GS:ffffa3b58e500000(0000) knlGS:0000000000000000
[  +0,000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0,000002] CR2: 0000000000000000 CR3: 000000012ee82000 CR4: 0000000000f50ef0
[  +0,000001] PKRU: 55555554
[  +0,000001] note: kworker/u64:61[5776] exited with irqs disabled
[  +4,166459] amdgpu: Move buffer fallback to memcpy unavailable
[  +0,000000] amdgpu: Move buffer fallback to memcpy unavailable
[  +0,000010] [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
[  +0,000006] [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
[ 9. Feb 22:13] amdgpu: Move buffer fallback to memcpy unavailable
[  +0,000013] [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
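For anyone skimming a log like this, here is a minimal sketch of a filter that keeps only the amdgpu lines reporting a failure. The function name `amdgpu_failures` is made up for illustration, and the pattern is only a guess at what is useful; it is not an official tool. On Linux, error 62 is ETIME ("Timer expired"), which fits the SMU not answering in time.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print only
# the amdgpu lines that contain "*ERROR*" or "failed"/"Failed".
amdgpu_failures() {
    grep -E 'amdgpu.*(\*ERROR\*|[Ff]ailed)'
}
```

It could be used as `dmesg | amdgpu_failures`, or with `journalctl -k -b -1 | amdgpu_failures` to look at the previous boot.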

Steps To Reproduce

Steps to reproduce the behavior:

  1. echo freeze > /sys/power/state
  2. wake system up using keyboard
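Before step 1, it may be worth checking that the kernel offers the `freeze` state at all. The following is a minimal sketch; the function name `supports_sleep_state` and the optional file argument are assumptions added so the check can be tried against a copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if STATE is listed in the sleep-state file.
# Usage: supports_sleep_state STATE [FILE]   (FILE defaults to /sys/power/state)
supports_sleep_state() {
    state=$1
    file=${2:-/sys/power/state}
    # The file holds a space-separated list such as "freeze mem disk".
    for s in $(cat "$file"); do
        if [ "$s" = "$state" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `supports_sleep_state freeze && echo freeze > /sys/power/state` (run as root) only attempts the suspend when the state is listed.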

Expected behavior

I expect the PC to resume from suspend like nothing happened.

Additional context

 - Kernel: 6.7.1
 - Motherboard: MSI B550-A PRO
 - CPU: AMD Ryzen 7 5800X
 - GPU: AMD Radeon RX 6600XT

Notify maintainers

Metadata

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.7.1, NixOS, 23.11 (Tapir), 23.11.20240124.a77ab16`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.1`
 - channels(root): `"nixos-23.11"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`


pfzetto commented 7 months ago

As I have forgotten to ping the maintainers, I hope that these are the correct maintainers for the linux kernel:

alyssais commented 7 months ago

Looks like upstream is working on this: https://gitlab.freedesktop.org/drm/amd/-/issues/3208

And it also looks like you could help get it fixed by testing the suggested patch.

pfzetto commented 7 months ago

Looks like upstream is working on this: https://gitlab.freedesktop.org/drm/amd/-/issues/3208

And it also looks like you could help get it fixed by testing the suggested patch.

Thanks. I'm not sure this is the same issue (mine happens every time, even with just Hyprland running), but I will try to get that patch running.

NireBryce commented 7 months ago

Motherboard: MSI B550-A PRO

I think it's this B550 bug, which has a fix in-thread. I found this issue while trying to solve it this time around; I'd only fixed it on Manjaro previously, and it solved one of the two sleep issues.

I have a temporary fix for now. You have to disable GPP0 wakeup which is a GPP bridge to the NVMe drive in M.2 slot. Check your wakeup table using cat /proc/acpi/wakeup and look at GPP0. It should say enabled. Using sudo /bin/sh -c '/bin/echo GPP0 > /proc/acpi/wakeup' you can set it to disabled. PC should suspend normally then.

so the way to test this is

if not, repeat, with

Otherwise you may have to poke PXE0 (I think); it's mentioned in the thread, I believe.
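Since writing a device name to /proc/acpi/wakeup toggles it rather than setting it, it helps to read the current state before and after each attempt. Here is a minimal sketch of that check; the name `wakeup_state` and the optional file argument are made up for illustration, and it assumes the usual table layout where the third column reads `*enabled` or `*disabled`.

```shell
#!/bin/sh
# Hypothetical helper: print "enabled" or "disabled" for one wakeup device.
# Usage: wakeup_state DEVICE [TABLE]   (TABLE defaults to /proc/acpi/wakeup)
wakeup_state() {
    dev=$1
    table=${2:-/proc/acpi/wakeup}
    # Assumed columns: Device, S-state, Status, Sysfs node.
    # Strip the leading "*" from the status column before printing it.
    awk -v dev="$dev" '$1 == dev { sub(/^\*/, "", $3); print $3; exit }' "$table"
}
```

`wakeup_state GPP0` should then print `enabled` before the toggle and `disabled` after it; the same check applies to GPP8 or any other entry in the table.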

NireBryce commented 7 months ago

The way to make this permanent on boot is:

# in _suspend-bugfix.nix
{ pkgs, lib, ... }:
{
  # Writing a device name to /proc/acpi/wakeup toggles its wakeup state,
  # so each service flips its device once every time it is started.
  systemd.services.bugfixSuspend-GPP0 = {
    enable = true;
    description = "Fix immediate wakeup or 'zombie suspend crash' on suspend/hibernate";
    serviceConfig = {
      # Type and RemainAfterExit are [Service] options, so they go here
      # rather than in unitConfig.
      Type = "oneshot";
      User = "root";
      # The leading "-" tells systemd to ignore a non-zero exit status.
      ExecStart = "-${pkgs.bash}/bin/bash -c \"echo GPP0 > /proc/acpi/wakeup\"";
      RemainAfterExit = "yes";
    };
    wantedBy = [ "multi-user.target" ];
  };

  systemd.services.bugfixSuspend-GPP8 = {
    enable = true;
    description = "Fix immediate wakeup or 'zombie suspend crash' on suspend/hibernate";
    serviceConfig = {
      Type = "oneshot";
      User = "root";
      ExecStart = "-${pkgs.bash}/bin/bash -c \"echo GPP8 > /proc/acpi/wakeup\"";
      RemainAfterExit = "yes";
    };
    wantedBy = [ "multi-user.target" ];
  };
}

If you manually toggled them before, you'll need to make sure they're set to 'disabled' after the first time you run this, otherwise you will have to manually toggle GPP0 and GPP8 one more time if you don't reboot.

I also added

environment.systemPackages = with pkgs; [ zenstates ];
systemd.services.before-sleep = {
  description = "_BUGFIX-suspend (Ryzen disable c6 suspend)";
  wantedBy = [ "sleep.target" "hibernate.target" ];
  before = [ "sleep.target" ];
  serviceConfig.Type = "oneshot";
  # serviceConfig.ExecStart="${before-sleep}";
  # Use the store path: a bare command name may not be on the unit's PATH.
  serviceConfig.ExecStart = "${pkgs.zenstates}/bin/zenstates --c6-disable";
};

for belt-and-suspenders, but GPP0/GPP8 seem to be the issue with my B550 motherboard (Gigabyte B550M DS3H, though it seems to be a B550 family thing).

I'm very new to NixOS and there are probably better ways to do it. This one might toggle the devices back if you run nixos-rebuild switch; I'm going to bed and can't troubleshoot that right now.

for root cause / debugging purposes, you can bring it back from the zombie-suspend by flicking off the power switch on the PSU, playing chicken with volatile storage, and flicking it back on within 2-3 seconds, trying to get the cards to power down before you lose the sleep state in memory. Sometimes it doesn't work -- too soon and it hangs and you can try it again, too late and it just powers off or boots.

Edit: it's tomorrow, and I figured out (thanks to @toxicfrog's help) how to make it not toggle when you run nixos-rebuild switch.
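One way to avoid the unwanted toggle (a sketch only, and not necessarily the approach @toxicfrog suggested) is to write to the wakeup table only when the device is currently enabled, so running it again is a no-op. The name `disable_wakeup` and the optional file argument are assumptions for illustration, as is the `*enabled` / `*disabled` status column.

```shell
#!/bin/sh
# Hypothetical helper: disable wakeup for DEVICE only if it is enabled.
# Usage: disable_wakeup DEVICE [TABLE]   (TABLE defaults to /proc/acpi/wakeup)
disable_wakeup() {
    dev=$1
    table=${2:-/proc/acpi/wakeup}
    # Third column is assumed to be "*enabled" or "*disabled".
    state=$(awk -v dev="$dev" '$1 == dev { print $3; exit }' "$table")
    case $state in
        *enabled)
            # Writing the name toggles the device, here from enabled to disabled.
            printf '%s\n' "$dev" > "$table"
            ;;
    esac
}
```

Calling `disable_wakeup GPP0` and `disable_wakeup GPP8` as root from a oneshot service should leave both devices disabled, however many times the service is restarted.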

I'll try to write something for nixos-hardware this week if no one beats me to it.

pfzetto commented 7 months ago

Thanks, I've updated the BIOS to the latest version and tried your fixes. Sadly, they didn't help. The PC wakes up and displays the last image, but is completely frozen (even when just using the tty). I think I will just disable hibernation and suspend until I have the time to recompile the kernel with the amdgpu patch.