NixOS / nixpkgs

amdgpu: black screen after resume from suspend #170429

Open davidak opened 2 years ago

davidak commented 2 years ago

Describe the bug

I was able to connect via SSH to get the dmesg output, but now not even htop starts. The kernel seems to have problems.

dmesg: https://gist.github.com/davidak/a98e70b4de9d20af153177a6867038d3

The issue happened at 16:00. I had suspended and resumed successfully before that.

Steps To Reproduce

Steps to reproduce the behavior:

  1. suspend
  2. resume
  3. login
  4. black screen

Expected behavior

usable computer

Screenshots

imagine a complete black image

Additional context

It seems that the problem might be caused by too many Chrome tabs.

maybe related to https://github.com/NixOS/nixpkgs/issues/119843

Notify maintainers

Metadata

Hardware:

  - Intel Core i9 9900K
  - AMD Radeon RX 6600 XT
  - Gigabyte Z390 UD Mainboard

davidak commented 2 years ago

I had this situation again. I had around 900 Chrome tabs open and RAM was 91% full at suspend. The machine did not come back. The kernel hangs when trying to resume.
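To put a number on "RAM was 91% full" before the next suspend, something like the following could be run beforehand. It is only a sketch: the function names and the 85% threshold are made up for illustration, it treats "used" as MemTotal minus MemAvailable from /proc/meminfo (which may not be how the 91% above was measured), and it is a diagnostic aid, not a fix for the driver.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # kB for the memory fields
    return fields


def used_percent(fields: dict[str, int]) -> float:
    """Percentage of RAM that the kernel does not report as available."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def safe_to_suspend(fields: dict[str, int], threshold_percent: float = 85.0) -> bool:
    """Hypothetical guard: True if usage is below the (arbitrary) threshold."""
    return used_percent(fields) < threshold_percent


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    """Read and parse the live values on a Linux system."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling `safe_to_suspend(read_meminfo())` before suspending would at least show whether the hang correlates with high memory usage.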

NixOS 22.05.2711.294ef54a1e8

Sep 06 14:35:01 gaming kernel: Freezing user space processes ... (elapsed 0.578 seconds) done.
Sep 06 14:35:01 gaming kernel: OOM killer disabled.
Sep 06 14:35:01 gaming kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Sep 06 14:35:01 gaming kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Sep 06 14:35:01 gaming kernel: serial 00:01: disabled
Sep 06 14:35:01 gaming kernel: sd 4:0:0:0: [sda] Synchronizing SCSI cache
Sep 06 14:35:01 gaming kernel: sd 4:0:0:0: [sda] Stopping disk
Sep 06 14:35:01 gaming kernel: kworker/u32:11: page allocation failure: order:0, mode:0x100c02(GFP_NOIO|__GFP_HIGHMEM|__GFP_HARDWALL), nodemask=(null),cpuset=/,mems_allowed=0
Sep 06 14:35:01 gaming kernel: CPU: 15 PID: 677545 Comm: kworker/u32:11 Not tainted 5.15.62 #1-NixOS
Sep 06 14:35:01 gaming kernel: Hardware name: Gigabyte Technology Co., Ltd. Z390 UD/Z390 UD, BIOS F10 11/05/2021
Sep 06 14:35:01 gaming kernel: Workqueue: events_unbound async_run_entry_fn
Sep 06 14:35:01 gaming kernel: Call Trace:
Sep 06 14:35:01 gaming kernel:  <TASK>
Sep 06 14:35:01 gaming kernel:  dump_stack_lvl+0x46/0x5e
Sep 06 14:35:01 gaming kernel:  warn_alloc+0x138/0x160
Sep 06 14:35:01 gaming kernel:  __alloc_pages_slowpath.constprop.0+0xc89/0xcb0
Sep 06 14:35:01 gaming kernel:  __alloc_pages+0x1e9/0x220
Sep 06 14:35:01 gaming kernel:  ttm_pool_alloc+0x24a/0x5a0 [ttm]
Sep 06 14:35:01 gaming kernel:  ? __vmalloc_node+0x44/0x70
Sep 06 14:35:01 gaming kernel:  ttm_tt_populate+0xa5/0x190 [ttm]
Sep 06 14:35:01 gaming kernel:  ttm_bo_handle_move_mem+0x147/0x190 [ttm]
Sep 06 14:35:01 gaming kernel:  ttm_mem_evict_first+0x272/0x4a0 [ttm]
Sep 06 14:35:01 gaming kernel:  ? pm_uninit+0x16/0x30 [amdgpu]
Sep 06 14:35:01 gaming kernel:  ttm_resource_manager_evict_all+0xa5/0x1c0 [ttm]
Sep 06 14:35:01 gaming kernel:  amdgpu_device_suspend+0x98/0x120 [amdgpu]
Sep 06 14:35:01 gaming kernel:  pci_pm_suspend+0x71/0x160
Sep 06 14:35:01 gaming kernel:  ? pci_pm_freeze+0xc0/0xc0
Sep 06 14:35:01 gaming kernel:  dpm_run_callback+0x47/0x120
Sep 06 14:35:01 gaming kernel:  __device_suspend+0x112/0x470
Sep 06 14:35:01 gaming kernel:  async_suspend+0x1b/0x90
Sep 06 14:35:01 gaming kernel:  async_run_entry_fn+0x2d/0x130
Sep 06 14:35:01 gaming kernel:  process_one_work+0x1ee/0x390
Sep 06 14:35:01 gaming kernel:  worker_thread+0x53/0x3e0
Sep 06 14:35:01 gaming kernel:  ? process_one_work+0x390/0x390
Sep 06 14:35:01 gaming kernel:  kthread+0x124/0x150
Sep 06 14:35:01 gaming kernel:  ? set_kthread_struct+0x50/0x50
Sep 06 14:35:01 gaming kernel:  ret_from_fork+0x1f/0x30
Sep 06 14:35:01 gaming kernel:  </TASK>
Sep 06 14:35:01 gaming kernel: Mem-Info:
Sep 06 14:35:01 gaming kernel: active_anon:0 inactive_anon:6494998 isolated_anon:32
                                active_file:99 inactive_file:143 isolated_file:0
                                unevictable:28 dirty:19 writeback:4
                                slab_reclaimable:75319 slab_unreclaimable:98521
                                mapped:4 shmem:1170367 pagetables:70003 bounce:0
                                kernel_misc_reclaimable:0
                                free:102115 free_pcp:258 free_cma:0
Sep 06 14:35:01 gaming kernel: Node 0 active_anon:0kB inactive_anon:25979992kB active_file:396kB inactive_file:572kB unevictable:112kB isolated(anon):128kB isolated(file):0kB mapped:16kB dirty:76kB writeback:16kB shmem:4681468kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:66288kB pagetables:280012kB all_unreclaimable? no
Sep 06 14:35:01 gaming kernel: Node 0 DMA free:15360kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Sep 06 14:35:01 gaming kernel: lowmem_reserve[]: 0 871 31987 31987 31987
Sep 06 14:35:01 gaming kernel: Node 0 DMA32 free:133572kB min:7276kB low:8168kB high:9060kB reserved_highatomic:2048KB active_anon:0kB inactive_anon:599732kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:979792kB managed:914256kB mlocked:0kB bounce:0kB free_pcp:224kB local_pcp:224kB free_cma:0kB
Sep 06 14:35:01 gaming kernel: lowmem_reserve[]: 0 0 31115 31115 31115
Sep 06 14:35:01 gaming kernel: Node 0 Normal free:259528kB min:259848kB low:291708kB high:323568kB reserved_highatomic:0KB active_anon:0kB inactive_anon:25380256kB active_file:396kB inactive_file:572kB unevictable:112kB writepending:92kB present:32473088kB managed:31867292kB mlocked:112kB bounce:0kB free_pcp:808kB local_pcp:560kB free_cma:0kB
Sep 06 14:35:01 gaming kernel: lowmem_reserve[]: 0 0 0 0 0
Sep 06 14:35:01 gaming kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
Sep 06 14:35:01 gaming kernel: Node 0 DMA32: 7805*4kB (UME) 2040*8kB (UME) 517*16kB (ME) 294*32kB (UME) 192*64kB (UME) 110*128kB (UME) 68*256kB (UME) 46*512kB (UME) 1*1024kB (M) 0*2048kB 0*4096kB = 133572kB
Sep 06 14:35:01 gaming kernel: Node 0 Normal: 64882*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 259528kB
Sep 06 14:35:01 gaming kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Sep 06 14:35:01 gaming kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 06 14:35:01 gaming kernel: 1170610 total pagecache pages
Sep 06 14:35:01 gaming kernel: 0 pages in swap cache
Sep 06 14:35:01 gaming kernel: Swap cache stats: add 868314, delete 868131, find 50778/60277
Sep 06 14:35:01 gaming kernel: Free swap  = 30262216kB
Sep 06 14:35:01 gaming kernel: Total swap = 33554424kB
Sep 06 14:35:01 gaming kernel: 8367216 pages RAM
Sep 06 14:35:01 gaming kernel: 0 pages HighMem/MovableOnly
Sep 06 14:35:01 gaming kernel: 167989 pages reserved
Sep 06 14:35:01 gaming kernel: 0 pages cma reserved
Sep 06 14:35:01 gaming kernel: [drm] evicting device resources failed
Sep 06 14:35:01 gaming kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
Sep 06 14:35:01 gaming kernel: [drm] free PSP TMR buffer
Sep 06 14:35:01 gaming kernel: [TTM] Failed allocating page table
Sep 06 14:35:01 gaming kernel: [drm] evicting device resources failed
Sep 06 14:35:01 gaming kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Sep 06 14:35:01 gaming kernel: [drm] free PSP TMR buffer
Sep 06 14:35:01 gaming kernel: [TTM] Failed allocating page table
Sep 06 14:35:01 gaming kernel: [drm] evicting device resources failed
Sep 06 14:35:01 gaming kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Sep 06 14:35:01 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Sep 06 14:35:01 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
...
Sep 06 14:36:21 gaming kernel: [drm:dc_dmub_srv_cmd_queue [amdgpu]] *ERROR* Error queuing DMUB command: status=2
Sep 06 14:36:21 gaming kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
...
Sep 06 14:36:36 gaming kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Sep 06 14:36:36 gaming kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=482150, emitted seq=482152
Sep 06 14:36:36 gaming kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Sep 06 14:36:36 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_dpp_pg_control line:60
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_hubp_pg_control line:117
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_dpp_pg_control line:92
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_hubp_pg_control line:149
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_dpp_pg_control line:68
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_hubp_pg_control line:125
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_dpp_pg_control line:84
Sep 06 14:36:36 gaming kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn302_hubp_pg_control line:141
Sep 06 14:36:36 gaming kernel: [drm:dc_dmub_srv_cmd_queue [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Sep 06 14:36:36 gaming kernel: [drm:dc_dmub_srv_cmd_queue [amdgpu]] *ERROR* Error queuing DMUB command: status=2
Sep 06 14:36:36 gaming kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Sep 06 14:36:45 gaming kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Sep 06 14:36:45 gaming kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Sep 06 14:36:45 gaming kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Sep 06 14:36:45 gaming kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Sep 06 14:36:48 gaming kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
Sep 06 14:36:48 gaming kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable smu features.
Sep 06 14:36:48 gaming kernel: amdgpu 0000:03:00.0: amdgpu: Fail to disable dpm features!
Sep 06 14:36:48 gaming kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
Sep 06 14:36:48 gaming kernel: [drm] free PSP TMR buffer
Sep 06 14:36:50 gaming kernel: [drm] psp gfx command DESTROY_TMR(0x7) failed and response status is (0x0)
Sep 06 14:36:50 gaming kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to terminate tmr
Sep 06 14:36:50 gaming kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Sep 06 14:36:50 gaming kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Sep 06 14:36:50 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Sep 06 14:36:50 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU psp mode1 reset
Sep 06 14:36:51 gaming kernel: [drm] psp is not working correctly before mode1 reset!
Sep 06 14:36:51 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset failed
Sep 06 14:36:51 gaming kernel: amdgpu 0000:03:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:03:00.0
Sep 06 14:36:51 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
Sep 06 14:36:51 gaming kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
Sep 06 14:36:51 gaming kernel: [drm] VRAM is lost due to GPU reset!
Sep 06 14:36:51 gaming kernel: [drm] PSP is resuming...
Sep 06 14:36:51 gaming .gsd-power-wrap[1906]: Error setting property 'PowerSaveMode' on interface org.gnome.Mutter.DisplayConfig: Timeout was reached (g-io-error-quark, 24)
Sep 06 14:36:56 gaming kernel: [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* send training msg failed.
Sep 06 14:36:56 gaming kernel: [drm:psp_resume [amdgpu]] *ERROR* Failed to process memory training!
Sep 06 14:36:56 gaming kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Sep 06 14:36:56 gaming kernel: [drm] Skip scheduling IBs!
Sep 06 14:36:56 gaming kernel: [drm] Skip scheduling IBs!
Sep 06 14:36:56 gaming kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(1) failed
...
Sep 06 14:36:56 gaming kernel: [drm] Skip scheduling IBs!
Sep 06 14:36:56 gaming kernel: [drm] Skip scheduling IBs!
Sep 06 14:36:56 gaming kernel: [drm] Skip scheduling IBs!
...
Sep 06 14:49:21 gaming kernel: INFO: task X:cs0:1345 blocked for more than 737 seconds.
Sep 06 14:49:21 gaming kernel:       Not tainted 5.15.62 #1-NixOS
Sep 06 14:49:21 gaming kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 06 14:49:21 gaming kernel: task:X:cs0           state:D stack:    0 pid: 1345 ppid:  1315 flags:0x00004000
Sep 06 14:49:21 gaming kernel: Call Trace:
Sep 06 14:49:21 gaming kernel:  <TASK>
Sep 06 14:49:21 gaming kernel:  __schedule+0x2e1/0x1350
Sep 06 14:49:21 gaming kernel:  ? dma_fence_wait_any_timeout+0xee/0x270
Sep 06 14:49:21 gaming kernel:  ? smp_call_function_many_cond+0x72/0x2f0
Sep 06 14:49:21 gaming kernel:  schedule+0x5b/0xd0
Sep 06 14:49:21 gaming kernel:  schedule_timeout+0x104/0x140
Sep 06 14:49:21 gaming kernel:  ? dma_fence_free+0x20/0x20
Sep 06 14:49:21 gaming kernel:  ? dma_fence_add_callback+0x66/0xe0
Sep 06 14:49:21 gaming kernel:  dma_fence_wait_any_timeout+0x20f/0x270
Sep 06 14:49:21 gaming kernel:  amdgpu_sa_bo_new+0x464/0x530 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_ib_get+0x3f/0x90 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_job_alloc_with_ib+0x57/0x80 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_vm_sdma_update+0x1f7/0x270 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_vm_update_ptes+0x2b0/0x870 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_vm_bo_update_mapping+0x25d/0x4b0 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_vm_bo_update+0x2b6/0x5a0 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_vm_handle_moved+0x105/0x120 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_cs_ioctl+0x15eb/0x1eb0 [amdgpu]
Sep 06 14:49:21 gaming kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Sep 06 14:49:21 gaming kernel:  drm_ioctl_kernel+0xac/0x100 [drm]
Sep 06 14:49:21 gaming kernel:  drm_ioctl+0x21e/0x3d0 [drm]
Sep 06 14:49:21 gaming kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Sep 06 14:49:21 gaming kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Sep 06 14:49:21 gaming kernel:  __x64_sys_ioctl+0x87/0xc0
Sep 06 14:49:21 gaming kernel:  do_syscall_64+0x38/0x90
Sep 06 14:49:21 gaming kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Sep 06 14:49:21 gaming kernel: RIP: 0033:0x7fac282ade37
Sep 06 14:49:21 gaming kernel: RSP: 002b:00007fac1f045858 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 06 14:49:21 gaming kernel: RAX: ffffffffffffffda RBX: 00007fac1f0458c0 RCX: 00007fac282ade37
Sep 06 14:49:21 gaming kernel: RDX: 00007fac1f0458c0 RSI: 00000000c0186444 RDI: 0000000000000010
Sep 06 14:49:21 gaming kernel: RBP: 00000000c0186444 R08: 00007fac1f0459e0 R09: 00007fac1f045988
Sep 06 14:49:21 gaming kernel: R10: 0000000000bf8210 R11: 0000000000000246 R12: 0000000000b64990
Sep 06 14:49:21 gaming kernel: R13: 0000000000000010 R14: 0000000000c05a60 R15: 0000000000c05ab8
Sep 06 14:49:21 gaming kernel:  </TASK>

The last block (the hung-task trace for X:cs0) kept repeating.
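One thing that stands out in the Mem-Info dump above is that the Normal zone's free memory is below its "min" watermark at the moment the TTM allocation fails. The sketch below just does that arithmetic on the zone lines from the log (trimmed to the fields it needs); the helper name is made up. My understanding is that the failing allocation's GFP_NOIO flag keeps the allocator from starting I/O, which would explain why the roughly 30 GB of free swap did not help, but treat that as an assumption. The sketch also ignores lowmem_reserve, which further protects the lower zones.

```python
import re

# Zone lines from the Mem-Info dump in the log, trimmed to the fields used here.
ZONE_LINES = [
    "Node 0 DMA free:15360kB min:28kB low:40kB high:52kB",
    "Node 0 DMA32 free:133572kB min:7276kB low:8168kB high:9060kB",
    "Node 0 Normal free:259528kB min:259848kB low:291708kB high:323568kB",
]

ZONE_RE = re.compile(r"Node \d+ (?P<zone>\w+) free:(?P<free>\d+)kB min:(?P<min>\d+)kB")


def zones_below_min(lines):
    """Return {zone name: shortfall in kB} for zones with free < min."""
    shortfalls = {}
    for line in lines:
        match = ZONE_RE.search(line)
        if match is None:
            continue  # not a per-zone line
        free_kb = int(match["free"])
        min_kb = int(match["min"])
        if free_kb < min_kb:
            shortfalls[match["zone"]] = min_kb - free_kb
    return shortfalls


PAGE_KB = 4  # base page size on x86-64, in kB

print(zones_below_min(ZONE_LINES))  # {'Normal': 320}
# Cross-check of the page counts: inactive_anon:6494998 pages in kB
print(6494998 * PAGE_KB)  # 25979992, matching inactive_anon:25979992kB
```

Because the regex uses `search`, the same helper also works on the full journal lines with their timestamp prefix.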

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/black-screen-when-resuming-from-suspend/10299/2

davidak commented 6 months ago

I still have similar issues with kernel 6.7.6.

Upstream issue: https://gitlab.freedesktop.org/drm/amd/-/issues/3208