are-we-gfx1100-yet / automatic

An opinionated SD.Next for Navi 3x
https://github.com/vladmandic/automatic
GNU Affero General Public License v3.0

[Issue]: ubuntu 22.04, amd rx7600, freeze after lock screen #2

Closed ConfusedMerlin closed 1 year ago

ConfusedMerlin commented 1 year ago

Issue Description

I discovered this by accident. This is a new Ubuntu installation, and I had not turned off the automatic "turn off screen after x min" power-saving option.

It has happened at least twice, in slightly different ways.

  1. The screen turned off, I turned it back on and tried to switch models: freeze.
  2. The screen turned off, I turned it back on while renders were still running, and they seemed stuck. I hit Ctrl+C to shut down the server: freeze.

I am certain that at least one other freeze was caused by the same general chain of events, but I didn't think too much of it, as I expected some strange stuff to happen.

When that happens, the GPU fans keep spinning, but I cannot move the mouse or use the keyboard in any way. Well, if something does happen, I cannot see it. I can't even switch to another TTY. Turning the monitor off and back on does not help either.

Version Platform Description

Python 3.10.12 on Linux (ubuntu 22.04.1, 6.2.0-26-generic, rocm 5.6, amd rx 7600)
Version: 05fc2094 Sat Aug 5 01:55:50 2023 +0800
Latest published version:
88fff06c9e5ac775c7945362a6212c36a36096f5
2023-08-13T10:58:02Z
AMD ROCm toolkit detected

Relevant log output

The console log from the freeze is no longer available.

The last lines of sdnext.log before the freeze are below (note the gap in the timestamps: that is the freeze, the reboot and so on; in this case the freeze happened after "Exiting" was printed):

2023-08-13 19:11:09,265 | ultralytics | INFO | predictor | 
2023-08-13 19:11:09,292 | ultralytics | INFO | predictor | 0: 640x640 1 face, 5.9ms
2023-08-13 19:11:09,293 | ultralytics | INFO | predictor | Speed: 8.4ms preprocess, 5.9ms inference, 1.5ms postprocess per image at shape (1, 3, 640, 640)
2023-08-13 19:18:55,224 | sd | INFO | webui | Exiting
2023-08-13 19:24:40,746 | sd | INFO | launch | Starting SD.Next

The syslog from that time seems more useful:

Aug 13 19:18:12 rkai gnome-shell[2018]: DING: Detected async api for thumbnails
Aug 13 19:18:13 rkai gnome-shell[2018]: DING: GNOME nautilus 42.6
Aug 13 19:18:22 rkai nautilus[19781]: Could not delete '.meta.isrunning': Datei oder Verzeichnis nicht gefunden
Aug 13 19:18:40 rkai systemd[1]: fprintd.service: Deactivated successfully.
Aug 13 19:18:50 rkai kernel: [ 7116.211449] perf: interrupt took too long (3147 > 3131), lowering kernel.perf_event_max_sample_rate to 63500
Aug 13 19:19:13 rkai kernel: [ 7139.800386] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Aug 13 19:19:13 rkai kernel: [ 7139.800802] amdgpu: failed to remove hardware queue from MES, doorbell=0x1000
Aug 13 19:19:13 rkai kernel: [ 7139.800804] amdgpu: MES might be in unrecoverable state, issue a GPU reset
Aug 13 19:19:13 rkai kernel: [ 7139.800807] amdgpu: Failed to remove queue 0
Aug 13 19:19:14 rkai kernel: [ 7140.087830] amdgpu 0000:44:00.0: amdgpu: GPU reset begin!
Aug 13 19:19:14 rkai kernel: [ 7140.094590] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow start
Aug 13 19:19:14 rkai kernel: [ 7140.094640] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow done
Aug 13 19:19:14 rkai kernel: [ 7140.205342] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:14 rkai kernel: [ 7140.205681] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:14 rkai kernel: [ 7140.314039] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:14 rkai kernel: [ 7140.314374] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:14 rkai kernel: [ 7140.423301] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:14 rkai kernel: [ 7140.423642] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:14 rkai kernel: [ 7140.532078] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:14 rkai kernel: [ 7140.532412] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:14 rkai kernel: [ 7140.641156] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:14 rkai kernel: [ 7140.641422] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:14 rkai kernel: [ 7140.749865] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:14 rkai kernel: [ 7140.750089] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:14 rkai kernel: [ 7140.858568] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:14 rkai kernel: [ 7140.858936] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:14 rkai vivaldi-stable.desktop[3481]: [18608:18608:0813/191914.986852:ERROR:shared_context_state.cc(898)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
Aug 13 19:19:14 rkai vivaldi-stable.desktop[3481]: [18608:18608:0813/191914.987243:ERROR:gpu_service_impl.cc(1010)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
Aug 13 19:19:15 rkai vivaldi-stable.desktop[3481]: [3476:3476:0813/191915.004936:ERROR:gpu_process_host.cc(955)] GPU process exited unexpectedly: exit_code=8704
Aug 13 19:19:15 rkai kernel: [ 7140.967434] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7140.967789] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.076193] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
Aug 13 19:19:15 rkai kernel: [ 7141.076519] [drm:amdgpu_mes_add_hw_queue [amdgpu]] *ERROR* failed to add hardware queue to MES, doorbell=0x1000
Aug 13 19:19:15 rkai kernel: [ 7141.076755] [drm:amdgpu_mes_self_test [amdgpu]] *ERROR* failed to add ring
Aug 13 19:19:15 rkai kernel: [ 7141.185005] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.185299] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.293490] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.293762] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.402030] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.402305] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.510502] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.510764] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.619440] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
Aug 13 19:19:15 rkai kernel: [ 7141.619766] [drm:amdgpu_mes_add_hw_queue [amdgpu]] *ERROR* failed to add hardware queue to MES, doorbell=0x1000
Aug 13 19:19:15 rkai kernel: [ 7141.620000] [drm:amdgpu_mes_self_test [amdgpu]] *ERROR* failed to add ring
Aug 13 19:19:15 rkai kernel: [ 7141.725611] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.725948] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.728387] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.728657] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.834242] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.834588] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:15 rkai kernel: [ 7141.837144] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:15 rkai kernel: [ 7141.837491] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:16 rkai kernel: [ 7141.946028] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:16 rkai kernel: [ 7141.946357] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:16 rkai kernel: [ 7141.958238] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:16 rkai kernel: [ 7141.958579] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:16 rkai kernel: [ 7142.054637] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:16 rkai kernel: [ 7142.054909] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:16 rkai kernel: [ 7142.066771] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Aug 13 19:19:16 rkai kernel: [ 7142.067016] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Aug 13 19:19:16 rkai kernel: [ 7142.163467] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
Aug 13 19:19:16 rkai kernel: [ 7142.163779] [drm:amdgpu_mes_add_hw_queue [amdgpu]] *ERROR* failed to add hardware queue to MES, doorbell=0x1200
Aug 13 19:19:16 rkai kernel: [ 7142.164046] amd_iommu_report_page_fault: 7 callbacks suppressed
Aug 13 19:19:16 rkai kernel: [ 7142.164051] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.164060] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.164067] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.164075] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.164035] [drm:amdgpu_mes_self_test [amdgpu]] *ERROR* failed to add ring
Aug 13 19:19:16 rkai kernel: [ 7142.164572] amdgpu 0000:44:00.0: amdgpu: GPU reset(2) succeeded!
Aug 13 19:19:16 rkai kernel: [ 7142.170304] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.189262] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.199108] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.206087] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:16 rkai kernel: [ 7142.209629] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:17 rkai vivaldi-stable.desktop[3481]: [20018:20018:0813/191917.289610:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 1 times!
Aug 13 19:19:17 rkai vivaldi-stable.desktop[3481]: [20018:20018:0813/191917.298106:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 2 times!
Aug 13 19:19:17 rkai vivaldi-stable.desktop[3481]: [20018:20018:0813/191917.313557:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 3 times!
Aug 13 19:19:19 rkai kernel: [ 7145.639988] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:21 rkai kernel: [ 7147.497732] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:21 rkai kernel: [ 7147.497755] amdgpu 0000:44:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x2c7e4000 flags=0x0020]
Aug 13 19:19:23 rkai vivaldi-stable.desktop[3481]: [5900:1:0813/191923.893844:ERROR:command_buffer_proxy_impl.cc(128)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
Aug 13 19:19:32 rkai kernel: [ 7158.049702] amdgpu 0000:44:00.0: [drm] *ERROR* [CRTC:69:crtc-0] flip_done timed out
Aug 13 19:23:10 rkai systemd-modules-load[579]: Inserted module 'lp'


### Acknowledgements

- [X] I have read the above and searched for existing issues
- [X] I confirm that this is classified correctly and it's not an extension or diffusers-specific issue

evshiron commented 1 year ago

Thank you for submitting the issue.

Based on the kernel messages (from syslog rather than dmesg), it seems that the AMDGPU driver entered an error loop. Perhaps `sudo rocm-smi --gpureset -d 0` can recover from it, but I usually just reboot.

I am using Kubuntu 22.04, with a lock screen timeout set to 30 minutes, but I haven't encountered this issue. I'm not sure what the cause is. It could possibly be a driver problem.
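To make the "error loop" easier to see, here is a minimal sketch that tallies the amdgpu messages in a syslog excerpt. The function name `summarize_amdgpu_errors` and the labels are made up for illustration; the patterns are copied from the log lines quoted in this issue and are not an exhaustive list of what the driver can print.

```python
import re
from collections import Counter

# Substrings taken from the syslog excerpt in this issue. The labels are
# illustrative names only, not anything defined by the amdgpu driver.
PATTERNS = {
    "mes_no_response": re.compile(r"MES failed to response msg=\d+"),
    "gpu_reset_begin": re.compile(r"GPU reset begin!"),
    "gpu_reset_done": re.compile(r"GPU reset\(\d+\) succeeded!"),
    "io_page_fault": re.compile(r"IO_PAGE_FAULT"),
    "flip_done_timeout": re.compile(r"flip_done timed out"),
}


def summarize_amdgpu_errors(log_text: str) -> Counter:
    """Count how often each known amdgpu message appears in log_text."""
    counts = Counter()
    for line in log_text.splitlines():
        if "amdgpu" not in line:
            continue  # skip lines from other services
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Feeding it the contents of the syslog from around the freeze should show many `mes_no_response` hits between a single reset begin and reset end, which is the loop described above.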

ConfusedMerlin commented 1 year ago

All I can do is turn it off. Hard. No input works any more. I think I may be able to reproduce it on purpose by simply hitting Win+L to lock the screen, leaving it for a few minutes, then unlocking it. But I must admit that I am not keen on freezing my system deliberately.

evshiron commented 1 year ago

You should be able to SSH into it while the GPU freezes.

In the early days my RX 7900 XTX froze a lot too, and I connected to it from another device. But it's quite stable now.

ConfusedMerlin commented 1 year ago

I would need another device here to do so, which I don't have at the moment.

Speaking of freezes, it just did it again, about one minute after I sent that last comment. No locking, absolutely no error log anywhere, just a frozen screen and nothing to do but hard reset again. There is always something, somehow...

EDIT: I tried gpureset, just to see what to expect... the screen went black, then came back on, but the X session didn't want to start any more. sigh

evshiron commented 1 year ago

If there is an error with the AMDGPU driver, relevant information should be included in dmesg.
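After a hard reset, the dmesg buffer of the frozen session is gone, but the journal may still hold the kernel messages of the previous boot. Below is a minimal sketch of how one might pull them out; it assumes systemd's `journalctl` is available and that the journal is stored persistently, and there is no guarantee that the last messages before a freeze were flushed to disk. The helper names are hypothetical.

```python
import subprocess


def kernel_log_command(boot_offset: int = -1) -> list[str]:
    """Build a journalctl command for the kernel messages of an earlier boot.

    boot_offset=-1 selects the boot before the current one.
    """
    return ["journalctl", "--dmesg", f"--boot={boot_offset}", "--no-pager"]


def amdgpu_lines(log_text: str) -> list[str]:
    """Keep only the lines that mention the amdgpu driver."""
    return [line for line in log_text.splitlines() if "amdgpu" in line]


def read_previous_boot_amdgpu_log() -> list[str]:
    """Run journalctl and return the amdgpu-related kernel lines."""
    result = subprocess.run(
        kernel_log_command(), capture_output=True, text=True, check=False
    )
    return amdgpu_lines(result.stdout)
```

Calling `read_previous_boot_amdgpu_log()` right after rebooting from a freeze should return the driver messages of the frozen session, if any were written.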

In my case, the AMDGPU driver usually works fine, but it easily enters this error state when the VRAM is exhausted (for example, when two WebUIs are generating at the same time), and then it is difficult to recover without a reboot.
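To check whether VRAM is close to exhausted before starting a second job, one option is to read the counters that the amdgpu driver exposes in sysfs. This is only a sketch: the device directory (for example `/sys/class/drm/card0/device`) depends on the machine, the helper names are hypothetical, and the 0.9 threshold is an arbitrary choice rather than a known safe limit.

```python
from pathlib import Path


def read_vram_usage(device_dir: str) -> tuple[int, int]:
    """Return (used_bytes, total_bytes) from amdgpu's sysfs counters."""
    base = Path(device_dir)
    used = int((base / "mem_info_vram_used").read_text().strip())
    total = int((base / "mem_info_vram_total").read_text().strip())
    return used, total


def vram_is_nearly_full(used: int, total: int, threshold: float = 0.9) -> bool:
    """True when the used fraction reaches the (arbitrary) threshold."""
    return total > 0 and used / total >= threshold
```

A typical call would be `vram_is_nearly_full(*read_vram_usage("/sys/class/drm/card0/device"))`, with `card0` replaced by whichever card the RX 7600 or RX 7900 XTX shows up as.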

Here are some troubleshooting tips that might help you:

If it is an image generation problem, I suggest using Tiled VAE and setting the Decoder Tile Size to a smaller value (such as 64 or 96). This can save a lot of VRAM and reduce the load on the GPU.

If it freezes during idle times, I can only attribute it to a driver issue and cannot provide effective assistance.

You may try `export HSA_OVERRIDE_GFX_VERSION=11.0.2` and see if there is any difference.
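If you would rather not export the variable for the whole shell session, the same override can be applied to a single launch. The sketch below only builds the environment; the helper name `env_with_gfx_override` is hypothetical, and the launcher script in the usage note is an assumption about how the WebUI is started on your machine.

```python
import os
from typing import Optional


def env_with_gfx_override(
    version: str = "11.0.2", base: Optional[dict] = None
) -> dict:
    """Return a copy of the environment with HSA_OVERRIDE_GFX_VERSION set.

    The caller's mapping (or os.environ) is left unchanged.
    """
    env = dict(os.environ if base is None else base)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env
```

It could then be passed to something like `subprocess.run(["./webui.sh"], env=env_with_gfx_override())`, so that only that one process tree sees the override.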

ConfusedMerlin commented 1 year ago

So, I can neither reproduce it nor predict it... for now, I must close this, as I cannot provide any useful information. I mean, it actually froze once when the SD WebUI was idle in a background browser tab and I was looking through Civitai for LoRAs of Egyptian buildings. sigh

ConfusedMerlin commented 1 year ago

Cannot reproduce. It also does not seem to be caused by the WebUI itself; it only appears to become more likely while the WebUI is running.