AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: WebUI will randomly stop responding to any action until I completely reinstall Ubuntu. #14128

Open Jmvars opened 7 months ago

Jmvars commented 7 months ago

What happened?

WebUI runs flawlessly for a while, then becomes completely unresponsive: embeddings and LoRAs stay stuck loading forever, and when I try to generate, nothing starts, nothing is printed in the terminal, and no progress bar shows up. I have not figured out exactly when it happens, but it seems to be after WebUI is restarted and/or the PC is rebooted. The only fix I have found is completely reinstalling the entire operating system.

Steps to reproduce the problem

  1. Install and run WebUI for the first time
  2. Generate cool images
  3. Shut down WebUI/restart WebUI/reboot PC
  4. Try to generate cool images
  5. It will fail

What should have happened?

WebUI should remain responsive after it is restarted or the machine is rebooted.

Sysinfo

sysinfo-2023-11-27-12-46.txt

What browsers do you use to access the UI ?

Mozilla Firefox

Console logs

python3 launch.py --no-half --precision full
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
Version: v1.6.0-2-g4afaaf8a
Commit hash: 4afaaf8a020c1df457bcf7250cb1c7f609699fa7
Launching Web UI with arguments: --no-half --precision full
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
[-] ADetailer initialized. version: 23.11.1, num models: 9
2023-11-27 13:51:47,062 - ControlNet - INFO - ControlNet v1.1.419
ControlNet preprocessor location: /home/jm/stable-diffusion-webui/extensions/sd-webui-controlnet/annotator/downloads
2023-11-27 13:51:47,119 - ControlNet - INFO - ControlNet v1.1.419
Loading weights [8c4042921a] from /home/jm/stable-diffusion-webui/models/Stable-diffusion/aZovyaRPGArtistTools_v3VAE.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Creating model from config: /home/jm/stable-diffusion-webui/configs/v1-inference.yaml
Startup time: 37.3s (prepare environment: 20.3s, import torch: 13.6s, import gradio: 0.4s, setup paths: 0.4s, other imports: 0.3s, load scripts: 1.5s, create ui: 0.3s, gradio launch: 0.5s).

#Yes it looks like a normal startup, because it starts up normally. The problem is nothing else is printed regardless of what I do in the WebUI.

Additional information

Ubuntu 22.04 AMD RX 7900 XT PyTorch 2.0.1+ROCm 5.7

How I install WebUI is a combination of this and this guide:

  1. Install the driver as per the AMD guide
  2. Install python3-pip and python3.10-venv
  3. git clone WebUI
  4. cd WebUI
  5. "python3 -m venv venv" inside the WebUI directory
  6. "source venv/bin/activate" inside the WebUI directory
  7. "pip3 install --upgrade pip wheel" inside the virtual environment
  8. Install PyTorch+Torchvision with ROCm 5.7 as per the AMD guide inside the virtual environment, using the radeon repo. I tried the official PyTorch command but that never worked.

I tried a whole bunch of installation guides and combinations of guides, and this is the only method I found that actually works with my GPU. As I said previously, the only fix I have found is to completely reinstall Ubuntu. I tried removing the WebUI folder and uninstalling the drivers. That should get rid of both PyTorch and Torchvision, and I confirmed this with pip outside of the virtual environment. I then tried rebooting, reinstalling the drivers, and reinstalling WebUI the exact same way, but it doesn't work; only reinstalling the entire OS works.
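
Editor's sketch: after an install like the one described above, one way to confirm that the PyTorch build inside the venv is a ROCm build and can see the GPU is a short Python check. This is a minimal sketch, not part of the original report. The helper name `rocm_status` is hypothetical. It only reports what PyTorch sees and does not diagnose the hang itself.

```python
def rocm_status() -> dict:
    """Report whether PyTorch is installed and whether it can see a GPU.

    rocm_status is a hypothetical helper name used for illustration only.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch_installed": False}

    status = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # Set to a version string on ROCm builds, None on other builds.
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm builds expose the GPU through the torch.cuda namespace.
        "gpu_visible": torch.cuda.is_available(),
    }
    if status["gpu_visible"]:
        status["device_name"] = torch.cuda.get_device_name(0)
    return status


if __name__ == "__main__":
    for key, value in rocm_status().items():
        print(f"{key}: {value}")
```

Run it with the venv activated (`source venv/bin/activate`) so it inspects the same PyTorch that WebUI uses. If `gpu_visible` is False on a system where generation previously worked, that points at the driver/ROCm layer rather than at WebUI.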

Jmvars commented 7 months ago

Update:

My graphics driver seemingly crashed mid-generation and now I have the issue again, so it is possibly graphics related.

stinus0 commented 7 months ago

Hiya, I've been dealing with this exact issue. If you run sudo dmesg in a console, can you check whether you have these errors (posted below)? I'm running the 7900xtx and semi-resolved this as follows: I installed ROCm (and drivers etc.) while running the Linux-6.2.0-37 kernel, then upgraded to the latest Ubuntu OEM kernel (6.5) by running sudo apt install linux-oem-22.04d. Keep in mind that this kernel version is not supported by ROCm, but if you have the same issue it'll improve stability by a lot. It's not a perfect solution (it's a bit of a franken-solution), but hopefully this helps.


[  161.688134] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xdffee000 flags=0x0020]
[  163.625128] amdgpu 0000:2f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:158 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[  163.625136] amdgpu 0000:2f:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[  163.625139] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xdc1eb000 flags=0x0000]
[  163.625140] amdgpu 0000:2f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3C
[  163.625142] amdgpu 0000:2f:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
[  163.625144] amdgpu 0000:2f:00.0: amdgpu:      MORE_FAULTS: 0x0
[  163.625146] amdgpu 0000:2f:00.0: amdgpu:      WALKER_ERROR: 0x6
[  163.625147] amdgpu 0000:2f:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  163.625150] amdgpu 0000:2f:00.0: amdgpu:      MAPPING_ERROR: 0x1
[  163.625152] amdgpu 0000:2f:00.0: amdgpu:      RW: 0x0

Jmvars commented 7 months ago

> Hiya, I've been dealing with this exact issue. If you run sudo dmesg in a console, can you check whether you have these errors (posted below)? I'm running the 7900xtx and semi-resolved this as follows: I installed ROCm (and drivers etc.) while running the Linux-6.2.0-37 kernel, then upgraded to the latest Ubuntu OEM kernel (6.5) by running sudo apt install linux-oem-22.04d. Keep in mind that this kernel version is not supported by ROCm, but if you have the same issue it'll improve stability by a lot. It's not a perfect solution (it's a bit of a franken-solution), but hopefully this helps.

Do I run this at any time, or only after the fault happens? Maybe a stupid question, I just want to make sure.

stinus0 commented 7 months ago

It shouldn't matter. Most of the time (if you share the error I had) the fault will show in your dmesg pretty much from the moment you start 1111.

EDIT: For your information, it does not change anything in your system; it just gives you a readout of a log.
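
Editor's sketch: the dmesg check discussed above can be scripted so that only the relevant lines are shown. This is a minimal sketch that assumes the fault signatures are the ones in the excerpt posted earlier in the thread (`IO_PAGE_FAULT`, `[gfxhub] page fault`, `GCVM_L2_PROTECTION_FAULT_STATUS`); other amdgpu failures may look different. The function names `find_amdgpu_faults` and `read_dmesg` are hypothetical. Like `dmesg` itself, it only reads the kernel log.

```python
import re
import subprocess

# Signatures taken from the dmesg excerpt posted in this thread (an assumption:
# other amdgpu faults may use different wording).
FAULT_PATTERN = re.compile(
    r"amdgpu.*(IO_PAGE_FAULT|\[gfxhub\] page fault|GCVM_L2_PROTECTION_FAULT_STATUS)"
)


def find_amdgpu_faults(log_text: str) -> list[str]:
    """Return the kernel log lines that match one of the fault signatures."""
    return [line for line in log_text.splitlines() if FAULT_PATTERN.search(line)]


def read_dmesg() -> str:
    """Read the kernel ring buffer; returns an empty string if dmesg cannot run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    faults = find_amdgpu_faults(read_dmesg())
    if faults:
        print(f"Found {len(faults)} amdgpu fault line(s):")
        for line in faults:
            print(line)
    else:
        print("No matching amdgpu fault lines found (or dmesg was not readable).")
```

On systems where reading the kernel log is restricted, the script needs to be run with sudo, the same as `sudo dmesg`.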

Jmvars commented 1 month ago

@hqnicolas Thanks, it'll have to wait as my drive died and I'm waiting on a replacement.

hqnicolas commented 1 month ago

> @hqnicolas Thanks, it'll have to wait as my drive died and I'm waiting on a replacement.

No way! ASRock?