AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Seg Fault with ROCM 7900 XT #14763

Open curvedinf opened 7 months ago

curvedinf commented 7 months ago

Checklist

What happened?

Fresh install on Ubuntu 22.04 goes well. However, when running the webui, shortly after the HTTP server starts up (and launches a browser window that successfully loads from the server), the server crashes with the following error. I have tested this with ROCm 5.7, 6.0, and 6.0.1. I am running text-generation-webui successfully on the ROCm device (so I think it's not an overall system config issue), and the device is detected properly. I previously had a 6700 XT installed that ran stable-diffusion-webui well, but the new 7900 XT does not.

Steps to reproduce the problem

  1. Run ./webui.sh

What should have happened?

WebUI should start up normally and load a model.

What browsers do you use to access the UI ?

No response

Sysinfo

sysinfo-2024-01-25-23-06.json

Console logs

$ ./webui.sh 

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on xxx user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Launching Web UI with arguments: 
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/chase/Projects/stable-diffusion-webui/styles.csv
Loading weights [aeb7e9e689] from /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/juggernautXL_v8Rundiffusion.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 6.0s (prepare environment: 1.7s, import torch: 1.4s, import gradio: 0.5s, setup paths: 1.1s, other imports: 0.3s, load scripts: 0.3s, create ui: 0.3s, gradio launch: 0.3s).
Creating model from config: /home/chase/Projects/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
Calculating sha256 for /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/realvisxlV30Turbo_v30TurboBakedvae.safetensors: cfab6aec061f4905db12c40dc43534a26b84d0a5c0085c428729fe36e3dc056c
Loading weights [cfab6aec06] from /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/realvisxlV30Turbo_v30TurboBakedvae.safetensors
Creating model from config: /home/chase/Projects/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
changing setting sd_model_checkpoint to realvisxlV30Turbo_v30TurboBakedvae.safetensors: RuntimeError
Traceback (most recent call last):
  File "/home/chase/Projects/stable-diffusion-webui/modules/options.py", line 146, in set
    option.onchange()
  File "/home/chase/Projects/stable-diffusion-webui/modules/call_queue.py", line 13, in f
    res = func(*args, **kwargs)
  File "/home/chase/Projects/stable-diffusion-webui/modules/initialize_util.py", line 174, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 783, in reload_model_weights
    load_model(checkpoint_info, already_loaded_state_dict=state_dict)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 658, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 375, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
    module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 219, in load_state_dict
    state_dict = {k: v.to(device="meta", dtype=v.dtype) for k, v in state_dict.items()}
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 219, in <dictcomp>
    state_dict = {k: v.to(device="meta", dtype=v.dtype) for k, v in state_dict.items()}
RuntimeError: dictionary changed size during iteration

./webui.sh: line 256:  5100 Segmentation fault      (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"

Additional information

No response
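The RuntimeError in the log above is CPython's generic guard against resizing a dict while iterating over it; the dict comprehension in sd_disable_initialization.py trips it when the state dict is mutated mid-iteration. A minimal standalone reproduction (toy dict, nothing to do with webui internals), plus the usual snapshot fix:

```python
# Minimal reproduction of the RuntimeError from the traceback above
# (hypothetical toy dict, unrelated to webui internals):
d = {"a": 1, "b": 2}
err = None
try:
    for k in d:
        d[k + "_copy"] = d[k]  # resizing the dict mid-iteration
except RuntimeError as e:
    err = str(e)
print(err)  # dictionary changed size during iteration

# The usual fix: iterate over a snapshot of the keys instead.
d2 = {"a": 1, "b": 2}
for k in list(d2):
    d2[k + "_copy"] = d2[k]
print(sorted(d2))  # ['a', 'a_copy', 'b', 'b_copy']
```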

ashirviskas commented 7 months ago

Your torch ROCm version seems to be quite old; try updating to at least 5.7.

I have a 7900 XTX, though I still haven't managed to get it working myself.

DGdev91 commented 7 months ago

First of all, use the latest ROCm version. I suggest using the official amdgpu installer tool and following the official instructions: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

Since you just changed your GPU, try deleting your venv folder so it downloads all the packages again. Also, make sure there isn't any customization in webui-user.sh (for example, the HSA_OVERRIDE flag; you don't need it anymore).

curvedinf commented 7 months ago

I solved this by updating the venv's torch and torchvision versions to the latest nightlies. I am also running the latest ROCm driver (6.0.2). There are multiple ways to update the versions, but the way I elected to do it was to edit my webui.sh script, replacing line 161 with the following:

        export TORCH_COMMAND="pip install torch==2.3.0.dev20240210+rocm6.0 torchvision==0.18.0.dev20240210+rocm6.0 --index-url https://download.pytorch.org/whl/nightly/rocm6.0"

Then I deleted the venv directory and ran webui.sh.
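Put together, the steps above look like this (the nightly pin is the one from the comment; the checkout path is an assumption, adjust to your install):

```shell
# From inside the stable-diffusion-webui checkout (path assumed):
cd ~/stable-diffusion-webui

# 1. Point the installer at the ROCm nightly wheels (replaces line 161 of
#    webui.sh; exporting it in webui-user.sh works as well).
export TORCH_COMMAND="pip install torch==2.3.0.dev20240210+rocm6.0 torchvision==0.18.0.dev20240210+rocm6.0 --index-url https://download.pytorch.org/whl/nightly/rocm6.0"

# 2. Force a clean reinstall of all Python packages.
rm -rf venv

# 3. Relaunch; the venv is recreated with the nightly torch build.
./webui.sh
```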

Just to reiterate: I have stable-diffusion-webui working with my 7900 XT with little effort. The maintainers should be able to fix this by updating only the install script.

If new torch or rocm versions become available, you can view the available torch versions on the torch pip index: https://download.pytorch.org/whl/nightly/rocm6.0

(You can also replace rocm6.0 in that URL with newer or older ROCm versions to match your driver version.)
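The URL pattern described above can be parameterized; `nightly_index_url` below is a hypothetical helper, just to illustrate how the index URL relates to the ROCm version string:

```python
def nightly_index_url(rocm_version: str) -> str:
    # The nightly wheel index follows this pattern (see the comment above);
    # rocm_version is the driver series, e.g. "5.7", "6.0".
    return f"https://download.pytorch.org/whl/nightly/rocm{rocm_version}"

print(nightly_index_url("6.0"))
# https://download.pytorch.org/whl/nightly/rocm6.0
```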

ronidee commented 5 months ago

I'm on RX 7800 XT, ROCm 6.0.2.60002-115~22.04, Ubuntu 23.10, torch 2.3.0.dev20240210+rocm6.0, torchvision 0.18.0.dev20240210+rocm6.0 and also got a seg fault. However, in my case the reason was that I set the gfx version environment variable to 10.3.0.

So using HSA_OVERRIDE_GFX_VERSION=11.0.0 instead got it working for me.

DGdev91 commented 5 months ago

> I'm on RX 7800 XT, ROCm 6.0.2.60002-115~22.04, Ubuntu 23.10, torch 2.3.0.dev20240210+rocm6.0, torchvision 0.18.0.dev20240210+rocm6.0 and also got a seg fault. However, in my case the reason was that I set the gfx version environment variable to 10.3.0.
>
> So using HSA_OVERRIDE_GFX_VERSION=11.0.0 instead got it working for me.

For the 7900 XT and 7900 XTX the HSA_OVERRIDE_GFX_VERSION flag isn't needed at all. Not sure about other 7000-series GPUs. You can try removing it and see if it still works.
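As a rough reference for which cards need the override (values compiled from this thread and community reports, not an official AMD table; `suggested_override` is a hypothetical helper):

```python
# GPU -> (LLVM gfx target, HSA_OVERRIDE_GFX_VERSION value to export, or None).
# Assumption: mapping gathered from this thread and community reports.
GFX_TARGETS = {
    "RX 6700 XT":  ("gfx1031", "10.3.0"),  # borrows the gfx1030 kernels
    "RX 7800 XT":  ("gfx1101", "11.0.0"),  # borrows the gfx1100 kernels
    "RX 7900 XT":  ("gfx1100", None),      # natively supported, no override
    "RX 7900 XTX": ("gfx1100", None),      # natively supported, no override
}

def suggested_override(card: str):
    """Return the HSA_OVERRIDE_GFX_VERSION value to export, or None if unneeded."""
    return GFX_TARGETS[card][1]

print(suggested_override("RX 7800 XT"))  # 11.0.0
```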

ronidee commented 5 months ago

Hey @DGdev91, thanks for your reply :-) I already tried that, as I read your previous comment as well, and it doesn't work. It causes the following error: RuntimeError: HIP error: invalid device function.

I only used 10.3.0 because it was often recommended and I didn't understand its meaning. As far as I understand now, 11.0.0 is the closest version to my card that's officially supported, right?

I still have a memory leak: after a couple of runs my 32GB of RAM is full, so I have to restart the program. But that's off-topic and I will search for related issues.

Update: I fixed the memory leak by omitting the --medvram flag. Now RAM usage stays constant and doesn't fill up over time.

DGdev91 commented 5 months ago

> Hey @DGdev91 thanks for your reply :-) I already tried as, as I read your previous comment as well and it doesn't work. It causes following error: RuntimeError: HIP error: invalid device function.
>
> I only used 10.3.0 because it was recommended often and I didn't understand it's meaning. As far as I understand now, the 11.0.0 is the closest version to my card that's officially supported, right?
>
> I still have a memory leak, after a couple of runs my 32GB RAM is full so I have to restart the program. But that's off-topic and I will search for related issues.

Well, good to know then. Yes, most likely in your case 11.0.0 is the closest supported version, so just keep it like that.

.... There's also a patch which was merged some weeks ago which should, in theory, make the default config build the tensile libs for many "not fully supported" archs, and make that flag unnecessary in the next ROCm release.

But for now, just keep that.

xangelix commented 1 week ago

Is anyone else getting errors that look like this on a 7900 XTX, or does anyone know how to deal with it?

glibc version is 2.40
Check TCMalloc: libtcmalloc_minimal.so.4
libtcmalloc_minimal.so.4 is linked with libc.so,execute LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
Python 3.10.14 (main, Sep  5 2024, 22:06:38) [GCC 14.2.1 20240904]
Version: v1.10.1
Commit hash: 82a973c04367123ae98bd9abdf80d9eda9b910e2
Installing torch and torchvision
Looking in indexes: https://download.pytorch.org/whl/rocm6.0
Collecting torch
  Downloading https://download.pytorch.org/whl/rocm6.0/torch-2.4.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (2363.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.4/2.4 GB 53.4 MB/s eta 0:00:00
Collecting torchvision
  Downloading https://download.pytorch.org/whl/rocm6.0/torchvision-0.19.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (65.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.8/65.8 MB 21.0 MB/s eta 0:00:00
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/rocm6.0/torchaudio-2.4.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 13.2 MB/s eta 0:00:00
Collecting filelock (from torch)
  Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting typing-extensions>=4.8.0 (from torch)
  Downloading https://download.pytorch.org/whl/typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting sympy (from torch)
  Downloading https://download.pytorch.org/whl/sympy-1.12-py3-none-any.whl (5.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 28.9 MB/s eta 0:00:00
Collecting networkx (from torch)
  Downloading https://download.pytorch.org/whl/networkx-3.2.1-py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 30.4 MB/s eta 0:00:00
Collecting jinja2 (from torch)
  Downloading https://download.pytorch.org/whl/Jinja2-3.1.3-py3-none-any.whl (133 kB)
Collecting fsspec (from torch)
  Downloading https://download.pytorch.org/whl/fsspec-2024.2.0-py3-none-any.whl (170 kB)
Collecting pytorch-triton-rocm==3.0.0 (from torch)
  Downloading https://download.pytorch.org/whl/pytorch_triton_rocm-3.0.0-cp310-cp310-linux_x86_64.whl (341.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.7/341.7 MB 58.9 MB/s eta 0:00:00
Collecting numpy (from torchvision)
  Downloading https://download.pytorch.org/whl/numpy-1.26.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 61.1 MB/s eta 0:00:00
Collecting pillow!=8.3.*,>=5.3.0 (from torchvision)
  Downloading https://download.pytorch.org/whl/pillow-10.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 56.8 MB/s eta 0:00:00
Collecting MarkupSafe>=2.0 (from jinja2->torch)
  Downloading https://download.pytorch.org/whl/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting mpmath>=0.19 (from sympy->torch)
  Downloading https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 248.3 MB/s eta 0:00:00
Installing collected packages: mpmath, typing-extensions, sympy, pillow, numpy, networkx, MarkupSafe, fsspec, filelock, pytorch-triton-rocm, jinja2, torch, torchvision, torchaudio
Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.2.0 jinja2-3.1.3 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.3 pillow-10.2.0 pytorch-triton-rocm-3.0.0 sympy-1.12 torch-2.4.1+rocm6.0 torchaudio-2.4.1+rocm6.0 torchvision-0.19.1+rocm6.0 typing-extensions-4.9.0
Installing clip
Installing open_clip
Cloning assets into /home/tux/stable-diffusion-webui/repositories/stable-diffusion-webui-assets...
Cloning into '/home/tux/stable-diffusion-webui/repositories/stable-diffusion-webui-assets'...
Cloning Stable Diffusion into /home/tux/stable-diffusion-webui/repositories/stable-diffusion-stability-ai...
Cloning into '/home/tux/stable-diffusion-webui/repositories/stable-diffusion-stability-ai'...
Cloning Stable Diffusion XL into /home/tux/stable-diffusion-webui/repositories/generative-models...
Cloning into '/home/tux/stable-diffusion-webui/repositories/generative-models'...
Cloning K-diffusion into /home/tux/stable-diffusion-webui/repositories/k-diffusion...
Cloning into '/home/tux/stable-diffusion-webui/repositories/k-diffusion'...
Cloning BLIP into /home/tux/stable-diffusion-webui/repositories/BLIP...
Cloning into '/home/tux/stable-diffusion-webui/repositories/BLIP'...
Installing requirements

---

[automatic] | glibc version is 2.40
[automatic] | Check TCMalloc: libtcmalloc_minimal.so.4
[automatic] | libtcmalloc_minimal.so.4 is linked with libc.so,execute LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
[automatic] | Python 3.10.14 (main, Sep  5 2024, 10:36:08) [GCC 14.2.1 20240805]
[automatic] | Version: v1.10.1
[automatic] | Commit hash: 82a973c04367123ae98bd9abdf80d9eda9b910e2
[automatic] | Launching Web UI with arguments: 
[automatic] | amdgpu.ids: No such file or directory
[automatic] | amdgpu.ids: No such file or directory
[automatic] | no module 'xformers'. Processing without...
[automatic] | no module 'xformers'. Processing without...
[automatic] | No module 'xformers'. Proceeding without it.
[automatic] | Calculating sha256 for /home/tux/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on local URL:  http://127.0.0.1:7860
[automatic] | 
[automatic] | To create a public link, set `share=True` in `launch()`.
[automatic] | Startup time: 7.2s (prepare environment: 2.5s, import torch: 2.1s, import gradio: 0.5s, setup paths: 0.8s, other imports: 0.5s, list SD models: 0.1s, load scripts: 0.2s, create ui: 0.3s).
[automatic] | 6ce0161689b3853acaa03779ec93eafe75a02f4ced659bee03f50797806fa2fa
[automatic] | Loading weights [6ce0161689] from /home/tux/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
[automatic] | Creating model from config: /home/tux/stable-diffusion-webui/configs/v1-inference.yaml
[automatic] | /home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[automatic] |   warnings.warn(
[automatic] | loading stable diffusion model: RuntimeError
[automatic] | Traceback (most recent call last):
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 973, in _bootstrap
[automatic] |     self._bootstrap_inner()
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[automatic] |     self.run()
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 953, in run
[automatic] |     self._target(*self._args, **self._kwargs)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/initialize.py", line 149, in load_model
[automatic] |     shared.sd_model  # noqa: B018
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/shared_items.py", line 175, in sd_model
[automatic] |     return modules.sd_models.model_data.get_sd_model()
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 693, in get_sd_model
[automatic] |     load_model()
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 845, in load_model
[automatic] |     load_model_weights(sd_model, checkpoint_info, state_dict, timer)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 440, in load_model_weights
[automatic] |     model.load_state_dict(state_dict, strict=False)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
[automatic] |     module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 221, in load_state_dict
[automatic] |     original(module, state_dict, strict=strict)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2175, in load_state_dict
[automatic] |     load(self, state_dict)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   [Previous line repeated 1 more time]
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2157, in load
[automatic] |     module._load_from_state_dict(
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 225, in <lambda>
[automatic] |     linear_load_from_state_dict = self.replace(torch.nn.Linear, '_load_from_state_dict', lambda *args, **kwargs: load_from_state_dict(linear_load_from_state_dict, *args, **kwargs))
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 191, in load_from_state_dict
[automatic] |     module._parameters[name] = torch.nn.parameter.Parameter(torch.zeros_like(param, device=device, dtype=dtype), requires_grad=param.requires_grad)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_meta_registrations.py", line 4964, in zeros_like
[automatic] |     res.fill_(0)
[automatic] | RuntimeError: HIP error: shared object initialization failed
[automatic] | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[automatic] | For debugging consider passing AMD_SERIALIZE_KERNEL=3.
[automatic] | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[automatic] | 
[automatic] | 
[automatic] | 
[automatic] | Stable diffusion model failed to load
[automatic] | Applying attention optimization: Doggettx... done.
[automatic] | ./webui.sh: line 304:   191 Segmentation fault      (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
xangelix commented 1 week ago

Okay, well, restarting fixed the issue for me. If you haven't tried that, definitely do, even if host system libraries haven't changed. It appears that sometimes a previous GPU compute operation doesn't end or close properly, and it generates seemingly random errors until the GPU is fully reset.