AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

GNU Affero General Public License v3.0

142.59k stars 26.89k forks source link

[Bug]: Cannot get it to run with ROCM 5.7/6.0 on Arch Linux 7900 XTX #14815

Closed ashirviskas closed 8 months ago

ashirviskas commented 9 months ago

Checklist

[X] The issue exists after disabling all extensions
[X] The issue exists on a clean installation of webui
[ ] The issue is caused by an extension, but I believe it is caused by a bug in the webui
[X] The issue exists in the current version of the webui
[ ] The issue has not been reported before recently
[ ] The issue has been reported before but has not been fixed yet

What happened?

I've tried using SD using all kinds of methods, from following the Arch instructions, tried various workarounds, installed it from scratch a few times.

Used pytorch nightly and latest stable 5.7. Nothing has worked so far, running SD I get this and it hangs indefinitely.

Python 3.10.13 (main, Dec 21 2023, 15:23:51) [GCC 13.2.1 20230801]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Launching Web UI with arguments: --enable-insecure-extension-access --opt-sdp-attention
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/mati/projects/stable-diffusion-webui/styles.csv
Loading weights [6ce0161689] from /home/mati/projects/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 13.1s (prepare environment: 5.9s, import torch: 3.1s, import gradio: 1.1s, setup paths: 0.9s, other imports: 0.7s, load scripts: 0.3s, create ui: 0.4s, gradio launch: 0.6s).
Opening in existing browser session.
Creating model from config: /home/mati/projects/stable-diffusion-webui/configs/v1-inference.yaml
Applying attention optimization: sdp... done.

Steps to reproduce the problem

Use arch linux with 7900 XTX
Try to use AUTOMATIC1111 SD
It doesn't work.

What should have happened?

It should work.

What browsers do you use to access the UI ?

Mozilla Firefox, Google Chrome

Sysinfo

sysinfo.json

Console logs

Python 3.10.13 (main, Dec 21 2023, 15:23:51) [GCC 13.2.1 20230801]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Launching Web UI with arguments: --enable-insecure-extension-access --opt-sdp-attention
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/mati/projects/stable-diffusion-webui/styles.csv
Loading weights [6ce0161689] from /home/mati/projects/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 13.1s (prepare environment: 5.9s, import torch: 3.1s, import gradio: 1.1s, setup paths: 0.9s, other imports: 0.7s, load scripts: 0.3s, create ui: 0.4s, gradio launch: 0.6s).
Opening in existing browser session.
Creating model from config: /home/mati/projects/stable-diffusion-webui/configs/v1-inference.yaml
Applying attention optimization: sdp... done.

Additional information

No response

DGdev91 commented 9 months ago

Have you tried to just open http://127.0.0.1:7860 in your browser? Maybe it's indeed running but it just doesn't open the browser automatically.

ashirviskas commented 9 months ago

Have you tried to just open http://127.0.0.1:7860/ in your browser? Maybe it's indeed running but it just doesn't open the browser automatically.

Webui works fine, just the process itself gets stuck and maxes out the GPU which can be seen here. Even after I kill the python process, GPU stays maxed out and draws additional ~100W of power and my system becomes super laggy. Only fix I found so far is a reboot.

ashirviskas commented 9 months ago

When I click generate, nothing happens in the console at all.

DGdev91 commented 9 months ago

Mmmh.... Could it possibly be related to this issue? https://github.com/ROCm/ROCm/issues/2596

There was a bug on linux 6.6 and later, it got fixed kust recently. It should be already fixed in the last kernel from ArchLinux 's repo. So, first of all check your kernel version. It should be fixed on 6.7.2

HinaHyugaHime commented 9 months ago

:wave: here to help, I have a guide to installing and deploying ROCm on archlinux https://civitai.com/articles/1503 the 7000 series is 11.0.0 on GFX version make sure to read the underlying text !

alexhegit commented 9 months ago

It works for me with ROCm5.7 + Radeon Pro W7900 (same arch of 7900XTX). Only need one code line changed to install ROCm5.7 replace ROCm5.4.2 in webui.sh as bellow.

if ! echo "$gpu_info" | grep -q "NVIDIA"; then if echo "$gpu_info" | grep -q "AMD" && [[ -z "${TORCH_COMMAND}" ]] then

rocm5.4.2"

    export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7"
fi

HinaHyugaHime commented 9 months ago

It works for me with ROCm5.7 + Radeon Pro W7900 (same arch of 7900XTX). Only need one code line changed to install ROCm5.7 replace ROCm5.4.2 in webui.sh as bellow.

if ! echo "$gpu_info" | grep -q "NVIDIA"; then if echo "$gpu_info" | grep -q "AMD" && [[ -z "${TORCH_COMMAND}" ]] then #export TORCH_COMMAND="pip install torch==2.0.1+rocm5.4.2 torchvision==0.15.2+rocm5.4.2 --index-url https://download.pytorch.org/whl/rocm5.4.2" export TORCH_COMMAND="pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7" fi fi

Thats only if his ROCm is installed, thats the pytorch not RoCm

ashirviskas commented 9 months ago

Mmmh.... Could it possibly be related to this issue? https://github.com/ROCm/ROCm/issues/2596

There was a bug on linux 6.6 and later, it got fixed kust recently. It should be already fixed in the last kernel from ArchLinux 's repo. So, first of all check your kernel version. It should be fixed on 6.7.2

Thanks, I've upgraded to 6.7.2 yesterday, but that did not seem to make any impact so it might be something else :/

👋 here to help, I have a guide to installing and deploying ROCm on archlinux https://civitai.com/articles/1503 the 7000 series is 11.0.0 on GFX version make sure to read the underlying text !

I did a very similar setup on my system, I've tried torch+rocm 5.7 stable and nightly, with both system versions - 5.7 and 6.0. I am using this custom launch.sh script to launch SD:

#!/bin/sh

source venv/bin/activate

export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0
#export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:512
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib 

#python3 launch.py --enable-insecure-extension-access --opt-sdp-attention
python3 launch.py

I've tried many permutations with the commented out stuff, but maybe your keen eyes will tell me that there's something wrong :crossed_fingers:

I will try a few more things from the suggestions I've got, some other ideas I have and permutations of them to see if I can get it working. Annoyingly, I have to reboot my system between attempts as it becomes not really usable.

HinaHyugaHime commented 9 months ago

Mmmh.... Could it possibly be related to this issue? ROCm/ROCm#2596

There was a bug on linux 6.6 and later, it got fixed kust recently. It should be already fixed in the last kernel from ArchLinux 's repo. So, first of all check your kernel version. It should be fixed on 6.7.2

Thanks, I've upgraded to 6.7.2 yesterday, but that did not seem to make any impact so it might be something else :/

👋 here to help, I have a guide to installing and deploying ROCm on archlinux https://civitai.com/articles/1503 the 7000 series is 11.0.0 on GFX version make sure to read the underlying text !

I did a very similar setup on my system, I've tried torch+rocm 5.7 stable and nightly, with both system versions - 5.7 and 6.0. I am using this custom launch.sh script to launch SD:
#!/bin/sh

source venv/bin/activate

export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0
#export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:512
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib 

#python3 launch.py --enable-insecure-extension-access --opt-sdp-attention
python3 launch.py
I've tried many permutations with the commented out stuff, but maybe your keen eyes will tell me that there's something wrong 🤞

I will try a few more things from the suggestions I've got, some other ideas I have and permutations of them to see if I can get it working. Annoyingly, I have to reboot my system between attempts as it becomes not really usable.

Sorry, but I think you need to use the official webui.sh and webui-user.sh and put the GFX in .profile and all that like I showed, if you are just experiencing really slow loading, that is not a ROCm issue, that is a 1.7.0 issue with SD, something they changed slowed it quite some bit

DGdev91 commented 9 months ago

👋 here to help, I have a guide to installing and deploying ROCm on archlinux https://civitai.com/articles/1503 the 7000 series is 11.0.0 on GFX version make sure to read the underlying text !

You are doing it more complicated than it should be. You can install the rocm packages in both pure Arch and Manjaro just by using pacman, there's no need to use yay.

Also, there's no need to add the HSA_OVERRIDE on 7900xt and 7900xtx, thery are already officially supported and found by the system as gfx1100.

Finally, you don't really need to install pytorch manually. The webui.sh install script should create the venv, activate it, and install pytorch all by itself.

I did a very similar setup on my system, I've tried torch+rocm 5.7 stable and nightly, with both system versions - 5.7 and 6.0. I am using this custom launch.sh script to launch SD:
#!/bin/sh

source venv/bin/activate

export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0
#export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:512
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib 

#python3 launch.py --enable-insecure-extension-access --opt-sdp-attention
python3 launch.py
I've tried many permutations with the commented out stuff, but maybe your keen eyes will tell me that there's something wrong 🤞

I will try a few more things from the suggestions I've got, some other ideas I have and permutations of them to see if I can get it working. Annoyingly, I have to reboot my system between attempts as it becomes not really usable.

As i just said, there's no need of all that custom stuff. HSA_OVERRIDE_GFX_VERSION isn't needed for your card and HIP_VISIBLE_DEVICES is only needed if you have more than one gpu (or maybe an integrated gpu bundled with the cpu)

--opt-sdp-attention can help to make the generations faster, but it also requires more vram than other attention flags. I usually prefer --opt-sub-quad-attention, but they should he both fine in your case.

If you want, you can uncomment the COMMANDLINE_ARGS and TORCH_COMMAND ries in webui-user.sh snd make them like this:

COMMANDLINE_ARGS="--opt-sdp-attention" TORCH_COMMAND="pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7"

Delete your venv folder, just to be sure, and launch the official webui.sh script. Be patient, it will take a LONG time.

ashirviskas commented 9 months ago

Thank you both of you.

I've just cleaned it up, rebooted and now I'm doing only the barebones setup as @DGdev91 suggested, on v1.7.0.

As for things taking a long time, which part is it supposed to be? It took ~8 minutes to get and install all the requirements, now it has opened the UI. I've typed in the prompt and now nothing is happening in the terminal:

Successfully installed MarkupSafe-2.1.3 certifi-2022.12.7 charset-normalizer-2.1.1 filelock-3.9.0 fsspec-2023.4.0 idna-3.4 jinja2-3.1.2 mpmath-1.2.1 networkx-3.0rc1 numpy-1.24.1 pillow-9.3.0 pytorch-triton-rocm-3.0.0+dafe145982 requests-2.28.1 sympy-1.11.1 torch-2.3.0.dev20240201+rocm5.7 torchaudio-2.2.0.dev20240201+rocm5.7 torchvision-0.18.0.dev20240201+rocm5.7 typing-extensions-4.8.0 urllib3-1.26.13
WARNING: There was an error checking the latest version of pip.
Installing clip
Installing open_clip
Installing requirements for CodeFormer
Installing requirements
Launching Web UI with arguments: --opt-sdp-attention
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/mati/projects/stable-diffusion-webui/styles.csv
Loading weights [6ce0161689] from /home/mati/projects/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 237.3s (prepare environment: 232.1s, import torch: 2.0s, import gradio: 0.6s, setup paths: 0.4s, other imports: 0.5s, load scripts: 0.2s, create ui: 0.7s, gradio launch: 0.5s).
Opening in existing browser session.
Creating model from config: /home/mati/projects/stable-diffusion-webui/configs/v1-inference.yaml
Applying attention optimization: sdp... done.

Right now 1 core is being maxed out and AMDGPU TOP is showing me this:

I think I've waited for it over 8 hours when this happened a week or so ago, but nothing came out of it.

EDIT: I am on ROCm 6.0.0-2 from arch testing repositories right now and my previous experience was on 5.7.x

HinaHyugaHime commented 9 months ago

👋 here to help, I have a guide to installing and deploying ROCm on archlinux https://civitai.com/articles/1503 the 7000 series is 11.0.0 on GFX version make sure to read the underlying text !

You are doing it more complicated than it should be. You can install the rocm packages in both pure Arch and Manjaro just by using pacman, there's no need to use yay.

Also, there's no need to add the HSA_OVERRIDE on 7900xt and 7900xtx, thery are already officially supported and found by the system as gfx1100.

Finally, you don't really need to install pytorch manually. The webui.sh install script should create the venv, activate it, and install pytorch all by itself.
I did a very similar setup on my system, I've tried torch+rocm 5.7 stable and nightly, with both system versions - 5.7 and 6.0. I am using this custom launch.sh script to launch SD:
#!/bin/sh

source venv/bin/activate

export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HIP_VISIBLE_DEVICES=0
#export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.8,max_split_size_mb:512
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib 

#python3 launch.py --enable-insecure-extension-access --opt-sdp-attention
python3 launch.py
I've tried many permutations with the commented out stuff, but maybe your keen eyes will tell me that there's something wrong 🤞 I will try a few more things from the suggestions I've got, some other ideas I have and permutations of them to see if I can get it working. Annoyingly, I have to reboot my system between attempts as it becomes not really usable.
As i just said, there's no need of all that custom stuff. HSA_OVERRIDE_GFX_VERSION isn't needed for your card and HIP_VISIBLE_DEVICES is only needed if you have more than one gpu (or maybe an integrated gpu bundled with the cpu)

--opt-sdp-attention can help to make the generations faster, but it also requires more vram than other attention flags. I usually prefer --opt-sub-quad-attention, but they should he both fine in your case.

If you want, you can uncomment the COMMANDLINE_ARGS and TORCH_COMMAND ries in webui-user.sh snd make them like this:

COMMANDLINE_ARGS="--opt-sdp-attention" TORCH_COMMAND="pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7"

Delete your venv folder, just to be sure, and launch the official webui.sh script. Be patient, it will take a LONG time.

You neglect to see the point in those, yay is assuming you havent used arch and to make it easier overall installing future packages, GFX is already on SOME not all, and the reason for pre installing ROCm is to test if ROCm is running on your device properly so you know its using ROCm

DGdev91 commented 9 months ago

You neglect to see the point in those, yay is assuming you havent used arch and to make it easier overall installing future packages, GFX is already on SOME not all, and the reason for pre installing ROCm is to test if ROCm is running on your device properly so you know its using ROCm

It can be helpful in a more generic situation, but he already said he has a 7900xt and rocm is already installed (if not, it should have given a different error). I'm not saying it's wrong, i just feel there's no need to add extra steps.

Thank you both of you.

I've just cleaned it up, rebooted and now I'm doing only the barebones setup as @DGdev91 suggested, on v1.7.0.

As for things taking a long time, which part is it supposed to be? It took ~8 minutes to get and install all the requirements, now it has opened the UI. I've typed in the prompt and now nothing is happening in the terminal:
Successfully installed MarkupSafe-2.1.3 certifi-2022.12.7 charset-normalizer-2.1.1 filelock-3.9.0 fsspec-2023.4.0 idna-3.4 jinja2-3.1.2 mpmath-1.2.1 networkx-3.0rc1 numpy-1.24.1 pillow-9.3.0 pytorch-triton-rocm-3.0.0+dafe145982 requests-2.28.1 sympy-1.11.1 torch-2.3.0.dev20240201+rocm5.7 torchaudio-2.2.0.dev20240201+rocm5.7 torchvision-0.18.0.dev20240201+rocm5.7 typing-extensions-4.8.0 urllib3-1.26.13
WARNING: There was an error checking the latest version of pip.
Installing clip
Installing open_clip
Installing requirements for CodeFormer
Installing requirements
Launching Web UI with arguments: --opt-sdp-attention
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/mati/projects/stable-diffusion-webui/styles.csv
Loading weights [6ce0161689] from /home/mati/projects/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 237.3s (prepare environment: 232.1s, import torch: 2.0s, import gradio: 0.6s, setup paths: 0.4s, other imports: 0.5s, load scripts: 0.2s, create ui: 0.7s, gradio launch: 0.5s).
Opening in existing browser session.
Creating model from config: /home/mati/projects/stable-diffusion-webui/configs/v1-inference.yaml
Applying attention optimization: sdp... done.
Right now 1 core is being maxed out and AMDGPU TOP is showing me this:

I think I've waited for it over 8 hours when this happened a week or so ago, but nothing came out of it.

EDIT: I am on ROCm 6.0.0-2 from arch testing repositories right now and my previous experience was on 5.7.x

Wierd. If the ui starts, it usually means it's ok.

I usually use CoreCtrl to check the load on the gpu. amdgpu-top should be fine to, but you can still try with that, just to be sure.

Can you show us a screenshot of the ui? In the footer there's written wich version of the ui and rocm you are using, can be useful to check if at least you are using the right versions

Also.... I suggest to try also with the linux-zen kernel. The normal linux package should in theory work just fine too, but who knows, maybe they haven't patched it yet. Linux-zen it's the kernel i'm using right now, i'm sure that works.

ashirviskas commented 9 months ago

Wierd. If the ui starts, it usually means it's ok.

Yeah, before I used to get some errors, after some troubleshooting, not anymore (I think its mostly because more things are updated and have good defaults).

I usually use CoreCtrl to check the load on the gpu. amdgpu-top should be fine to, but you can still try with that, just to be sure.

corectrl shows activity at 100%

Can you show us a screenshot of the ui? In the footer there's written wich version of the ui and rocm you are using, can be useful to check if at least you are using the right versions

Here it is:

Also.... I suggest to try also with the linux-zen kernel. The normal linux package should in theory work just fine too, but who knows, maybe they haven't patched it yet. Linux-zen it's the kernel i'm using right now, i'm sure that works.

I've just updated to 6.7.3-arch1-1, same issue, I will try zen in a moment.

ashirviskas commented 9 months ago

6.7.3-zen1-1-zen gives me something new (freedesktop errors existed before, but not GPU errors):

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on mati user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Launching Web UI with arguments: --opt-sdp-attention
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/mati/projects/stable-diffusion-webui/styles.csv
Loading weights [6ce0161689] from /home/mati/projects/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 75.2s (prepare environment: 66.1s, import torch: 3.7s, import gradio: 1.0s, setup paths: 1.0s, other imports: 1.0s, load scripts: 0.3s, create ui: 0.7s, gradio launch: 1.1s).
[2787:2809:0202/184147.652961:ERROR:object_proxy.cc(577)] Failed to call method: org.freedesktop.DBus.Properties.Get: object_path= /org/freedesktop/portal/desktop: org.freedesktop.DBus.Error.InvalidArgs: No such interface “org.freedesktop.portal.FileChooser”
[2787:2809:0202/184147.653001:ERROR:select_file_dialog_linux_portal.cc(285)] Failed to read portal version property
Creating model from config: /home/mati/projects/stable-diffusion-webui/configs/v1-inference.yaml
Applying attention optimization: sdp... done.
[2824:2824:0202/184151.919380:ERROR:shared_context_state.cc(946)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
[2824:2824:0202/184151.919644:ERROR:gpu_service_impl.cc(1105)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
[2787:2787:0202/184152.980605:ERROR:gpu_process_host.cc(994)] GPU process exited unexpectedly: exit_code=8704
[2988:1:0202/184307.027986:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[3035:1:0202/184308.937478:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[2970:1:0202/184324.457780:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.

Joly0 commented 9 months ago

Hey guys, maybe someone in here can help me. I am trying to get this working in a docker container. I was able to get comfyui working in the container, but i want to tackle a1111 aswell, but it doesnt work. The command to start is this one: TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.7' HSA_OVERRIDE_GFX_VERSION=10.3.0 bash webui.sh --precision full --no-half --skip-torch-cuda-test

Problem is, if i click generate it maxes out my cpu, but not my igpu (i have an r9 7950x)

DGdev91 commented 9 months ago

Hey guys, maybe someone in here can help me. I am trying to get this working in a docker container. I was able to get comfyui working in the container, but i want to tackle a1111 aswell, but it doesnt work. The command to start is this one: TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.7' HSA_OVERRIDE_GFX_VERSION=10.3.0 bash webui.sh --precision full --no-half --skip-torch-cuda-test

Problem is, if i click generate it maxes out my cpu, but not my igpu (i have an r9 7950x)

Seems like rocm doesn't see your iGPU. ROCm isn't officially supported on those. I don't know if there's a way to make it work on a igpu, but even if it's possibile you probably would have terrible performances anyway. Maybe even worse than running on pure cpu.

You normally shouldn't even need --skip-torch-cuda-test if you had a working gpu.

Anyway, your problem isn't related at all to this issue, you should open a different post. And actually this isn't really the right place, since it isn't a bug in the webui.

DGdev91 commented 9 months ago

6.7.3-zen1-1-zen gives me something new (freedesktop errors existed before, but not GPU errors):

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on mati user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Launching Web UI with arguments: --opt-sdp-attention
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/mati/projects/stable-diffusion-webui/styles.csv
Loading weights [6ce0161689] from /home/mati/projects/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 75.2s (prepare environment: 66.1s, import torch: 3.7s, import gradio: 1.0s, setup paths: 1.0s, other imports: 1.0s, load scripts: 0.3s, create ui: 0.7s, gradio launch: 1.1s).
[2787:2809:0202/184147.652961:ERROR:object_proxy.cc(577)] Failed to call method: org.freedesktop.DBus.Properties.Get: object_path= /org/freedesktop/portal/desktop: org.freedesktop.DBus.Error.InvalidArgs: No such interface “org.freedesktop.portal.FileChooser”
[2787:2809:0202/184147.653001:ERROR:select_file_dialog_linux_portal.cc(285)] Failed to read portal version property
Creating model from config: /home/mati/projects/stable-diffusion-webui/configs/v1-inference.yaml
Applying attention optimization: sdp... done.
[2824:2824:0202/184151.919380:ERROR:shared_context_state.cc(946)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
[2824:2824:0202/184151.919644:ERROR:gpu_service_impl.cc(1105)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
[2787:2787:0202/184152.980605:ERROR:gpu_process_host.cc(994)] GPU process exited unexpectedly: exit_code=8704
[2988:1:0202/184307.027986:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[3035:1:0202/184308.937478:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[2970:1:0202/184324.457780:ERROR:command_buffer_proxy_impl.cc(127)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.

Well.... I never seen that error before, I don't really know what's happening. But i know several people with your same card wich are using SD WebUI just fine, so it's definitley something in your setup.

I remember you mentioned that you installed pytorch trought the AUR package, am i right? If that so, uninstall it, maybe it somehow interferes with the version from pip.

Also, for rocm... Wich package have you installed exactly? Check if rocm-hip-sdk and rocm-ml-sdk sre both installed. (I also have rocm-opencl-sdk, but it shouldn't be needed in this case)

5.7.1 and 6.0 should both be just fine.

HinaHyugaHime commented 9 months ago

Hey guys, maybe someone in here can help me. I am trying to get this working in a docker container. I was able to get comfyui working in the container, but i want to tackle a1111 aswell, but it doesnt work. The command to start is this one: TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.7' HSA_OVERRIDE_GFX_VERSION=10.3.0 bash webui.sh --precision full --no-half --skip-torch-cuda-test

Problem is, if i click generate it maxes out my cpu, but not my igpu (i have an r9 7950x)

Cant deploy ROCm is docker afaik

Joly0 commented 9 months ago

Cant deploy ROCm is docker afaik

Sure you can, there are official docker images for rocm and there is even documentation for running a1111 in docker, but that doesnt work, as it only covers rocm supported gpus. It doesnt work with unsupported ones

HinaHyugaHime commented 9 months ago

Cant deploy ROCm is docker afaik

Sure you can, there are official docker images for rocm and there is even documentation for running a1111 in docker, but that doesnt work, as it only covers rocm supported gpus. It doesnt work with unsupported ones

On 7000 series maybe, but thats pointless rn being as its already supported on windows 7000 series, docker is meant to be hosted though, not ran on a daily driver

Joly0 commented 9 months ago

Cant deploy ROCm is docker afaik

Sure you can, there are official docker images for rocm and there is even documentation for running a1111 in docker, but that doesnt work, as it only covers rocm supported gpus. It doesnt work with unsupported ones

On 7000 series maybe, but thats pointless rn being as its already supported on windows 7000 series, docker is meant to be hosted though, not ran on a daily driver

Totally disagree with your opinion, but ok, was just searching for help

ashirviskas commented 9 months ago

Well.... I never seen that error before, I don't really know what's happening. But i know several people with your same card wich are using SD WebUI just fine, so it's definitley something in your setup.

Yeah, it is likely :') It might be out of the scope of SD, thank you a lot for helping out.

I remember you mentioned that you installed pytorch trought the AUR package, am i right? If that so, uninstall it, maybe it somehow interferes with the version from pip.

I am using pip version in venv, so it shouldn't impact anything, but I will try without it, I am running out of things to try.

Also, for rocm... Wich package have you installed exactly? Check if rocm-hip-sdk and rocm-ml-sdk sre both installed. (I also have rocm-opencl-sdk, but it shouldn't be needed in this case) 5.7.1 and 6.0 should both be just fine.

I do have them, here's my output of pacman -Qs rocm:

local/comgr 5.7.1-1
    Compiler support library for ROCm LLVM
local/hip-runtime-amd 5.7.1-1
    Heterogeneous Interface for Portability ROCm
local/hipblas 5.7.1-1
    ROCm BLAS marshalling library
local/hsa-rocr 5.7.1-1
    HSA Runtime API and runtime for ROCm
local/rccl 5.7.1-1
    ROCm Communication Collectives Library
local/rocalution 5.7.1-1
    Next generation library for iterative sparse solvers for ROCm platform
local/rocblas 5.7.1-1
    Next generation BLAS implementation for ROCm platform
local/rocfft 5.7.1-1
    Next generation FFT implementation for ROCm
local/rocm-clang-ocl 5.7.1-1
    OpenCL compilation with clang compiler
local/rocm-cmake 5.7.1-1
    CMake modules for common build tasks needed for the ROCm software stack
local/rocm-core 5.7.1-1
    AMD ROCm core package (version files)
local/rocm-dbgapi 5.7.1-1
    Support library necessary for a debugger of AMD's GPUs
local/rocm-device-libs 5.7.1-1
    ROCm Device Libraries
local/rocm-hip-libraries 5.7.1-2
    Develop certain applications using HIP and libraries for AMD platforms
local/rocm-hip-runtime 5.7.1-2
    Packages to run HIP applications on the AMD platform
local/rocm-hip-sdk 5.7.1-2
    Develop applications using HIP and libraries for AMD platforms
local/rocm-language-runtime 5.7.1-2
    ROCm runtime
local/rocm-llvm 5.7.1-1
    Radeon Open Compute - LLVM toolchain (llvm, clang, lld)
local/rocm-ml-libraries 5.7.1-2
    Packages for key Machine Learning libraries
local/rocm-ml-sdk 5.7.1-2
    develop and run Machine Learning applications optimized for AMD platforms
local/rocm-opencl-runtime 5.7.1-1
    OpenCL implementation for AMD
local/rocm-opencl-sdk 5.7.1-2
    Develop OpenCL-based applications for AMD platforms
local/rocm-smi-lib 5.7.1-1
    ROCm System Management Interface Library
local/rocminfo 5.7.1-1
    ROCm Application for Reporting System Info
local/rocrand 5.7.1-1
    Pseudo-random and quasi-random number generator on ROCm
local/rocsolver 5.7.1-1
    Subset of LAPACK functionality on the ROCm platform
local/rocsparse 5.7.1-1
    BLAS for sparse computation on top of ROCm
local/rocthrust 5.7.1-1
    Port of the Thrust parallel algorithm library atop HIP/ROCm

Here's my dmesg output after running webui.sh on a fresh clone and git checkout v1.7.0 + adding these two lines to webui-user.sh:

export COMMANDLINE_ARGS="--opt-sdp-attention"
export TORCH_COMMAND="pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7"

dmesg:

[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000101000 from client 10
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: CPC (0x5)
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x1
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x1
[Sat Feb  3 11:56:43 2024] amdgpu 0000:07:00.0: amdgpu:          RW: 0x0

DGdev91 commented 9 months ago

Have you tried to add your user to the "video" and "render" groups?

Then... I know it's a bit extreme, but also trying a clean arch installation can help. Maybe there's something that has been cached somewhere from the time you were using a different gpu

ashirviskas commented 9 months ago

Have you tried to add your user to the "video" and "render" groups?

Yeah, it is already there

Then... I know it's a bit extreme, but also trying a clean arch installation can help. Maybe there's something that has been cached somewhere from the time you were using a different gpu

I mean that is as good as any other option now. Either my setup is messed up somehow, or it is a hardware problem as I've never really got ROCm to function properly besides some small toy examples. I'll report back from a fresh install, thank you for helping out.

ashirviskas commented 9 months ago

After installing arch on an old HDD, I'm back to report that a clean install did nothing.

These seem to be all related:

https://github.com/ROCm/ROCm/issues/2642 (using torch+rocm5.6 didn't work for me)
https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/14128
https://aur.archlinux.org/packages/hip-runtime-amd-blender (discussion in the comments)
https://github.com/ROCm/ROCm/issues/2766
https://bbs.archlinux.org/viewtopic.php?id=284076 (general crashing in the desktop, but same error messages)

I did not find anyone who was able to fully fix these issues, ~~so it must be something to do with our hardware or hardware-driver combination.~~ It seems like this is a kernel bug.

ashirviskas commented 9 months ago

Here's my GPU bios info (amdgpu_top -> VBIOS info -> COPY)

VbiosInfo {
    name: "ASRock Navi31-XTX PGD",
    pn: "113-D70201-810009",
    ver: "022.001.002.020.000001",
    date: "2023/02/14 02:35",
    size: 57856,
}

GPU model: Asrock Radeon RX 7900 XTX Phantom Gaming OC 24GB Sadly I did not find any official BIOS update for this exact card on the official asrock website.

Maybe someone who got it working on a 7900 XTX can report your model and vBIOS version?

EDIT: Found this vBIOS, version 022.001.002.031.000001. Not sure though if I want to risk it yet though.

EDIT2: This is almost surely not a hardware issue, but a kernel/driver/rocm issue as per https://github.com/ROCm/ROCm/issues/2596

ashirviskas commented 9 months ago

Update: ROCm 6.0 packages came out of testing in arch repos today, sadly, I'm still getting all kinds of errors.

ashirviskas commented 8 months ago

Update: It finally works! I did nothing, just let the system update (and switched to Wayland in the meantime).