AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Selecting AMD GPU with different ARCH than main GPU #14182

Open lufixSch opened 10 months ago

lufixSch commented 10 months ago

Is there an existing issue for this?

What happened?

I run a system with an RX 6750 XT and an RX 7900 XTX. The 6750 XT runs perfectly, but when I try to select the 7900 XTX with CUDA_VISIBLE_DEVICES, webui.sh still configures everything for the 6750 XT.

Steps to reproduce the problem

  1. Get a system with two AMD GPUs with different ARCH
  2. Start the WebUI with CUDA_VISIBLE_DEVICES=<id of secondary gpu> ./webui.sh

What should have happened?

I should be able to switch the GPU using CUDA_VISIBLE_DEVICES.

Sysinfo

sysinfo-2023-12-03-10-43.json

What browsers do you use to access the UI ?

Mozilla Firefox

Console logs

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on lukas user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
python venv already activate or run without venv: /data/linux_data/AI/Stable_Diffusion/WebUI/.venvs/sd1100
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.10.12 (main, Sep  9 2023, 14:12:31) [GCC 13.2.1 20230801]
Version: v1.6.0-443-g4a666381
Commit hash: 4a666381bf98333ba4512db0f0033df5f6a08771
Launching Web UI with arguments: 
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
[-] ADetailer initialized. version: 23.9.3, num models: 9
2023-12-03 11:44:59,408 - ControlNet - INFO - ControlNet v1.1.313
ControlNet preprocessor location: /data/linux_data/AI/Stable_Diffusion/WebUI/extensions/sd-webui-controlnet/annotator/downloads
2023-12-03 11:44:59,467 - ControlNet - INFO - ControlNet v1.1.313
Loading weights [797dab5e63] from /data/linux_data/AI/Stable_Diffusion/WebUI/models/Stable-diffusion/epicphotogasm_v4One4All.safetensors
Creating model from config: /data/linux_data/AI/Stable_Diffusion/WebUI/configs/v1-inference.yaml
Memory access fault by GPU node-1 (Agent handle: 0x5594059b7700) on address 0x7f598b08e000. Reason: Page not present or supervisor privilege.
./webui.sh: line 256: 15678 Aborted                 (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"

Additional information

I'm pretty sure this issue is caused by line 126 in ./webui.sh: gpu_info=$(lspci 2>/dev/null | grep -E "VGA|Display")

This command will output all GPUs of the system regardless of the value of CUDA_VISIBLE_DEVICES

00:02.0 Display controller: Intel Corporation AlderLake-S GT1 (rev 0c)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c0)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] (rev c8)

This means the switch-case statement that selects the torch command and sets HSA_OVERRIDE_GFX_VERSION makes its decision based on the first GPU listed in that output.
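For reference, the relevant part of webui.sh looks roughly like this (a paraphrased sketch, not the exact upstream code; the torch index URL is only an example):

```bash
# Paraphrased sketch of the webui.sh detection logic, not the exact upstream code.
gpu_info=$(lspci 2>/dev/null | grep -E "VGA|Display")   # lists every GPU, ignores CUDA_VISIBLE_DEVICES
case "$gpu_info" in
    *"Navi 2"*)
        export HSA_OVERRIDE_GFX_VERSION=10.3.0           # RDNA2 override, wrong for a 7900 XTX
        ;;
    *"Navi 3"*)
        # example only: the real script pins a specific ROCm torch build here
        export TORCH_COMMAND="pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.7"
        ;;
esac
# Because the patterns are matched against the whole lspci output, the "Navi 2"
# branch wins whenever a Navi 2x card is present, even if CUDA_VISIBLE_DEVICES
# points at the Navi 3x card.
```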

w-e-w commented 10 months ago

I'm not familiar with this because I'm primarily a Nvidia + Windows user

I'm not sure what you mean when you wrote

CUDA_VISIBLE_DEVICES=<id of secondary gpu> ./webui.sh

All I can see is that in the sysinfo you provided, you did not set the environment variable CUDA_VISIBLE_DEVICES nor use the --device-id flag to select your device. In other words, I don't see you telling webui to use a specific device, so it uses the default device, which is the first one.


I believe the gpu_info=$(lspci 2>/dev/null | grep -E "VGA|Display") is just detecting what GPU you have and selecting the appropriate torch command. Yes, I think it's based on the first GPU, but that shouldn't matter in your case because both are AMD, right?

If it has trouble detecting the correct device and uses the wrong torch command, you can always override it by setting TORCH_COMMAND yourself.
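For example (index URL shown for ROCm 5.7; adjust it to the ROCm version you actually have installed):

```bash
# Force the ROCm torch build yourself instead of relying on the autodetection
export TORCH_COMMAND="pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.7"
./webui.sh
```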

lufixSch commented 10 months ago

CUDA_VISIBLE_DEVICES selects the GPUs visible to tools like torch. --device-id was not working for me, and even if it did, it is reported that the VRAM of the main GPU will still be used to some extent. And --device-id would still not change the outcome of the switch-case statement in webui.sh. Therefore I use CUDA_VISIBLE_DEVICES=1 to select the secondary GPU.

All I can see is that in the sysinfo you provided, you did not set the environment variable CUDA_VISIBLE_DEVICES nor use the --device-id flag to select your device

I did set CUDA_VISIBLE_DEVICES=0 (as this is the only way the WebUI starts), but it seems like multiple environment variables are missing from the sysinfo. A quick look into the source code shows that there is a whitelist of environment variables included in the sysinfo, and CUDA_VISIBLE_DEVICES is not one of them.

I believe the gpu_info=$(lspci 2>/dev/null | grep -E "VGA|Display") is just detecting what GPU you have and selecting the appropriate torch command

It doesn't just set the torch command but also HSA_OVERRIDE_GFX_VERSION=10.3.0 for Navi 2 GPUs (e.g. the 6750 XT). If this is set when starting with a Navi 3 GPU, the WebUI crashes with the error given above.

w-e-w commented 10 months ago

Oh ya, HSA_OVERRIDE_GFX_VERSION, I have no idea what that means. We need an expert on this.

it seems like multiple environment variables are missing in the sysinfo

I think that may be another bug that needs to be fixed

lufixSch commented 10 months ago

As far as I understand, HSA_OVERRIDE_GFX_VERSION forces a specific GFX version for ROCm. This is needed because 6xxx GPUs are not officially supported, so the right GFX version is not detected automatically. As of ROCm 5.6, 7xxx GPUs are supported, so this environment variable is not needed for them. But forcing GFX 10.3.0 for a 7xxx GPU will break things, as 7xxx GPUs use GFX 11.0.0.
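In short (override values as commonly used with ROCm; verify against the gfx target your card actually reports):

```bash
# RDNA2 (gfx103x, e.g. RX 6750 XT): override needed because the card is not officially supported
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# RDNA3 (gfx110x, e.g. RX 7900 XTX): supported since ROCm 5.6, normally no override needed;
# if one is set at all it would have to be 11.0.0, never 10.3.0
# export HSA_OVERRIDE_GFX_VERSION=11.0.0
```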

I think this was unclear before: I understand pretty well what is happening and why. My main question is: does this project want to support CUDA_VISIBLE_DEVICES, and if so, how could we change this command to make it work? If not, how could we change this command to work with --device-id? Ideally, it would work with both options.

I think that may be another bug that needs to be fixed

Yes, I think adding things like HSA_OVERRIDE_GFX_VERSION, CUDA_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES and more to the sysinfo would be helpful.

w-e-w commented 10 months ago

--device-id and CUDA_VISIBLE_DEVICES should be equivalent; they're used to set the torch device string https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/b4776ea3a236c07041940ba78a50e25bc5c9a06f/modules/devices.py#L26-L30

One thing I'm not sure about, which may be what prevents it from working: when you are running with AMD, do you still use the same cuda:0 device string?
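For illustration, my understanding of how the different selection mechanisms end up being interpreted (commands illustrative only):

```bash
CUDA_VISIBLE_DEVICES=1 ./webui.sh   # torch only sees the second GPU, which then becomes "cuda:0"
HIP_VISIBLE_DEVICES=1  ./webui.sh   # ROCm counterpart of CUDA_VISIBLE_DEVICES
./webui.sh --device-id 1            # torch sees all GPUs; webui builds the device string "cuda:1"
```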


Remember, this is an open source project; if you know what you're doing and think something is necessary and beneficial for everyone, just make a PR.

As far as I'm aware, most of us working on the webui are Nvidia users, so naturally AMD support is not going to be good unless there are more AMD contributors.

lufixSch commented 10 months ago

One thing I'm not sure about, which may be what prevents it from working: when you are running with AMD, do you still use the same cuda:0 device string?

That could be possible. I tried it with --device-id again (after removing the problematic lines in webui.sh) and I was able to start it with --device-id 0, but --device-id 1 crashes with a segmentation fault.

NOTE: I still have other issues regarding the 7900 XTX (probably unrelated to the stable-diffusion-webui) which could cause this error. But using CUDA_VISIBLE_DEVICES=1 it got further and exited with a (kind of) usable error message.

./webui.sh --device-id 1

```bash
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on lukas user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
python venv already activate or run without venv: /data/linux_data/AI/Stable_Diffusion/WebUI/.venvs/sd1100
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.10.12 (main, Sep 9 2023, 14:12:31) [GCC 13.2.1 20230801]
Version: v1.6.0-443-g4a666381
Commit hash: 4a666381bf98333ba4512db0f0033df5f6a08771
Launching Web UI with arguments: --device-id 1
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
[-] ADetailer initialized. version: 23.9.3, num models: 9
2023-12-03 15:38:30,310 - ControlNet - INFO - ControlNet v1.1.313
ControlNet preprocessor location: /data/linux_data/AI/Stable_Diffusion/WebUI/extensions/sd-webui-controlnet/annotator/downloads
2023-12-03 15:38:30,365 - ControlNet - INFO - ControlNet v1.1.313
Loading weights [797dab5e63] from /data/linux_data/AI/Stable_Diffusion/WebUI/models/Stable-diffusion/epicphotogasm_v4One4All.safetensors
Creating model from config: /data/linux_data/AI/Stable_Diffusion/WebUI/configs/v1-inference.yaml
./webui.sh: line 256: 6347 Segmentation fault (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
```

I also observed the error AssertionError: Invalid device id from torch when using CUDA_VISIBLE_DEVICES=0 ./webui.sh --device-id 1, which means you can't override CUDA_VISIBLE_DEVICES with --device-id.
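Which is consistent with the masking being applied before the device string is resolved:

```bash
CUDA_VISIBLE_DEVICES=0 ./webui.sh --device-id 1
# CUDA_VISIBLE_DEVICES=0 leaves torch with a single visible device (index 0),
# so the "cuda:1" built from --device-id 1 refers to a device that no longer exists.
```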

Regardless, this means --device-id works (not sure why it didn't before), so there needs to be another command, similar to lspci, whose output can be filtered based on the --device-id value.

As far as I'm aware, most of us working on the webui are Nvidia users, so naturally AMD support is not going to be good unless there are more AMD contributors.

Yeah, that is a recurring problem with similar projects. stable-diffusion-webui still has some of the best out-of-the-box AMD support. It's the first time I've had issues, and I would say my setup is pretty unusual.

I'd really like to contribute to this (and similar) projects, especially regarding AMD support, but I'm just starting to get into this topic, so it may take a while...

nonnull-ca commented 6 months ago

FYI: on my system (2x RX 7900 XTX), HIP_VISIBLE_DEVICES and --device-id both appear to function, but only the 1st of the two GPUs works. That is:

| --device-id | HIP_VISIBLE_DEVICES | Observed GPU (from rocm-smi) | Works? |
|---|---|---|---|
| (unset) | (unset) | GPU 0 | Yes |
| (unset) | 0 | GPU 0 | Yes |
| (unset) | 0,1 | GPU 0 | Yes |
| (unset) | 1 | GPU 1 | No |
| (unset) | 1,0 | GPU 1 | No |
| 0 | (unset) | GPU 0 | Yes |
| 0 | 0 | GPU 0 | Yes |
| 0 | 0,1 | GPU 0 | Yes |
| 0 | 1 | GPU 1 | No |
| 0 | 1,0 | GPU 1 | No |
| 1 | (unset) | GPU 1 | No |
| 1 | 0 | GPU 1 | No |
| 1 | 0,1 | GPU 1 | No |
| 1 | 1 | GPU 0 | Yes |
| 1 | 1,0 | GPU 0 | Yes |

This wouldn't be an issue, except that the GPU order is ultimately based on PCIe order (lowest ID == GPU 0), and as a result in my system GPU0 is on a gen1x4 link whereas GPU1 is on a gen4x16 link.

Interestingly, ComfyUI and exllama2 both function fine on either GPU.

lufixSch commented 6 months ago

My main issue wasn't that HIP_VISIBLE_DEVICES doesn't work, but that part of the webui.sh script sets environment variables depending on the GPUs in the system, and the command used for detecting the GPU doesn't respect the value of HIP_VISIBLE_DEVICES. Since I have two different GPUs in my system, this causes problems. My current solution is that I replaced the problematic part of the script with an implementation which looks at HIP_VISIBLE_DEVICES. Sadly this solution is not generally applicable and I haven't found a general one; otherwise I would have opened a PR.
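For reference, the workaround is along these lines (a minimal sketch, not the exact code I run; it assumes a single index in HIP_VISIBLE_DEVICES and that lspci lists the AMD cards in the same order as ROCm enumerates them, which is not guaranteed):

```bash
# Only feed the AMD GPU selected via HIP_VISIBLE_DEVICES into the existing case statement.
amd_gpus=$(lspci 2>/dev/null | grep -E "VGA|Display" | grep "AMD/ATI")
line=$(( ${HIP_VISIBLE_DEVICES:-0} + 1 ))        # sed line numbers are 1-based
gpu_info=$(echo "$amd_gpus" | sed -n "${line}p")
```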

As for your observation, I find the behavior of the first five rows as expected. From then on it gets weird. I would have expected that either --device-id or HIP_VISIBLE_DEVICES has a higher priority and simply always overrides the other setting, but I have a hard time seeing any pattern, especially for the last two rows.

nonnull-ca commented 6 months ago

I made a mistake with the 2nd/4th last rows: --device-id=1 HIP_VISIBLE_DEVICES=1 (or 0) errors.


It appears that the two options 'stack'. That is, any remapping applies twice, once for each option.

So for instance with --device-id=1 HIP_VISIBLE_DEVICES=1,0:

Original mapping is device 0 -> GPU 0 | device 1 -> GPU 1. HIP_VISIBLE_DEVICES=1,0 remaps so that device 0 -> GPU 1 | device 1 -> GPU 0. --device-id=1 then selects device 1 -> GPU 0. So you run on GPU 0.
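Spelled out as a command (illustrative):

```bash
HIP_VISIBLE_DEVICES=1,0 ./webui.sh --device-id 1
# HIP_VISIBLE_DEVICES=1,0  ->  torch device 0 = physical GPU 1, torch device 1 = physical GPU 0
# --device-id 1            ->  webui selects torch device 1, i.e. physical GPU 0
```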


a part of the webui.sh script sets environment variables depending on the GPU in the system

Hm, which part? Do you mean the HSA_OVERRIDE_GFX_VERSION logic? I've been launching via launch.py (e.g. HIP_VISIBLE_DEVICES=0 TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.7' python launch.py --max-batch-count 9 --device-id=0).