AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: Torch unable to use RDNA3 Card #6032

Closed · Cleanup-Crew-From-Discord closed this issue 1 year ago

Cleanup-Crew-From-Discord commented 1 year ago

Is there an existing issue for this?

What happened?

I have seen similar issues, but none specifically relating to users with new RDNA3 cards.

Following the guide for installing on AMD-based systems on Linux, I run into the following error when launching:

TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2' python launch.py --precision full --no-half
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Commit hash: c6f347b81f584b6c0d44af7a209983284dbb52d2
Traceback (most recent call last):
  File "/home/dingus/stable-diffusion-webui/launch.py", line 294, in <module>
    prepare_environment()
  File "/home/dingus/stable-diffusion-webui/launch.py", line 209, in prepare_environment
    run_python("import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'")
  File "/home/dingus/stable-diffusion-webui/launch.py", line 73, in run_python
    return run(f'"{python}" -c "{code}"', desc, errdesc)
  File "/home/dingus/stable-diffusion-webui/launch.py", line 49, in run
    raise RuntimeError(message)
RuntimeError: Error running command.
Command: "/home/dingus/stable-diffusion-webui/venv/bin/python3" -c "import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'"
Error code: 1
stdout: <empty>
stderr: Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check
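
A quick way to confirm what the installed wheel was actually built against, independent of the webui, is to query torch directly from the venv. A minimal sketch (torch.version.hip is None on CUDA-only builds):

./venv/bin/python3 -c "
import torch
print(torch.__version__)         # installed wheel, e.g. 1.13.1+rocm5.2
print(torch.version.hip)         # HIP/ROCm version the wheel was compiled for
print(torch.cuda.is_available()) # whether the HIP runtime can actually see a GPU
"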

One workaround mentioned by other users is adding HSA_OVERRIDE_GFX_VERSION=10.3.0 when calling launch.py to trick the system into using the GPU anyway. It worked for previous cards, but for me it abruptly segfaults.

HSA_OVERRIDE_GFX_VERSION=10.3.0 TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2' python launch.py --precision full --no-half
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Commit hash: c6f347b81f584b6c0d44af7a209983284dbb52d2
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
/home/dingus/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [7460a6fa] from /home/dingus/stable-diffusion-webui/models/Stable-diffusion/sd-v1-4.ckpt
Segmentation fault (core dumped)

(Maybe I missed something in the wiki, but I can't find any kind of log for said dump.)
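
For what it's worth, on systemd-based distros the dump usually ends up in the journal rather than as a file on disk; a sketch of where to look (assumes coredumpctl is installed, which it isn't on every setup):

coredumpctl list python3    # recent core dumps from python processes
coredumpctl info            # metadata and a backtrace for the most recent dump
dmesg | grep -i segfault    # kernel-side record as a fallback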

Adding --skip-torch-cuda-test causes Stable Diffusion to run on the CPU only, which is agonizingly slow.

TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' python launch.py --precision full --no-half --skip-torch-cuda-test
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Commit hash: c6f347b81f584b6c0d44af7a209983284dbb52d2
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
Warning: caught exception 'No HIP GPUs are available', memory monitor disabled
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [7460a6fa] from /home/dingus/stable-diffusion-webui/models/Stable-diffusion/sd-v1-4.ckpt
Applying cross attention optimization (InvokeAI).
Model loaded.
Loaded a total of 0 textual inversion embeddings.
Embeddings: 
Running on local URL:  http://127.0.0.1:7860

The main line I noticed is "Warning: caught exception 'No HIP GPUs are available', memory monitor disabled". I think this means the error comes from torch being unable to detect the card. rocminfo shows it as properly detected by the system:

Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-XX                             
  Marketing Name:          Radeon RX 7900 XT                  
  Vendor Name:             AMD    
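
If the rocm5.2 wheel only ships kernels for older gfx targets, that would also explain the segfault above: HSA_OVERRIDE_GFX_VERSION=10.3.0 makes the runtime hand the card gfx1030 (RDNA2) binaries, and RDNA3 uses a different ISA. A quick sketch to check which targets the installed wheel actually contains (assuming torch.cuda.get_arch_list() reports the ROCm offload targets on ROCm builds):

./venv/bin/python3 -c "
import torch
# gfx1100 has to appear here for a 7900 XT to work without the override
print(torch.cuda.get_arch_list())
"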

Is this caused by RDNA3 cards simply being too new and not yet supported by torch?

Steps to reproduce the problem

Follow the steps to install for AMD GPUs, but with a new RDNA3 card (specifically a 7900 XT).

What should have happened?

GPU being recognized at all / program not segfaulting

Commit where the problem happens

c6f347b81f584b6c0d44af7a209983284dbb52d2

What platforms do you use to access the UI?

Linux

What browsers do you use to access the UI?

Mozilla Firefox

Command Line Arguments

All args were passed in from the command line and not added directly to webui.py; see above.

Additional information, context and logs

The system previously had a 2060 installed, but I removed it and rebuilt the stable-diffusion-webui folder for the new card. It worked flawlessly with said 2060.

aliencaocao commented 1 year ago

Yes, it's too new for the ROCm version that torch is compiled with.

Cleanup-Crew-From-Discord commented 1 year ago

Yes, it's too new for the ROCm version that torch is compiled with.

Well, there go my dreams of 20 GB of VRAM and crazy compute power (for now). If you or anyone else knows: how long did it take for torch to update to a ROCm version that supported the previous generation of cards? I'd like at least some kind of guess for a time frame until it becomes usable again.

ZhenyaPav commented 1 year ago

Having exactly the same problem right now. This alternative kinda works, but it has limited functionality. I believe someone mentioned that RDNA2 got ROCm support only a year after its release, so I'm not very optimistic.

ice051128 commented 1 year ago

Yes, it's too new for the ROCm version that torch is compiled with.

Well, there go my dreams of 20 GB of VRAM and crazy compute power (for now). If you or anyone else knows: how long did it take for torch to update to a ROCm version that supported the previous generation of cards? I'd like at least some kind of guess for a time frame until it becomes usable again.

What compute power? Even the previous 6900 XT got smoked by a 3050 out of the box with xformers in a Stable Diffusion benchmark, because of the acceleration CUDA provides. And that benchmark was done in November this year, which also tells you something about the support RDNA2 has.

gnusenpai commented 1 year ago

From what I can tell, ROCm has at least partial support for RDNA3, but I have no idea how complete it is. I've tried to build PyTorch myself, but it's quite difficult and I've more or less given up on it.
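
For reference, the rough shape of that source build (per PyTorch's ROCm build instructions; exact steps may differ by version, and gfx1100 kernels may still fail to build) is:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python tools/amd_build/build_amd.py    # hipify the CUDA sources in-tree
PYTORCH_ROCM_ARCH=gfx1100 USE_ROCM=1 python setup.py develop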

Cleanup-Crew-From-Discord commented 1 year ago

Having exactly the same problem right now. This alternative kinda works, but it has limited functionality.

Will definitely give this a look for now!

DOUBLEXX666 commented 1 year ago

What guide are you following to use the webui with your AMD card? I currently have an RX 5700 XT and I'm struggling to make it work. What guide do you recommend?

aliencaocao commented 1 year ago

@ClashSAN consider transferring this to discussion since it is a torch related issue

Cleanup-Crew-From-Discord commented 1 year ago

What guide are you following to use the webui with your AMD card? I currently have an RX 5700 XT and I'm struggling to make it work. What guide do you recommend?

The one on this GitHub repo's wiki.

There was also one on Reddit, but I've lost the link.

maikelsz commented 1 year ago

The compute power it has. CUDA as software does not per se provide any compute power; CUDA as a hardware architecture does, just as CDNA/RDNA do. Whether the libraries are optimized to make correct use of other vendors' GPU architectures (sometimes they are just left running generic functions), given the popularity, mindshare, and money a specific vendor throws around to push its own architecture and software, is a different matter.

See the example of Topaz Video Enhance AI, which makes balanced use of GPUs from NVIDIA, AMD, and Intel.

Now that AMD, NVIDIA, and Intel all have AI accelerator units (not to mention accelerators of other kinds and from other vendors), library and framework developers can no longer be so lazy, or so bought.

aliencaocao commented 1 year ago

Whether the libraries are optimized to make correct use of other vendors' GPU architectures, given the popularity, mindshare, and money a specific vendor throws around to push its own architecture and software, is a different matter.

That is the only matter here. There is a reason people choose NVIDIA GPUs over AMD GPUs for DL despite the higher prices. Anyway, since you are using PyTorch, you have to live with what PyTorch supports.

maikelsz commented 1 year ago

Whether the libraries are optimized to make correct use of other vendors' GPU architectures, given the popularity, mindshare, and money a specific vendor throws around to push its own architecture and software, is a different matter.

That is the only matter here. There is a reason people choose NVIDIA GPUs over AMD GPUs for DL despite the higher prices. Anyway, since you are using PyTorch, you have to live with what PyTorch supports.

Well, I mainly use TF. And yes, with my AMD...

aliencaocao commented 1 year ago

I am referring to this repo, not what you use in other projects. Unless you don't use this repo.

maikelsz commented 1 year ago

I am referring to this repo, not what you use in other projects. Unless you don't use this repo.

I do, with pure CPU compute, but I manage. Meanwhile, I use an alternative as a sidestep for faster testing before a full generation. Anyway...