[Bug]: Segmentation fault when running ROCm with RX 580 8GB

MRtecno98 commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits

What happened?

I installed the HIP/ROCm stack on a fresh installation of Manjaro (using binaries from arch4edu), and rocminfo correctly recognizes my GPU (and CPU). When running the webui, it doesn't complain about any missing GPU or CUDA support until it tries to load a model, at which point it segfaults.

Steps to reproduce the problem

Have a machine with ROCm set up
Launch the webui

What should have happened?

well, not segfault?

Commit where the problem happens

3596af07493ab7981ef92074f979eeee8fa624c4

What platforms do you use to access UI ?

Linux

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

ROC_ENABLE_PRE_VEGA=1 HSA_OVERRIDE_GFX_VERSION=10.3.0 TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' python launch.py --precision full --no-half

Additional information, context and logs

rocminfo output:

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    16323516(0xf913bc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16323516(0xf913bc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16323516(0xf913bc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 580 Series           
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26591(0x67df)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1366                               
  BDFID:                   2304                               
  Internal Node ID:        1                                  
  Compute Unit:            36                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8388608(0x800000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Webui log before it segfaults:

Python 3.10.8 (main, Nov  1 2022, 14:18:21) [GCC 12.2.0]
Commit hash: 3596af07493ab7981ef92074f979eeee8fa624c4
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loading weights [925997e9] from /home/<username>/Scrivania/stable-diffusion-webui/models/Stable-diffusion/animefull-final-pruned/animefull-final-pruned.ckpt
Using VAE found similar to selected model: /home/<username>/Scrivania/stable-diffusion-webui/models/Stable-diffusion/animefull-final-pruned/animefull-final-pruned.vae.pt
Loading VAE weights from: /home/<username>/Scrivania/stable-diffusion-webui/models/Stable-diffusion/animefull-final-pruned/animefull-final-pruned.vae.pt
zsh: segmentation fault (core dumped)  ROC_ENABLE_PRE_VEGA=1 HSA_OVERRIDE_GFX_VERSION=10.3.0 TORCH_COMMAND= python

(My system is in Italian; "Scrivania" in the directory names means Desktop.)

I found the env flags I put on the command line in various issues on GitHub. Without HSA_OVERRIDE_GFX_VERSION=10.3.0, torch doesn't even recognize that CUDA is available.
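The rocminfo output above reports the card's ISA as gfx803, while HSA_OVERRIDE_GFX_VERSION=10.3.0 asks the ROCm runtime to present the GPU as version 10.3.0, which under the usual gfx naming scheme would be gfx1030. Below is a minimal sketch for comparing the two. It assumes the `amdgcn-amd-amdhsa--gfxNNN` line format shown above, and it assumes the gfx name is built from major, minor, and stepping with the last two written as single hex digits. `gpu_isa_targets` and `override_to_gfx` are hypothetical helper names, not part of ROCm or the webui. Whether this mismatch is what causes the segfault is not established in this thread.

```python
import re


def gpu_isa_targets(rocminfo_text):
    """Return the gfx targets named in rocminfo ISA lines, sorted and de-duplicated."""
    # Assumed line format: "Name: amdgcn-amd-amdhsa--gfx803"
    return sorted(set(re.findall(r"amdgcn-amd-amdhsa--(gfx[0-9a-f]+)", rocminfo_text)))


def override_to_gfx(version):
    """Map a 'major.minor.stepping' string to a gfx name (assumed naming scheme)."""
    major, minor, stepping = (int(part) for part in version.split("."))
    # Minor and stepping are assumed to be written as hex digits.
    return f"gfx{major}{minor:x}{stepping:x}"


if __name__ == "__main__":
    # One line copied from the rocminfo output in this report.
    sample = "      Name:                    amdgcn-amd-amdhsa--gfx803\n"
    reported = gpu_isa_targets(sample)
    requested = override_to_gfx("10.3.0")
    print("reported by rocminfo:", reported)
    print("requested by override:", requested)
    if requested not in reported:
        print("the override names an ISA that the card does not report")
```

To check a real machine, the text passed to `gpu_isa_targets` could come from running `rocminfo` and capturing its output.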

denliner commented 2 years ago

You need to downgrade hsa-rocr to 5.3.0-2 (you can use the downgrade package: https://aur.archlinux.org/packages/downgrade). arch4edu updated hsa-rocr without actually checking whether it breaks something.

thesandwichman294 commented 2 years ago

I have the same issue running Arch Linux with an AMD 5600G, an RX 570 4GB, and 32 GB of RAM.

I manually downgraded to hsa-rocr 5.3.0-2 using makepkg -si and the PKGBUILD for 5.3.0-2, since the downgrade package didn't offer 5.3.0-2 as an option. It did not work: it segfaults around the 2-minute mark, and the screen flashes black for an instant when it happens. I'm using the same command line arguments as MRtecno98.

For context, I followed this guide after the wiki instructions resulted in "Torch is not able to use GPU". I also tried this and this, but they didn't work either. On Windows this works but it's slow, about 5 minutes per picture, and using the CPU with stable-diffusion-webui takes around 8 minutes. So either there is a bug or I missed something; I have been trying to get the AMD GPU to work for several hours now with different install methods, and all the steps are jumbled up in my mind.

EmiliaTheGoddess commented 2 years ago

The same happens to me. Downgrading to hsa-rocr 5.3.0-2 did not help. The script hangs as soon as it hits the Global Step: xxxxxx part, with no error message or anything. The kernel kills it 30-60 seconds later. I tried both the arch4edu and pip methods. It used to work with arch4edu a few updates ago, by the way, so I'm guessing it's an issue with them updating something. Not sure if it's worth mentioning, but I tried both Stable Diffusion's model and Waifu Diffusion's model. Neither works.

rabidcopy commented 2 years ago

AFAIK you still need PyTorch packages specifically compiled to work with gfx803: https://github.com/xuhuisheng/rocm-gfx803. The drawback is that these are only for Python 3.8. They're what works for me on my RX 570.

thesandwichman294 commented 2 years ago

I tried rabidcopy's suggestion. Following these instructions on a fresh Ubuntu 20.04.5 install fails with OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

Looking at the repo issues, there seems to be a solution provided by tmpuserx. However, Stable Diffusion still does not work and crashes with:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0]
Commit hash: 44c46f0ed395967cd3830dd481a2db759fda5b3b
Traceback (most recent call last):
  File "launch.py", line 294, in <module>
    prepare_enviroment()
  File "launch.py", line 209, in prepare_enviroment
    run_python("import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'")
  File "launch.py", line 73, in run_python
    return run(f'"{python}" -c "{code}"', desc, errdesc)
  File "launch.py", line 49, in run
    raise RuntimeError(message)
RuntimeError: Error running command.
Command: "/home/jose/Downloads/stable-diffusion-webui/venv/bin/python3" -c "import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'"
Error code: 134
stdout: <empty>
stderr: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)

For some reason using ROC_ENABLE_PRE_VEGA=1 and HSA_OVERRIDE_GFX_VERSION=10.3.0 doesn't seem to work.
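The "Error code: 134" in the log above is consistent with its "Aborted (core dumped)" line: shells conventionally report 128 plus the signal number when a child process is killed by a signal, and 134 - 128 = 6 is SIGABRT on Linux. A small sketch of that decoding follows; `describe_exit_code` is a hypothetical helper name, and the 128 + N convention is assumed to apply to the shell that launched the check.

```python
import signal


def describe_exit_code(code):
    """Describe a shell-style exit status, decoding 128 + N as signal N."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # Not a known signal number on this platform.
            pass
    return f"exit status {code}"


if __name__ == "__main__":
    # 134 is the code reported by launch.py in the traceback above.
    print(describe_exit_code(134))
```

Read this way, the torch check did not simply return False; the process was aborted after the runtime reported hipErrorNoBinaryForGpu.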

vittorio88 commented 1 year ago

@thesandwichman294 Are you entering the following in your shell before running the app? export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/hip/lib This sets the environment variable LD_LIBRARY_PATH for the current shell session, which tells the dynamic linker where to find additional libs like libmpi.
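To see whether the library from the earlier OSError can be resolved in the current environment, something like the following might help. It is a minimal sketch: `can_load` is a hypothetical helper name, and it only reports whether the dynamic linker can load the named library, not where the library should come from. LD_LIBRARY_PATH has to be set before Python starts for it to affect the result.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic linker can resolve and load library_name."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Library name taken from the OSError reported earlier in this thread.
    name = "libmpi_cxx.so.40"
    status = "found" if can_load(name) else "not found"
    print(f"{name}: {status}")
```

Running it once before and once after the export shows whether the extra path makes a difference.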

drygdryg commented 1 year ago

It seems that this is a problem with the official PyTorch build: it is built without support for GFX803 (to which the RX 580 belongs). I use Arch Linux and PyTorch from official repositories built with GFX803 support. It is built for Python 3.11, but it worked without problems in my case. So if you use Arch Linux, you can install PyTorch from Arch Linux official repositories instead of installing a build from PyTorch developers.

Install PyTorch:

If you have a CPU without AVX2 support:

# pacman -S python-pytorch-rocm

If you have a CPU with AVX2 support:

# pacman -S python-pytorch-opt-rocm

and TorchVision:

# pacman -S python-torchvision

Create a new virtual environment with system-site packages to provide system PyTorch:

# pacman -S virtualenv
$ cd stable-diffusion-webui/
$ rm -rf venv
$ virtualenv --system-site-packages venv

Edit webui-user.sh to disable installation of PyTorch by webui.sh:
```
export TORCH_COMMAND="pip"
```
Launch webui.sh.
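If you are unsure which of the two PyTorch packages above applies, the CPU flags in /proc/cpuinfo can tell you. This is a minimal sketch, assuming a Linux system where /proc/cpuinfo has a "flags" line; `has_avx2` and `pytorch_package` are hypothetical helper names, and the package names are the ones given in this comment.

```python
from pathlib import Path


def has_avx2(cpuinfo_text):
    """Return True if the first 'flags' line of cpuinfo_text lists avx2."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return "avx2" in line.split(":", 1)[1].split()
    return False


def pytorch_package(cpuinfo_text):
    """Pick the Arch package name from the comment above based on AVX2 support."""
    if has_avx2(cpuinfo_text):
        return "python-pytorch-opt-rocm"
    return "python-pytorch-rocm"


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(pytorch_package(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; check the CPU's AVX2 support another way")
```

The printed name is the package to pass to pacman -S.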

Ultra119 commented 1 year ago

Using rocm version 5.5.0 fixed segfault for me (RX 580):

Reinitialize your venv
Enter this: export TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.5.0'
Run python3 launch.py --precision full --no-half --opt-sub-quad-attention --lowvram --disable-nan-check --skip-torch-cuda-test

In this case webui.sh does not need to be touched

AUTOMATIC1111 / stable-diffusion-webui