AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Segmentation fault running on docker with Radeon 5700 #6420

Open badarg1 opened 1 year ago

badarg1 commented 1 year ago

Is there an existing issue for this?

What happened?

I'm trying to run this in a Docker container on an Ubuntu 22.04.1 machine with a Radeon 5700 ITX GPU (8 GB), a Ryzen 5 3600 CPU, and 16 GB of RAM.

I followed the instructions from the wiki: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#running-inside-docker

When I try to start the UI, I get a segmentation fault:

(venv) root@borg:/dockerx/stable-diffusion-webui# TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' python launch.py --precision full --no-half
Python 3.9.5 (default, Nov 23 2021, 15:27:38) 
[GCC 9.3.0]
Commit hash: c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
Applying cross attention optimization (Doggettx).
Segmentation fault (core dumped)

I upgraded to python 3.9 following the instructions in the wiki, with the same results.

I suspect it might be related to this other issue, but created a new issue as I'm not sure: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6403

Steps to reproduce the problem

  1. Set up the docker container as instructed in the wiki
  2. Start the UI with the command provided in the wiki

What should have happened?

UI should start up.

Commit where the problem happens

c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8

What platforms do you use to access UI ?

Linux

What browsers do you use to access the UI ?

No response

Command Line Arguments

I use `--precision full` and `--no-half` as instructed in the wiki.

I also tried removing them in every combination, with the same result.

Additional information, context and logs

The image id of the docker image I'm using is 614789dfdb38.

Find the dumped core here (2.8 GB): https://drive.google.com/file/d/1n-ulnrYZ1pjkF9xUJgYasCY3qk8rr5vW/view?usp=share_link
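If anyone wants to inspect a core like this, a backtrace usually narrows down which library faulted. A minimal sketch, assuming gdb is installed in the container and the core was produced by the venv's Python (the paths below are illustrative):

gdb /dockerx/stable-diffusion-webui/venv/bin/python /path/to/core
# then, at the (gdb) prompt, print the backtrace of the crashing thread:
bt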

wsippel commented 1 year ago

Seeing possibly the same thing on a 7900XTX, running directly on Arch Linux (without Docker):

LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /home/wsippel/Applications/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
fish: Job 1, './webui.sh' terminated by signal SIGSEGV (Address boundary error)
sALTaccount commented 1 year ago

Same issue, except I'm using a Radeon Pro Duo (Polaris) and no Docker.

irusensei commented 1 year ago

Same issue here. No docker. Spotted this on dmesg output:

[13066.414044] python3[140319]: segfault at 20 ip 00007fbd318d71d2 sp 00007fff7bc3fcd0 error 4 in libamdhip64.so.5.4.50401[7fbd3181f000+351000]
JilekJosef commented 1 year ago

Same issue. RX 6600

JilekJosef commented 1 year ago

Oh, actually I just found a solution, or at least one that works for me: I used the HSA_OVERRIDE_GFX_VERSION=10.3.0 fix, ran the command below, and the segmentation fault disappeared.

TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' REQS_FILE='requirements.txt' HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py
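A small sketch of how these variables can be persisted so they don't have to be typed on every launch, assuming you start the UI via webui.sh (which sources webui-user.sh); whether you edit that file or export the variables elsewhere is up to you:

# webui-user.sh (sourced by webui.sh on startup) -- assumed setup, adjust to taste
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1'
export REQS_FILE='requirements.txt'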

irusensei commented 1 year ago

I was still getting errors on my 6800m even with HSA_OVERRIDE_GFX_VERSION=10.3.0. It basically ends up with:

 terminate called after throwing an instance of 'miopen::Exception'
  what():  /MIOpen/src/hipoc/hipoc_program.cpp:300: Code object build failed. Source: naive_conv.cpp
Aborted (core dumped)

After a bit of googling I found some settings that mitigate the problem, but I have no idea whether those settings (which I assume disable naive_conv) affect performance or results.

export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0

It's working now, but I have to manage memory even on a GPU with 12 GB of VRAM. For example, I can't do more than 3-4 batches.

flaep commented 1 year ago

Same issue here with an RX 580 and an RX 480. Neither

export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0

nor adding HSA_OVERRIDE_GFX_VERSION=10.3.0 helps.

connor-corso commented 1 year ago

This worked for me on Fedora with a 5600G and a 6600 XT, 16 GB of RAM:

export AMDGPU_TARGETS="gfx1010"
export HSA_OVERRIDE_GFX_VERSION=10.3.0
TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' REQS_FILE='requirements.txt' python launch.py --precision full --no-half

I found this from here link but did not have to do step 2

badarg1 commented 1 year ago

Setting export HSA_OVERRIDE_GFX_VERSION=10.3.0 seems to help and the web UI now loads, but I still can't get it to work.

When the web UI loads it prints this on the console:

(venv) root@borg:/dockerx/stable-diffusion-webui# TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' python launch.py --precision full --no-half
Python 3.9.5 (default, Nov 23 2021, 15:27:38) 
[GCC 9.3.0]
Commit hash: c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(0): 
Model loaded.
Running on local URL:  http://127.0.0.1:7860

After I try to generate a txt2img using the default settings and a simple sentence as prompt, it prints this:

To create a public link, set `share=True` in `launch()`.
  0%|                                                    | 0/20 [00:00<?, ?it/s]

That seems to be a progress bar, but it does not progress. The progress bar in the web UI also does not move. If I hide the progress bar from the web UI (by adding the hidden attribute to it) I find it says 4/75 underneath, but that also does not advance. I left the process running for over 30 minutes and still nothing. How long should this process take with the hardware described in the first post?

BTW, there seems to be activity on the CPU, but only on one core:

$ docker stats wonderful_greider --no-stream
CONTAINER ID   NAME                CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O         PIDS
a34014c5911b   wonderful_greider   101.73%   3.772GiB / 15.57GiB   24.23%    0B / 0B   7.28GB / 2.14MB   22

The model I'm using is 4 GB. Maybe it's running on the CPU instead of the GPU?
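One way to check whether torch actually sees the GPU (rather than silently falling back to the CPU) is a quick check inside the same venv; a minimal sketch:

import torch
print(torch.cuda.is_available())          # should print True if the ROCm device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the device torch will run on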

There is also activity on the GPU, although the cool temperature and low power consumption make me suspicious:

# rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU[0]      : sclk current clock frequency not found
================================================================================
GPU  Temp (DieEdge)  AvgPwr  SCLK  MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
0    56.0c           50.0W   None  500Mhz  34.12%  auto  150.0W   60%   99%   
================================================================================
============================= End of ROCm SMI Log ==============================

The VRAM% oscillates between 59% and 63%.

49RpK5dY commented 1 year ago

@badarg1 Try using another docker image. For whatever reason, rocm/pytorch:latest hasn't worked for me on the RX 5700 since rocm5.3; it gets stuck at 0%. Try rocm/pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21 or any of the official rocm5.2.3 images from rocm/pytorch. Those work perfectly.
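For reference, switching images only means changing the tag in the wiki's docker run line; a rough sketch, assuming the standard ROCm device flags used in the wiki (check the wiki page linked in the first post for the exact invocation):

docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  -v $HOME/dockerx:/dockerx \
  rocm/pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21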

badarg1 commented 1 year ago

@49RpK5dY I tried with rocm/pytorch:rocm5.2.3_ubuntu20.04_py3.7_pytorch_1.12.1 and it worked. Thank you.

Maybe these workarounds should be documented in the wiki?

49RpK5dY commented 1 year ago

Maybe these workarounds should be documented in the wiki?

I opened an issue #2655 about this but later closed it as it seemed I was the only one having this problem. It might be relevant to some specific hardware. But yeah, adding this to wiki could be useful.

Dolidodzik commented 1 year ago

@VBBr could you explain in some more detail what you did to make it work? I have a very similar issue on my RX 570.

deba33 commented 1 year ago

Got the same error on Manjaro with an AMD CPU and GPU. https://stackoverflow.com/questions/75591043/got-segmentation-fault-while-launching-stable-diffusion-webui-webui-sh

linwownil commented 1 year ago

I was getting the segmentation fault error with the Automatic Installation guide as well, with an RX 6800 on Artix Linux. Since then, I have found another installation method for Arch-based distributions, which uses PyTorch and Torchvision built with ROCm from the Arch repos.

You can find the written guide here for now; I hope it will be included in this repo too (see #8170).

zhouhao27 commented 1 year ago

Got the same error on a Mac M1 (CPU). It was working fine, but suddenly I got this error and now it cannot start anymore.

achhabra2 commented 1 year ago

I'm also running into the Segmentation Fault issue exactly as you mentioned. Except I am not using docker. Running on Fedora 37. I manually created a Python 3.10 virtual environment because Python 3.11 was the default installation.

5950X CPU + 7900 XTX GPU, 32 GB RAM

chirvo commented 1 year ago

Same error here. Has anyone come up with a solution, or at least a workaround?

Ryzen 5950x, RX 7900 XTX, 64 GB RAM

wsippel commented 1 year ago

@achhabra2 @bigchirv the 7900 segfaults are their own thing. PyTorch on RDNA3 simply isn't supported in ROCm 5.4. It'll hopefully be fixed with the upcoming 5.5 release.

chirvo commented 1 year ago

Thanks for the heads up!

echoidcf commented 1 year ago

Maybe these workarounds should be documented in the wiki?

I opened an issue #2655 about this but later closed it as it seemed I was the only one having this problem. It might be relevant to some specific hardware. But yeah, adding this to wiki could be useful.

Hey, is your 5700 working OK now? I have exactly the same problem. Can you tell me the details of the setup you are using now: amdgpu version, docker image, python version, etc.?

49RpK5dY commented 1 year ago

The docker image is pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21, but any older image with rocm5.2 should work. I also updated Python; the wiki instructions for that still work. https://download.pytorch.org/whl/rocm5.1.1 no longer works, as it is no longer available. You can install PyTorch in the venv with this instead:

pip install torch==1.13.0+rocm5.2 torchvision==0.14.0+rocm5.2 --extra-index-url https://download.pytorch.org/whl/rocm5.2

and launch with:

TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2' python launch.py

As for other launch parameters, I'm using --medvram --no-half --no-half-vae --opt-sub-quad-attention. It generates gray squares without --no-half, and --medvram --opt-sub-quad-attention saves a lot of VRAM.

zeze0556 commented 1 year ago

Same error here

RX590 + ubuntu 22.04 + amdgpu-install 5.4.5

segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

echoidcf commented 1 year ago

Same error here

RX590 + ubuntu 22.04 + amdgpu-install 5.4.5

segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

OK, OK, let me end this issue. This happens because libamdhip64.so calls an AVX2 instruction, which will cause this problem if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so, but without luck. You can give it a try if you insist.
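For anyone unsure whether their CPU advertises AVX2, it can be checked from /proc/cpuinfo; a quick example:

# prints "avx2" if the CPU advertises the extension, nothing otherwise
grep -o -m1 avx2 /proc/cpuinfo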

shelbydavis commented 1 year ago

Same error here RX590 + ubuntu 22.04 + amdgpu-install 5.4.5 segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

OK,OK Let me end this issue. This is because libamdhip64.so call a AVX2 instruction which will cause this problem if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so but without luck. You can have a try if you insist.

The original bug is on a Ryzen 5 3600, which supports AVX2 instructions (verified using cat /proc/cpuinfo on my own 3600).

I'm getting the same issue on a clean install of Ubuntu Server 22.04.2 with ROCm 5.5.1 and building pytorch / torchvision from source.

zeze0556 commented 1 year ago

Same error here RX590 + ubuntu 22.04 + amdgpu-install 5.4.5 segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

OK,OK Let me end this issue. This is because libamdhip64.so call a AVX2 instruction which will cause this problem if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so but without luck. You can have a try if you insist.

My CPU (Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz) supports AVX2. I have tested ROCm 5.4.x and 5.5, and both give the same error.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
stepping        : 3
microcode       : 0x28
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips        : 7995.18
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

echoidcf commented 1 year ago

Same error here RX590 + ubuntu 22.04 + amdgpu-install 5.4.5 segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

OK,OK Let me end this issue. This is because libamdhip64.so call a AVX2 instruction which will cause this problem if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so but without luck. You can have a try if you insist.

My cpu (Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz) support AVX2 . I have tested rocm 5.4.x and 5.5, and both have the same error.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
stepping        : 3
microcode       : 0x28
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips        : 7995.18
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

You can first try to run Python and just `import torch`. If that fails, it is the AVX2 thing; if not, something else may be wrong.
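A minimal way to run that test from the venv (torch is the module name PyTorch installs):

# segfaults at import time if it is the libamdhip64 / CPU-instruction issue
python -c "import torch; print(torch.__version__)"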

tutu329 commented 1 year ago

I added some print() calls in webui.py. The segmentation fault problem is caused by `import pytorch_lightning`. After I ran `pip install protobuf --upgrade`, the segmentation fault was gone (it was a version conflict between pytorch_lightning and protobuf).
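In case it helps others hitting the same conflict, a sketch of checking and then upgrading the packages involved (package names taken from the comment above):

pip show pytorch_lightning protobuf   # inspect the currently installed versions
pip install --upgrade protobuf        # the upgrade that removed the segfault here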

tutu329 commented 12 months ago

@tutu329, can you note what code you added to webui.py to print() the cause of your segmentation fault? I'd like to try this as well.

Very simple, like this:

print('===============1=================')
print('===============2=================')
...
print('===============n=================')

Then you can find the bug :)
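To make the trick concrete: the markers go between the top-level imports in webui.py, so the last marker that prints tells you which import crashed. A sketch only; the imports shown are illustrative, not the real order in webui.py:

print('=============== 1: before torch ===============')
import torch
print('=============== 2: before pytorch_lightning ===============')
import pytorch_lightning
print('=============== 3: imports survived ===============')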

tutu329 commented 12 months ago

very simple like this:

print('===============1=================')
print('===============2=================')
...
print('===============n=================')

Sorry, I'm not an adept python coder. I know in the .py file, print('example') is used to print some text. I'm not clear on where you're putting the "print" lines in the webui.py file or what you're actually telling it to print. Are the equals signs in your example indicative of something specific, like a line of code, or was that just an example? Apologies for not understanding. I did try to look this up through examples of the print file and printing exceptions, but I'm not quite getting it.

Yes, I have worked with Python for some years and with C++ for many years. The print trick is just a bet that I can solve the problem once I find where the bug is. Simply, I put print() wherever I guess there might be a bug in the source code. I use the print trick because the webui is a simple project, though a very good one. I believed I could solve the bug, and luckily it was just a version problem.

tutu329 commented 12 months ago

very simple like this:

print('===============1=================')
print('===============2=================')
...
print('===============n=================')

Sorry, I'm not an adept python coder. I know in the .py file, print('example') is used to print some text. I'm not clear on where you're putting the "print" lines in the webui.py file or what you're actually telling it to print. Are the equals signs in your example indicative of something specific, like a line of code, or was that just an example? Apologies for not understanding. I did try to look this up through examples of the print file and printing exceptions, but I'm not quite getting it.

If '===============1=================' appears, it means the code ran to that point. That is the trick.

tutu329 commented 12 months ago

I'm finding my segmentation fault occurs at the point when it's waiting for the server.

timer.startup_record = startup_timer.dump()
print(f"Startup time: {startup_timer.summary()}.")
try:
    while True:
        server_command = shared.state.wait_for_server_command(timeout=5)

I extended the timeout to 60, but the segmentation fault occurred in about the same time frame as before. Any idea why the seg fault is occurring here?

It is strange. Usually a segmentation fault never happens in normal Python code; it is like a pointer error in C++, so it is most often caused by a Python module version problem. I don't think your problem is the timeout. My AMD 6900 XT only works with torch 1.13 + ROCm 5.2 (Ubuntu 22.04). I recommend you create a new, clean conda environment, try installing a fresh torch+rocm build such as 1.13+5.2, and install the webui step by step (remember to `pip install -r requirements.txt`).
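A rough sketch of that clean-environment route, reusing the torch/rocm versions already mentioned earlier in this thread (the environment name and Python version below are arbitrary choices):

conda create -n sdwebui python=3.10 -y
conda activate sdwebui
pip install torch==1.13.0+rocm5.2 torchvision==0.14.0+rocm5.2 \
    --extra-index-url https://download.pytorch.org/whl/rocm5.2
pip install -r requirements.txt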

tutu329 commented 12 months ago

I'm finding my segmentation fault occurs at the point when it's waiting for the server.

timer.startup_record = startup_timer.dump()
print(f"Startup time: {startup_timer.summary()}.")
try:
    while True:
        server_command = shared.state.wait_for_server_command(timeout=5)

I extended the timeout to 60, but the segmentation fault occurred in about the same time frame as before. Any idea why the seg fault is occurring here?

What is your result of:

import torch
torch.cuda.is_available()
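For reference, a quick way to run that check non-interactively; torch.version.hip should be a HIP/ROCm version string on a ROCm build of PyTorch (that attribute is an assumption about the ROCm wheels, so verify on your install):

# expect "True" plus a HIP version string when the GPU is usable;
# "False" (or a crash at import) points back at the problems discussed above
python -c "import torch; print(torch.cuda.is_available()); print(torch.version.hip)"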