Open badarg1 opened 1 year ago
Seeing possibly the same thing on a 7900XTX, running directly on Arch Linux (without Docker):
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /home/wsippel/Applications/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
fish: Job 1, './webui.sh' terminated by signal SIGSEGV (Address boundary error)
Same issue, except I'm using a Radeon Pro Duo (Polaris) and no docker
Same issue here. No docker. Spotted this on dmesg output:
[13066.414044] python3[140319]: segfault at 20 ip 00007fbd318d71d2 sp 00007fff7bc3fcd0 error 4 in libamdhip64.so.5.4.50401[7fbd3181f000+351000]
Same issue. RX 6600
Oh, actually I just found a solution, at least for myself: I used the HSA_OVERRIDE_GFX_VERSION=10.3.0 fix and ran the command below, and the segmentation fault disappeared.
TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' REQS_FILE='requirements.txt' HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py
I was still getting errors on my 6800m even with HSA_OVERRIDE_GFX_VERSION=10.3.0. It basically ends up with:
terminate called after throwing an instance of 'miopen::Exception'
what(): /MIOpen/src/hipoc/hipoc_program.cpp:300: Code object build failed. Source: naive_conv.cpp
Aborted (core dumped)
After a bit of googling I found some settings that mitigate the problem, but I have no idea whether those settings (which I assume disable naive_conv) affect performance or results.
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0
It's working now, but I have to manage memory even on a 12 GB GPU; for example, I can't do more than 3-4 batches.
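Put together, that workaround amounts to exporting the overrides before launching. A minimal launcher sketch, assuming the stock launch.py entry point (the actual launch line is left commented out so the snippet is safe to run outside the webui checkout; whether disabling the naive_conv kernels affects results is unverified):

```shell
# Apply the MIOpen workaround from the comment above, then launch.
# These are the commenter's settings, not project defaults.
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# python launch.py   # uncomment inside the webui venv
echo "naive_conv disabled: FWD=$MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD"
```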
Same issue here with an RX580 and an RX480. Neither
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0
nor adding
HSA_OVERRIDE_GFX_VERSION=10.3.0
helps.
This worked for me on Fedora with a 5600G and a 6600XT, 16 GB of RAM:
export AMDGPU_TARGETS="gfx1010"
export HSA_OVERRIDE_GFX_VERSION=10.3.0
TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' REQS_FILE='requirements.txt' python launch.py --precision full --no-half
I found this via a link, but did not have to do step 2.
Setting export HSA_OVERRIDE_GFX_VERSION=10.3.0
seems to help and the web UI now loads, but I still can't seem to make it work.
When the web UI loads it prints this on the console:
(venv) root@borg:/dockerx/stable-diffusion-webui# TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' python launch.py --precision full --no-half
Python 3.9.5 (default, Nov 23 2021, 15:27:38)
[GCC 9.3.0]
Commit hash: c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(0):
Model loaded.
Running on local URL: http://127.0.0.1:7860
After I try to generate a txt2img using the default settings and a simple sentence as prompt, it prints this:
To create a public link, set `share=True` in `launch()`.
0%| | 0/20 [00:00<?, ?it/s]
That seems to be a progress bar, but it does not progress.
The progress bar in the web UI also does not move.
If I hide the progress bar from the web UI (by adding the hidden attribute to it), I find it says 4/75 underneath, but that also does not advance.
I left the process running for over 30 minutes and still nothing.
How long should this process take with the hardware described in the first post?
BTW, there seems to be activity on the CPU, but only on one core:
$ docker stats wonderful_greider --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
a34014c5911b wonderful_greider 101.73% 3.772GiB / 15.57GiB 24.23% 0B / 0B 7.28GB / 2.14MB 22
The model I'm using is 4 GB. Maybe it's running on the CPU instead of the GPU?
There is also activity on the GPU, although the cool temperature and low power consumption make me suspicious:
# rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU[0] : sclk current clock frequency not found
================================================================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 56.0c 50.0W None 500Mhz 34.12% auto 150.0W 60% 99%
================================================================================
============================= End of ROCm SMI Log ==============================
The VRAM% oscillates between 59% and 63%.
@badarg1 Try using another docker image. For whatever reason rocm/pytorch:latest doesn't work for me on an RX 5700 since rocm5.3 and gets stuck at 0%. Try rocm/pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21 or any of the official rocm5.2.3 images from rocm/pytorch. Those work perfectly.
@49RpK5dY I tried with rocm/pytorch:rocm5.2.3_ubuntu20.04_py3.7_pytorch_1.12.1
and it worked. Thank you.
Maybe these workarounds should be documented in the wiki?
Maybe these workarounds should be documented in the wiki?
I opened issue #2655 about this but later closed it, as it seemed I was the only one having this problem. It might be relevant to some specific hardware. But yeah, adding this to the wiki could be useful.
@VBBr could you explain in more detail what you did to make it work? I have a very similar issue on my RX570.
Got the same error on Manjaro with an AMD CPU and GPU. https://stackoverflow.com/questions/75591043/got-segmentation-fault-while-launching-stable-diffusion-webui-webui-sh
I was getting the segmentation fault error with the Automatic Installation guide as well, with an RX6800 on Artix Linux. Since then, I have found another installation method for Arch-based distributions, which involves using PyTorch and Torchvision built with ROCm from the Arch repos. You can find the written guide here for now; hopefully it will be included in this repo too (see #8170).
Got the same error on a Mac M1 (CPU). It was working fine, but I suddenly got this error and now it cannot start anymore.
I'm also running into the Segmentation Fault issue exactly as you mentioned. Except I am not using docker. Running on Fedora 37. I manually created a Python 3.10 virtual environment because Python 3.11 was the default installation.
5950x CPU + 7900 XTX GPU, 32 gb ram
Same error here. Anyone has come up with a solution, or at least a workaround?
Ryzen 5950x, RX 7900 XTX, 64 GB RAM
@achhabra2 @bigchirv the 7900 segfaults are their own thing. PyTorch on RDNA3 simply isn't supported in ROCm 5.4. It will hopefully be fixed with the upcoming 5.5 release.
Thanks for the heads up!
Hey, how is it going with your 5700? I have exactly the same problem. Can you tell me the details of your current setup? amdgpu version, docker image, python version, etc.
The docker image is pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21, but any older image with rocm5.2 should work. I also updated Python; the wiki instructions for that still work. https://download.pytorch.org/whl/rocm5.1.1 no longer works, as it is no longer available. You can install pytorch in the venv with this instead:
pip install torch==1.13.0+rocm5.2 torchvision==0.14.0+rocm5.2 --extra-index-url https://download.pytorch.org/whl/rocm5.2
and launch with
TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2' python launch.py
As for other launch parameters, I'm using --medvram --no-half --no-half-vae --opt-sub-quad-attention. It will generate gray squares without --no-half, and --medvram --opt-sub-quad-attention saves a lot of VRAM.
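For reference, those pieces combine into a launch line like the sketch below. The flag choices are the commenter's; the actual launch call is left commented out so the snippet can be inspected outside the webui checkout.

```shell
# Sketch of the launch setup described above (RX 5700 / ROCm 5.2).
export TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2'
LAUNCH_FLAGS="--medvram --no-half --no-half-vae --opt-sub-quad-attention"
# python launch.py $LAUNCH_FLAGS   # uncomment inside the webui venv
echo "$LAUNCH_FLAGS"
```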
Same error here
RX590 + ubuntu 22.04 + amdgpu-install 5.4.5
segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]
OK, let me end this issue: this happens because libamdhip64.so calls an AVX2 instruction, which crashes if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so, but without luck; you can have a try if you insist.
The original bug is on a Ryzen 5 3600, which includes AVX2 instructions (verified using cat /proc/cpuinfo on my 3600).
I'm getting the same issue on a clean install of Ubuntu Server 22.04.2 with ROCm 5.5.1 and building pytorch / torchvision from source.
My CPU (Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz) supports AVX2. I have tested ROCm 5.4.x and 5.5, and both give the same error.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
stepping : 3
microcode : 0x28
cpu MHz : 800.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips : 7995.18
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
You can first try running Python with just:
import torch
If this fails, it is the AVX2 thing; if not, maybe something else is wrong.
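One way to make that check robust: run the import in a child interpreter, since a segfault kills the whole process and cannot be caught as a Python exception. A small sketch (the import_ok helper name is made up for illustration):

```python
import subprocess
import sys

def import_ok(module: str) -> bool:
    """Return True if `python -c "import <module>"` exits cleanly.

    A segfault during import surfaces as a negative return code
    (the signal number), so a crash inside the import is isolated
    here instead of killing the current interpreter.
    """
    result = subprocess.run([sys.executable, "-c", f"import {module}"])
    return result.returncode == 0

print(import_ok("json"))   # stdlib module, expected True
# import_ok("torch")       # returns False if torch itself crashes on import
```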
I added some print() calls in webui.py; the segmentation fault problem is caused by "import pytorch_lightning". After I ran: pip install protobuf --upgrade, the segmentation fault was gone (it's a version problem between pytorch_lightning and protobuf).
@tutu329, can you help note what code you added to webui.py to print() the cause of your segmentation fault? I'd like to try this as well.
Very simple, like this:
print('===============1=================')
print('===============2=================')
...
print('===============n=================')
then you can find the bug :)
Sorry, I'm not an adept Python coder. I know that in a .py file, print('example') is used to print some text. I'm not clear on where you're putting the print lines in webui.py or what you're actually telling it to print. Are the equals signs in your example indicative of something specific, like a line of code, or was that just an example? Apologies for not understanding. I did try to look this up through examples of print usage and printing exceptions, but I'm not quite getting it.
Yes, I have played with Python for some years and C++ for many years. The print trick is just a bet that I can solve the problem once I find where the bug is. Simply put, I place print() wherever I guess there might be a bug in the source code. I used the print trick because the webui is a simple project, though a very good one. I believed I could solve the bug, and luckily it was just a version problem.
If '===============1=================' appears, it means the code can run to that point. That is the trick.
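As a tiny self-contained illustration of that marker trick (json and csv here are harmless stand-ins for suspect imports like pytorch_lightning):

```python
# Numbered markers around suspect statements: the last marker printed
# before a crash brackets the failing line.
print('===============1=================')
import json   # stand-in for a suspect import
print('===============2=================')
import csv    # another stand-in
print('===============3=================')
```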
I'm finding my segmentation fault occurs at the point when it's waiting for the server.
timer.startup_record = startup_timer.dump()
print(f"Startup time: {startup_timer.summary()}.")
try:
    while True:
        server_command = shared.state.wait_for_server_command(timeout=5)
I extended the timeout to 60, but the segmentation fault occurred in about the same time frame as before. Any idea why the seg fault is occurring here?
It is strange; segmentation faults mostly never happen in normal Python code. It is like a pointer error in C++, so it's often a Python module version problem. I don't think your problem is the timeout. My AMD 6900 XT only works with torch 1.13 + rocm5.2 (Ubuntu 22.04). I recommend you create a new clean conda environment, try installing a new torch+rocm build like 1.13+5.2, and install the webui step by step (remember pip install -r requirements.txt).
What is your result of:
import torch
torch.cuda.is_available()
Is there an existing issue for this?
What happened?
I'm trying to run this in a docker container on an Ubuntu 22.04.1 machine with a Radeon 5700 ITX GPU (8 GB), a Ryzen 5 3600 CPU, and 16 GB of RAM.
I followed the instructions from the wiki: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#running-inside-docker
When I try to start the UI, I get a segmentation fault:
I upgraded to python 3.9 following the instructions in the wiki, with the same results.
I suspect it might be related to this other issue, but created a new issue as I'm not sure: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6403
Steps to reproduce the problem
What should have happened?
UI should start up.
Commit where the problem happens
c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8
What platforms do you use to access UI ?
Linux
What browsers do you use to access the UI ?
No response
Command Line Arguments
Additional information, context and logs
The image id of the docker image I'm using is 614789dfdb38. Find the dumped core here (2.8 GB): https://drive.google.com/file/d/1n-ulnrYZ1pjkF9xUJgYasCY3qk8rr5vW/view?usp=share_link