Closed: matoro closed this issue 1 month ago.
@matoro Internal ticket has been created to investigate this issue. Thanks!
Hi @matoro, thanks for the detailed issue report. I can confirm that I was able to reproduce your issue in some specific cases and am currently investigating. I'll detail my process so far.
Your output from `dpkg-query --list '*rocm*'` matches mine, so there does not appear to be an issue with ROCm component versions. To arrive at as similar a configuration as possible, I first uninstalled torch, torchvision, and triton to make sure I'm pulling the same versions as you, then applied your script
grep -E "^[^#]" webui-user.sh
export COMMANDLINE_ARGS="--no-half-vae --opt-sdp-attention"
export TORCH_COMMAND="pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/torch-2.1.2%2Brocm6.1.3-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/torchvision-0.16.1%2Brocm6.1.3-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/pytorch_triton_rocm-2.1.0%2Brocm6.1.3.4d510c3a44-cp310-cp310-linux_x86_64.whl"
export TORCH_BLAS_PREFER_HIPBLASLT=0
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_ROCM_ARCH=gfx1100
export HIP_VISIBLE_DEVICES=0
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
to set the environment variables and ran webui.sh. I was able to generate a crash in this configuration.
There are two things I'd like to suggest for now. One is to try running Stable Diffusion with --listen, and accessing the webui from another machine on your network; when I did this I was able to repeatedly generate images without crashing. By default you can access the headless web UI through <the host machine's IP>:7860.
The other suggestion I have is to try setting up a venv before setting the environment variables and running Stable Diffusion as per https://are-we-gfx1100-yet.github.io/post/a1111-webui/; if you're already doing this and didn't specify or I missed it, I apologize. However, this was inconsistent for me when using a local web UI (i.e. not using the --listen flag); it worked fine multiple times, then I crashed with a local web UI outside of a venv, and following that crash when I went back into the venv to invoke the local web UI again I crashed. This might be due to some corruption that occurs following a crash, or it may just be inconsistent; I'll have to test it further.
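For reference, the manual venv approach can be sketched as below. The directory name and the two exports are illustrative (mirroring the webui-user.sh earlier in this thread); the linked guide has the authoritative steps.

```shell
# Create and activate a dedicated venv before exporting anything.
python3 -m venv sd-venv
. sd-venv/bin/activate

# Export the ROCm-related variables inside the activated venv:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_ROCM_ARCH=gfx1100

# Then launch with --listen so another machine on the LAN can reach
# the UI at <host IP>:7860:
# ./webui.sh --listen
```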
Thanks again for your detailed report, there are several reports of Stable Diffusion crashes on the 7900XTX and other cards that we haven't been able to address and you've brought us one step closer to a solution. Let me know if either of my suggestions help you in the short term so we can narrow down the root cause. If neither of these help, I'll try to reproduce your Ubuntu-inside-Gentoo configuration.
Hey, thanks so much for looking at it.
First off, stable-diffusion already installs everything to a venv by default. I don't install anything outside of a venv.
Secondly, that snippet wasn't supposed to be a script for setting the environment variables; it was demonstrating that I have them set in `webui-user.sh`, which `webui.sh` sources on startup. That regex simply prints all lines which are not commented out. If you do the same thing, then `grep -E "^[^#]" webui-user.sh` should print the same list of exports. That way, you won't have to set those variables every time you run, and you can tweak them before restarting stable-diffusion.
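To illustrate what that regex does, with a made-up sample file:

```shell
# ^[^#] matches lines whose first character exists and is not '#',
# so both comments and blank lines are skipped.
cat > /tmp/sample-user.sh <<'EOF'
#!/bin/bash
# Commandline arguments for webui.py

export COMMANDLINE_ARGS="--no-half-vae"
#export GIT="git"
EOF
grep -E "^[^#]" /tmp/sample-user.sh
```

This prints only the `export` line.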
For posterity's sake, I just tried accessing the UI remotely using `--listen`, and it still crashed my card. I can't imagine that has anything to do with it.
Besides, if stock stable-diffusion crashes at all, that is a problem. If the output looks the same, both on the terminal and in `dmesg`, then it's probably the same issue. Or are you saying that you observed crashes, but with different symptoms?
Hi, thanks for the quick response.
You're correct that Stable Diffusion installs and runs in a venv by default. The behaviour I experienced locally was that manually starting the venv before running fixed the crashes. However, frustratingly, so far today I am unable to reproduce any crashes with --no-half-vae enabled, regardless of manual venv or --listen flag. Running without --no-half-vae still crashes with the same dmesg output you're observing, but since you crash even with this flag set, it doesn't seem to be the cause of your issue.
The next step on my end is to attempt reproduction inside a container in order to match your configuration better. This may be delayed a bit, but I'll update you when I have it set up. In the meantime, can you post the full output from running webui.sh? Also, I apologize if this seems redundant, but just to cover all the bases make sure the environment variables are actually being set and that the correct versions of torch, torchvision, and triton are installed.
I agree that stock Stable Diffusion crashing is an issue, the hope is that finding a consistent workaround will point toward the root cause.
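As a quick way to check the environment variables, something like the following could be run in the same shell that launches webui.sh (the variable list is copied from the webui-user.sh in this thread; adjust to taste):

```shell
# Print each expected variable, or <unset> if it is missing.
# printenv exits nonzero when the variable is not in the environment.
for v in TORCH_BLAS_PREFER_HIPBLASLT HSA_OVERRIDE_GFX_VERSION \
         PYTORCH_ROCM_ARCH HIP_VISIBLE_DEVICES PYTORCH_HIP_ALLOC_CONF; do
  printf '%s=%s\n' "$v" "$(printenv "$v" || echo '<unset>')"
done

# And for the wheel versions, inside the venv:
# pip list | grep -E '^(torch|torchvision|pytorch-triton-rocm)'
```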
Sure, here is the complete output when running and then attempting to generate an image. I do not change any settings or put anything in the prompt.
$ ./webui.sh
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################
################################################################
Running on matoro user
################################################################
################################################################
Repo already cloned, using it as install directory
################################################################
################################################################
Create and activate python venv
################################################################
################################################################
Launching launch.py...
################################################################
glibc version is 2.35
Check TCMalloc: libtcmalloc_minimal.so.4
libtcmalloc_minimal.so.4 is linked with libc.so,execute LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
Version: v1.9.4-295-g9f5a98d5
Commit hash: 9f5a98d5766b4ac233d916fe8b02ea16b8b2c259
Launching Web UI with arguments: --no-half-vae --opt-sdp-attention
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Loading weights [6ce0161689] from /var/tmp/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Startup time: 6.3s (prepare environment: 1.5s, import torch: 1.7s, import gradio: 0.5s, setup paths: 1.5s, other imports: 0.3s, load scripts: 0.3s, create ui: 0.4s).
Creating model from config: /var/tmp/stable-diffusion-webui/configs/v1-inference.yaml
/var/tmp/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Applying attention optimization: sdp... done.
Model loaded in 2.2s (load weights from disk: 0.5s, create model: 0.3s, apply weights to model: 1.1s, calculate empty prompt: 0.2s).
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 17.96it/s]
Memory access fault by GPU node-1 (Agent handle: 0x563187d2e580) on address 0x7fe6b8600000. Reason: Page not present or supervisor privilege.s]
./webui.sh: line 304: 1142 Aborted "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
Here are the complete contents of my `webui-user.sh`:
$ cat webui-user.sh
#!/bin/bash
#########################################################
# Uncomment and change the variables below to your need:#
#########################################################
# Install directory without trailing slash
#install_dir="/home/$(whoami)"
# Name of the subdirectory
#clone_dir="stable-diffusion-webui"
# Commandline arguments for webui.py, for example: export COMMANDLINE_ARGS="--medvram --opt-split-attention"
export COMMANDLINE_ARGS="--no-half-vae --opt-sdp-attention"
# python3 executable
#python_cmd="python3"
# git executable
#export GIT="git"
# python3 venv without trailing slash (defaults to ${install_dir}/${clone_dir}/venv)
#venv_dir="venv"
# script to launch to start the app
#export LAUNCH_SCRIPT="launch.py"
# install command for torch
#export TORCH_COMMAND="pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1"
export TORCH_COMMAND="pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/torch-2.1.2%2Brocm6.1.3-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/torchvision-0.16.1%2Brocm6.1.3-cp310-cp310-linux_x86_64.whl https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/pytorch_triton_rocm-2.1.0%2Brocm6.1.3.4d510c3a44-cp310-cp310-linux_x86_64.whl"
# Requirements file to use for stable-diffusion-webui
#export REQS_FILE="requirements_versions.txt"
# Fixed git repos
#export K_DIFFUSION_PACKAGE=""
#export GFPGAN_PACKAGE=""
# Fixed git commits
#export STABLE_DIFFUSION_COMMIT_HASH=""
#export CODEFORMER_COMMIT_HASH=""
#export BLIP_COMMIT_HASH=""
# Uncomment to enable accelerated launch
#export ACCELERATE="True"
# Uncomment to disable TCMalloc
#export NO_TCMALLOC="True"
export TORCH_BLAS_PREFER_HIPBLASLT=0
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_ROCM_ARCH=gfx1100
export HIP_VISIBLE_DEVICES=0
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
###########################################
I had similar crashes before, but on a different gfx version: when it reaches the final step before rendering the image, the graphics card goes into limbo. In my case the problem was overheating; I adjusted voltage/clocks and the issue disappeared without changing any packages. But it could be a different scenario in your case. It would be helpful if you ran again with a higher AMD_LOG_LEVEL=1 or 2 to see what's going on.
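For a single run, the variable can be set inline and the output captured to a file; a sketch (the stand-in command below just echoes, so substitute the real `./webui.sh` invocation):

```shell
# AMD_LOG_LEVEL applies only to this one invocation; 2>&1 merges the
# runtime's stderr messages into the captured log.
AMD_LOG_LEVEL=2 sh -c 'echo "log level is $AMD_LOG_LEVEL"' 2>&1 | tee out.log
```

With the real command this would be `AMD_LOG_LEVEL=2 ./webui.sh 2>&1 | tee out.log`.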
Sure, here's a log of the issue with `AMD_LOG_LEVEL=2` set: out.log
I wouldn't expect it to be overheating-related, because I am using completely stock settings on my card with no overclocking, and furthermore I watch temps the entire time and never see it break 60C. And I am not doing anything intense; I observe the crash while generating a single 512x512 image with stock settings in SD. What settings did you change, exactly?
I think you can also try level 3; that would output a lot more stuff, but it's always interesting to see more traces. I don't think these missing functions are the real cause for you, though, but it may be good if someone has a look into that :)
What kernel version do you currently run? It may be related to the amdgpu kernel driver.
> I wouldn't expect it to be overheating-related, because I am using completely stock settings on my card with no overclocking, and furthermore I watch temps the entire time and never see it break 60C. And I am not doing anything intense; I observe the crash while generating a single 512x512 image with stock settings in SD. What settings did you change, exactly?
Overheating can happen in the blink of an eye, and it may not be real overheating but a hysteresis effect, either hardware- or software-related (also monitor the status per second using `rocm-smi`). For me, down-clocking the voltage and memory clock helped maintain consistent, stable output.
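Per-second monitoring that survives a crash can be as simple as a timestamped polling loop written to a file; a sketch (the loop is bounded here so it terminates, and `|| true` keeps it going if the status command fails):

```shell
CMD="rocm-smi"          # any status command works here
LOG="gpu-status.log"
# Use `while true` instead of the bounded loop for open-ended logging.
for i in 1 2 3; do
  date '+%F %T' >> "$LOG"
  $CMD >> "$LOG" 2>&1 || true
  sleep 1
done
```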
I don't think that pruned model is more than 4 GB in size, and image generation should not take more than 2 GB, so in total you are using around 5-7 GB. That "out of memory" is therefore just part of the crash procedure (maybe a wrong address translation or a similar memory-related issue) and not really an out-of-memory condition. But the driver, ROCm, PyTorch, etc. are all open possibilities.
> I think you can also try level 3; that would output a lot more stuff, but it's always interesting to see more traces. I don't think these missing functions are the real cause for you, though, but it may be good if someone has a look into that :)
Sure, the log was too large to upload to GitHub, so here is a log capture taken with `AMD_LOG_LEVEL=3`. I noticed it had some C++-mangled names in it, so I ran it through `c++filt` as well.
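For anyone else reading: `c++filt` (from binutils) demangles on a stream, so an existing log can be filtered after the fact. The mangled symbol below is an arbitrary example, not one from the actual log:

```shell
# Demangle a single mangled C++ symbol:
echo '_Z6squarei' | c++filt        # prints: square(int)

# Demangle a whole captured log:
# c++filt < out.log > out.demangled.log
```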
> What kernel version do you currently run? It may be related to the amdgpu kernel driver.
This most recent attempt is on kernel 6.10.2. I have never been able to successfully generate an image or text on my card with any settings on any kernel. Using it for compute in any capacity instantly causes this crash.
> > I wouldn't expect it to be overheating-related, because I am using completely stock settings on my card with no overclocking, and furthermore I watch temps the entire time and never see it break 60C. And I am not doing anything intense; I observe the crash while generating a single 512x512 image with stock settings in SD. What settings did you change, exactly?
> Overheating can happen in the blink of an eye, and it may not be real overheating but a hysteresis effect, either hardware- or software-related (also monitor the status per second using `rocm-smi`). For me, down-clocking the voltage and memory clock helped maintain consistent, stable output.
Can you share exactly what settings you used? I'm just tired of randomly flipping settings only to get the same results over and over.
> I don't think that pruned model is more than 4 GB in size, and image generation should not take more than 2 GB, so in total you are using around 5-7 GB. That "out of memory" is therefore just part of the crash procedure (maybe a wrong address translation or a similar memory-related issue) and not really an out-of-memory condition. But the driver, ROCm, PyTorch, etc. are all open possibilities.
I am using the exact ROCm packages and pytorch packages that were specified by @schung-amd , an AMD employee, in https://github.com/ROCm/ROCm/issues/3265#issuecomment-2245358795 . I can't imagine how this could be any closer to textbook at least from the userspace side.
> Sure, the log was too large to upload to GitHub, so here is a log capture taken with `AMD_LOG_LEVEL=3`. I noticed it had some C++-mangled names in it, so I ran it through `c++filt` as well.
I think the card goes into limbo and is no longer an accessible resource at the point where it reports out of memory or privileges. What does rocm-smi report after it's dead? Does it actually list the card with all its parameter values correct, like temp, VRAM, etc.?
I saw that you wrote the card remains functional after the crash, but is it really functional, or does it just have framebuffer capability and not ROCm? Can it start the script without any problem and then crash at 20/20?
> This most recent attempt is on kernel 6.10.2. I have never been able to successfully generate an image or text on my card with any settings on any kernel. Using it for compute in any capacity instantly causes this crash.
So basically you tried earlier versions like 6.4?
> Can you share exactly what settings you used? I'm just tired of randomly flipping settings only to get the same results over and over.
I actually did not understand which settings I should share :) Do you mean the kernel parameter settings, the DF parameter settings, or something else? Also, regarding kernel parameters: are you using any?
> > Sure, the log was too large to upload to GitHub, so here is a log capture taken with `AMD_LOG_LEVEL=3`. I noticed it had some C++-mangled names in it, so I ran it through `c++filt` as well.
>
> I think the card goes into limbo and is no longer an accessible resource at the point where it reports out of memory or privileges. What does rocm-smi report after it's dead? Does it actually list the card with all its parameter values correct, like temp, VRAM, etc.?
> I saw that you wrote the card remains functional after the crash, but is it really functional, or does it just have framebuffer capability and not ROCm? Can it start the script without any problem and then crash at 20/20?
Yes, as soon as the driver resets the card (after the line that says `amdgpu: GPU reset(3) succeeded!`), it resumes being fully functional, except that the Xorg session freezes. It continues displaying a frozen image of whatever was on screen at the time of the crash. If I ssh in, kill the Xorg session, and start a new one, the new session is fully functional.
> > This most recent attempt is on kernel 6.10.2. I have never been able to successfully generate an image or text on my card with any settings on any kernel. Using it for compute in any capacity instantly causes this crash.
>
> So basically you tried earlier versions like 6.4?
I've periodically tried this at least once on all major kernel releases since support was added, which is a pretty recent version. I don't think 7900XTX was supported as far back as kernel 6.4. Regardless, I've tried on 6.9 and 6.10 at least in this thread.
> > Can you share exactly what settings you used? I'm just tired of randomly flipping settings only to get the same results over and over.
>
> I actually did not understand which settings I should share :) Do you mean the kernel parameter settings, the DF parameter settings, or something else? Also, regarding kernel parameters: are you using any?
My full kernel command line is:
ro debug amd_iommu=pt pcie_acs_override=downstream rd.driver.pre=vfio-pci mitigations=off transparent_hugepage=never rcu_nocbs=8-15,24-31 nohz_full=8-15,24-31 clocksource=tsc acpi_enforce_resources=lax amdgpu.ppfeaturemask=0xffffffff amdgpu.mcbp=0
Some of those are related to running virtual machines; since my 7900XTX does not work for compute, I pass through an NVIDIA card to a virtual machine. The last two items especially (`amdgpu.ppfeaturemask=0xffffffff amdgpu.mcbp=0`) were recommended by other users who reported that they solved their problems, but neither has worked for me.
You said earlier:

> For me, down-clocking the voltage and memory clock helped maintain consistent, stable output.

That is what I am looking for. Can you please specify what voltage and memory clock you set?
Sure. The voltages and clocks can be adjusted with normal cat and echo commands on the sysfs files; here is a repo that does that in a nice script, thanks to this guy :) https://github.com/sibradzic/amdgpu-clocks If you need the actual values for the clocks and voltages, my numbers are not going to help you, because they are not for the RX 7900 XTX; what you can do is look at the current table and try lowering it by a factor of 10%.
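For readers who want to see what that script drives under the hood: the amdgpu OverDrive table lives in sysfs, roughly as below. The state index and MHz value are hypothetical, this needs root plus `amdgpu.ppfeaturemask` with OverDrive enabled, and the exact command syntax varies by GPU generation, so treat it purely as a sketch and prefer the linked script.

```shell
OD="/sys/class/drm/card0/device/pp_od_clk_voltage"
if [ -w "$OD" ]; then
  cat "$OD"               # show the current clock/voltage table
  echo "s 1 2400" > "$OD" # stage: set sclk state 1 to 2400 MHz (hypothetical value)
  echo "c" > "$OD"        # commit the staged values ("r" would reset them)
else
  echo "OverDrive table not writable here (need root and OverDrive enabled)."
fi
```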
The boot params look good. Just for a temporary test, try `amdgpu.dpm=0` to see if that also crashes; this setting is not really good (it disables the card's dynamic power management), so only use it for the test.
I recommend removing the rest of the params except ppfeaturemask and amd_iommu, to make sure nothing is causing out-of-control behavior, though that is most likely not the case.
Just wanted to chime in and confirm that I'm having the same issue (again). It used to work, but suddenly when I tried to use Stable Diffusion again at a later time, it had stopped working and is crashing my X11 session (again), which I'm unable to recover without a reboot.
I say again, because it really likes to break out of the blue every now and then. I then somehow manage to get it back to a working state after a while and lots of fiddling and trial and error, only for it to break again later for whatever reason. It's somehow super fragile, which is frustrating because I don't know why. :|
syslog: same as OP
GPU: same as OP
OS: Ubuntu 22.04.4 LTS
Kernel: 6.9.9
ROCm: both 6.1.2 and 6.2
ii rocm-core 6.2.0.60200-66~22.04 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-device-libs 1.0.0.60200-66~22.04 amd64 Radeon Open Compute - device libraries
ii rocm-llvm 18.0.0.24292.60200-66~22.04 amd64 ROCm core compiler
ii rocm-smi-lib 7.3.0.60200-66~22.04 amd64 AMD System Management libraries
ii rocminfo 1.0.0.60200-66~22.04 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool
ii amdgpu-core 1:6.2.60200-2009582.22.04 all Core meta package for unified amdgpu driver.
ii amdgpu-install 6.2.60200-2009582.22.04 all AMDGPU driver repository and installer
ii libdrm-amdgpu-amdgpu1:amd64 1:2.4.120.60200-2009582.22.04 amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libdrm-amdgpu-common 1.0.0.60200-2009582.22.04 all List of AMD/ATI cards' device IDs, revision IDs and marketing names
ii libdrm-amdgpu-dev:amd64 1:2.4.120.60200-2009582.22.04 amd64 Userspace interface to kernel DRM services -- development files
ii libdrm-amdgpu-radeon1:amd64 1:2.4.120.60200-2009582.22.04 amd64 Userspace interface to radeon-specific kernel DRM services -- runtime
ii libdrm-amdgpu1:amd64 2.4.120-1~kisak~j amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libdrm-amdgpu1:i386 2.4.120-1~kisak~j i386 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libdrm2-amdgpu:amd64 1:2.4.120.60200-2009582.22.04 amd64 Userspace interface to kernel DRM services -- runtime
ii ricks-amdgpu-utils 3.6.0-2 all transitional package
ii xserver-xorg-video-amdgpu 22.0.0-1ubuntu0.2 amd64 X.Org X server -- AMDGPU display driver
export PATH="/opt/rocm/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/rocm/lib/"
export ROCM_VERSION=6.2
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HCC_AMDGPU_TARGET=gfx1100
export GPU_MAX_HW_QUEUES=1
export TORCH_BLAS_PREFER_HIPBLASLT=0
open-clip-torch 2.20.0
pytorch-lightning 1.9.4
pytorch-triton-rocm 3.0.0+01cbe5045a
torch 2.5.0.dev20240616+rocm6.1
torchaudio 2.4.0.dev20240616+rocm6.1
torchcrepe 0.0.20
torchdiffeq 0.2.3
torchfcpe 0.0.4
torchmetrics 1.2.0
torchsde 0.2.6
torchvision 0.19.0.dev20240616+rocm6.1
quiet splash amdgpu.ppfeaturemask=0xfff7ffff amd_iommu=on iommu=pt
PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:128 ./webui-my.sh --opt-split-attention --api --no-half-vae
Note: `webui-my.sh` is a copy of an older version of the original start script, which I've modified to detect my GPU properly.
I've tried a fresh install of automatic1111, and that one did not crash, but it used different options than I do: `--precision full --no-half`.
With these it does not crash. I remember having to use `--no-half` quite a while ago to avoid crashes; at some point I could drop it without issues. Looks like I need this option again now.
Anyways, with this line it no longer crashes on my existing installation of automatic1111:
PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:128 ./webui-my.sh --opt-split-attention --api --no-half-vae --precision full --no-half
The rest stays as described above. I'll see how it goes, at least it doesn't crash immediately now, which I'd call progress. :)
Thank you for this, the above was enough to finally get a session where I was able to generate images.
After I confirmed that it was successfully stable, I tried to reduce the options one-by-one to come up with a set of options which would reliably reproduce the problem. However, what I found out instead is that the issue is not reproducible on-demand. Some option sets would crash, so I would remove an option and try again, get no crash, then go back to the one that originally showed it and get no crash. At least I did observe that if it crashed, it would do so on the first image generation. If it generated a single image and didn't crash, then generating further images regardless of model would also not crash.
I was able at one point to completely empty my `COMMANDLINE_ARGS` variable and not get a crash. The best option set seems to be, as above:
export COMMANDLINE_ARGS="--opt-split-attention --no-half-vae --precision full --no-half"
as I never got a crash with this configuration. Unfortunately, since the problem is intermittent when changing options and the same sets of options do not always trigger it, it might be hard to track down.
Additionally there's no guarantee that it's not a stateful issue, where switching the options itself is what triggers the problem.
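Given that intermittency, repeated scripted runs per flag set might reveal the pattern better than one-off manual tests; a rough sketch with a placeholder launch command (a real test would invoke webui.sh and script an image generation, which isn't shown here):

```shell
#!/bin/sh
# Run each candidate flag set several times and log pass/fail;
# a single clean run proves little for an intermittent crash.
LAUNCH="echo simulated-run"   # placeholder; substitute the real launcher
for args in "--no-half-vae" \
            "--no-half-vae --no-half" \
            "--opt-split-attention --no-half-vae --precision full --no-half"
do
  for run in 1 2 3; do
    if COMMANDLINE_ARGS="$args" $LAUNCH >/dev/null 2>&1; then
      echo "OK   run=$run args=[$args]" >> sweep.log
    else
      echo "FAIL run=$run args=[$args]" >> sweep.log
    fi
  done
done
```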
Thanks for following up on this, I'm glad you were able to get things working. @smirgol I noticed you're on ROCm 6.2 now; out of curiosity, do you crash without `--no-half` on ROCm 6.1.x, or is this behavior new to ROCm 6.2? Our internal reproduction of the crash on ROCm 6.1.3 was fixed with `--no-half-vae` alone, but I wonder if we were just "fortunate" in not requiring `--no-half` for some reason; in retrospect, requiring both makes sense, as both are related to FP16 support. Stable Diffusion crashes on the 7900XTX have been reported from time to time and we've had a hard time reproducing them internally, so this is a good data point.
Looking at the A1111 repo, it seems both `--no-half` and `--no-half-vae` are known to be necessary on some AMD cards (e.g. https://github.com/vladmandic/automatic/discussions/1041; also see the `--no-half` requirement stated for some AMD cards in https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs). However, it appears these flags were automatically set at some point and are no longer, so some of the intermittent nature of this issue might be due to A1111 version differences.
@matoro If the issue appears to be solved on your end after a while, please close this issue, or I'll close this if I don't hear any further issues by the end of the week; but feel free to reopen or make a new issue if you start experiencing these crashes again. It seems like half precision is a common pain point for Stable Diffusion configurations in general, and I don't know if an internal solution is possible.
Problem Description
As requested in https://github.com/ROCm/ROCm/issues/3265 , I'm opening this separate issue since the recommendations listed there did not resolve the issue.
I am running this in an Ubuntu container, but the host, and thus the kernel, is Gentoo. Kernel is 6.9.9-gentoo-dist.
Attempting to generate images in Stable Diffusion crashes the card. The graphical session is recoverable by remotely killing and restarting the X server.
Operating System
Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CPU
AMD Ryzen 9 3950X 16-Core Processor
GPU
AMD Radeon RX 7900 XTX
ROCm Version
ROCm 6.1.0
ROCm Component
No response
Steps to Reproduce
Followed the exact steps as specified in https://github.com/ROCm/ROCm/issues/3265#issuecomment-2245358795
Installed ROCm 6.1.3:
Clone stable diffusion repo. Make changes as specified:
Run and attempt to generate an image.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
Stable Diffusion reports:
The following errors are emitted from the kernel: