Closed geroldmeisinger closed 3 weeks ago
Additional guesses
I can somewhat consistently generate when I increase swap and run in normal vram mode which suggests a certain system OOM condition:
have you tried running the model in fp8? at fp16 (default) it's definitely too big for your VRAM and 32 GB SRAM isn't gonna save you either as it has to hold the text encoders, vae and diffusion model. Using swap to extend your SRAM to extend your VRAM is not a good idea imho. If it runs in fp8 then it's likely just a matter of not enough vram/sram and you might need to upgrade to run in fp16. i doubt there's much to be done on the software side, aside from general memory efficiency improvements.
Thanks for the suggestion. I tried --lowvram flux-dev fp8_e4m3fn and --lowvram flux-dev-fp8. Both crash(!), which invalidates the OOM hypothesis and the faulty swap hypothesis. I'm on 0a6b0081176c6233015ec00d004c534c088ddcb0 now.
enough SRAM and VRAM left, seconds before the crash
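A reading like this can also be taken without screenshots. The following is a minimal sketch, assuming a Linux system where /proc/meminfo is available; the function names are made up for illustration and are not part of ComfyUI.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])  # memory fields are reported in kB
    return info


def headroom_gib(info):
    """Return (available RAM, free swap) in GiB from a parsed meminfo dict."""
    kib_per_gib = 1024 * 1024
    return (info.get("MemAvailable", 0) / kib_per_gib,
            info.get("SwapFree", 0) / kib_per_gib)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        ram, swap = headroom_gib(parse_meminfo(f.read()))
    print(f"available RAM: {ram:.1f} GiB, free swap: {swap:.1f} GiB")
```

Running it in a loop from a second terminal while the workflow executes gives the last known headroom before a crash. VRAM is not covered here and still needs nvidia-smi or the vendor's equivalent.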
Upgraded to 64GB SRAM now, same problem with --lowvram. But --normalvram seems to be more stable now, Comfy does not get killed.
still happens on 413322645e713bdda69836620a97d4c9ca66b230 with --lowvram
I have same issue. I have AMD Radeon 7800 xt 16GB vram. I have 32GB system ram and it starts using some swap but there's lots free.
I have tried flux-dev-fp8
I often get hard lock where I need to hold power button to turn off, sometimes I can recover using altgr-rei
I found this to make some pytorch tracing. It should work on Nvidia as well. https://rocmdocs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#pytorch-profiler
I dug through some error logs and it seems that zram was failing when system RAM was low. Removing zram and using just a swap file solved some crashes for me.
The only issue left now is that I get soft-lock with the kernel spamming this line:
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Then I found this page: https://bugzilla.kernel.org/show_bug.cgi?id=209163#c4
I guess a failed VRAM malloc call is not being checked or handled correctly somewhere. How to do that in python is outside my code-fu :/
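On the Python side, handling a failed allocation usually means catching the exception, releasing memory and retrying in a cheaper way. Below is a minimal sketch of that pattern. The helper name and its parameters are hypothetical and not taken from ComfyUI. With PyTorch one would presumably pass torch.cuda.OutOfMemoryError as the error type and torch.cuda.empty_cache as the cleanup, which is an assumption about how it would be wired up. A failure that happens inside the kernel driver, like the amdgpu message above, may never reach Python as an exception, so this may not help with the soft-lock itself.

```python
import gc


def run_with_oom_fallback(step, fallback, oom_errors=(MemoryError,), cleanup=gc.collect):
    """Run step(); if it raises an out-of-memory error, free memory and run fallback()."""
    try:
        return step()
    except oom_errors:
        # Release whatever the failed attempt left behind before trying again.
        cleanup()
        return fallback()
```

Here step could be the full-precision path and fallback a lower-memory path. Any exception not listed in oom_errors still propagates unchanged.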
Edit 1: I think I solved it using this kernel parameter: https://www.phoronix.com/forums/forum/software/linux-gaming/1304169-dxvk-1-9-4-released-with-better-support-for-god-of-war?p=1304262#post1304262
Edit 2: Nope, it still happens, but I was able to do quite a lot of image generations before I got this error.
still happens on 2622c55aff9433d425a62e5f6c379cf22a42139e with --lowvram
Kernel panic means there's something wrong with your driver/OS config or hardware.
I always get OOMs (both vram and ram) sometimes on purpose when testing various things and have never had my kernel panic because of it.
It only happens in lowvram, without it it works. With normalvram I too get OOM all the time without kernel panic.
I just tried reverting to cuda 12.3.2 and Driver Version: 545.23.08 (the oldest update number which supports debian 12). Same issue :/ I also checked the whole hardware for any issues (see first post), unless there is something missing.
sudo apt-get remove --purge nvidia-* libnvidia-* libxnvctrl* cuda*
sudo apt-get install -y cuda-drivers
nvidia-smi
Sun Aug 18 10:44:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
git pull
rm -fR venv
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python main.py --lowvram --disable-all-custom-nodes
Intel 13th generation CPU bug
BIOS update, Robeytech test (Windows) all positive => same issue
Power consumption
I also ran Prime95+3DMark (Windows) for 5min => stable
Did you find anything? I use Forge, and Flux crashes my PC or video card driver every second generation. I also need to limit the VRAM weight to 20G so I can generate the first image, otherwise Forge itself crashes. It works fine if I use the fp8 model.
13900K (I tested everything and it seems fine) 4090 64G
no, except it works with --normalvram. for me fp8 doesn't work with --lowvram. so there is something in the lowvram utilization of (any) flux model which causes the kernel to crash.
@rsl8 thanks for the hint btw, apparently I had my linux-image set to manual
Set up a fresh Debian 12.6 system
cat /proc/version
Linux version 6.1.0-23-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15)
(this is the latest stable linux kernel in debian)
same issue
echo "deb http://deb.debian.org/debian bookworm-backports main contrib non-free non-free-firmware" | sudo tee -a /etc/apt/sources.list
apt update
apt install -t bookworm-backports linux-image-amd64
# reboot
cat /proc/version
# Linux version 6.9.7+bpo-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.9.7-1~bpo12+1 (2024-07-03)
same issue
I think I found the problem and may be affected by the Intel Raptor Lake instability and degradation issue due to elevated operating voltage after all.
Before you do anything UPDATE YOUR BIOS OR YOU MAY DAMAGE YOUR CPU!
Update your BIOS before you do this and make sure it includes something like "Update microcode 0x129 to address sporadic Vcore elevation behavior announced by Intel".
The following models are affected:
13th gen:
i9-13900KS
i9-13900K
i9-13900KF
i9-13900F
i9-13900
i7-13700K
i7-13700KF
i7-13790F
i7-13700F
i7-13700
i5-13600K
i5-13600KF
14th gen:
i9-14900KS
i9-14900K
i9-14900KF
i9-14900F
i9-14900
i7-14700K
i7-14700KF
i7-14790F
i7-14700F
i7-14700
i5-14600K
i5-14600KF
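To check a machine against this list, the model name from /proc/cpuinfo can be compared with it. The following is a minimal sketch, assuming the usual "model name" line format on Linux. The set is copied from the list in this thread and is not an authoritative list from Intel, and the function names are made up for illustration.

```python
import os
import re

# Copied from the list in this thread; not an official Intel source.
LISTED_MODELS = {
    "i9-13900KS", "i9-13900K", "i9-13900KF", "i9-13900F", "i9-13900",
    "i7-13700K", "i7-13700KF", "i7-13790F", "i7-13700F", "i7-13700",
    "i5-13600K", "i5-13600KF",
    "i9-14900KS", "i9-14900K", "i9-14900KF", "i9-14900F", "i9-14900",
    "i7-14700K", "i7-14700KF", "i7-14790F", "i7-14700F", "i7-14700",
    "i5-14600K", "i5-14600KF",
}


def extract_model(cpuinfo_text):
    """Return the first token like 'i9-13900K' found in the text, or None."""
    match = re.search(r"\bi[3579]-\d{4,5}[A-Z]{0,2}\b", cpuinfo_text)
    return match.group(0) if match else None


def is_listed(cpuinfo_text):
    """True if the CPU model in the text appears in LISTED_MODELS."""
    return extract_model(cpuinfo_text) in LISTED_MODELS


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        text = f.read()
    print(extract_model(text), "listed" if is_listed(text) else "not listed")
```

A match only says the model is on the list. It says nothing about whether that particular chip has already degraded.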
Solution
Load a low-voltage profile in UEFI (I never tried this before because I assumed the BIOS defaults are fine):
If I use "e-core disable" I am able to run Flux on --lowvram. It may be called differently by your mainboard vendor.
@geroldmeisinger I can pass all of the tests
can you describe your symptoms in more detail please!
Recently, when upgrading my computer, I avoided Intel because of that issue. I see that's how the symptoms manifest... You must have gone through a lot of trouble to identify the cause.
Thanks for the empathy! Yes.
I tried multiple CPU benchmarks on Windows 10 with "E-Core disable" and "Spec Enhance" performance profiles:
GeekBench 6 errors and crashes on "Spec Enhance" (blue screen of death "Clock_Watchdog_Timeout"). Performance loss is about -20-25%(!) which is...
I called Intel support and they will exchange my processor.
Expected Behavior
I'm able to generate images with flux-dev and flux-schnell sometimes, but usually the whole computer crashes or comfy gets killed. I tried the flux-dev and flux-schnell default workflows from ComfyUI with t5 fp8 (instead of fp16), everything else on default. ComfyUI was started with --lowvram --disable-all-custom-nodes. The crash usually happens when ComfyUI visually executes ClipTextEncode, but when running it on its own it doesn't seem to be the issue.
Actual Behavior
The whole computer crashes (with --lowvram)
Killed (with --normalvram)
Steps to Reproduce
Guesses
Diagnostics
Kernel panic
This is the most useful log. I opened a root terminal on Ctrl+Alt+F1, a user terminal on Ctrl+Alt+F2 and the desktop on Ctrl+Alt+F7.
systemctl disable lightdm (saves VRAM and SRAM)
Unfortunately I wasn't able to make good screenshots, but here are still frames (sorry for the quality, I had to take a video with a smartphone):
I think this is the most relevant log
...and then it all happened so fast!
Temperature
I tried sensors and i7z but temp is around 40-60°.
Example (idle):
the seconds before disaster strikes
(Please note that I took a screenshot every other second, so there is still the possibility of a huge spike just before the crash that I didn't catch. But there is only one core utilized, so this is unlikely. For the same reason I don't expect the power supply to be the issue.)
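The sampling gap could be narrowed by logging temperatures to a file at a higher rate and forcing every line to disk, so the last sample before a crash is more likely to survive. The following is a minimal sketch, assuming the Linux sysfs thermal zones under /sys/class/thermal, which report millidegrees Celsius. The function names and the log path are hypothetical.

```python
import glob
import os
import time


def read_zone_temps(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return temperatures in degrees Celsius for every readable thermal zone."""
    temps = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()) / 1000.0)  # millidegrees -> degrees
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def log_samples(path, sample, count, interval_s=0.1):
    """Append `count` timestamped samples to `path`, syncing each line to disk."""
    with open(path, "a") as log:
        for _ in range(count):
            log.write(f"{time.time():.3f} {sample()}\n")
            log.flush()
            os.fsync(log.fileno())  # ask the OS to write the line out before continuing
            time.sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical log location; about one minute of samples at 10 Hz.
    log_samples("temps.log", read_zone_temps, count=600)
```

Even with fsync, a sample taken in the final fraction of a second may be lost, so this narrows the blind spot without removing it.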
Out of memory
Maybe the system runs into OOM. I tried increasing swap from 1GB to 32GB and 64GB but it didn't help:
What I noticed however is that with --lowvram the VRAM doesn't get utilized at all (nvidia-smi) before it crashes.
Memtest86+
Maybe the SRAM is faulty? I ran memtest and it passed.
Maybe disk is faulty? TODO
Dmesg and other logs
I looked in various logs but found no useful info => omitted
Drivers
sudo apt-get remove --purge nvidia-* libnvidia-* libxnvctrl* cuda*
apt-get install -y nvidia-open
nvidia-smi
ComfyUI
Model files
sha256sum ... are all correct.
ClipTextEncode
I tried only running ClipTextEncode and hooked it up to ComfyUI-essentials Debug Tensor, and this usually works(!)... which suggests the visual information "executing on ClipTextEncode" from ComfyUI might be misleading (it's actually somewhere else already) or there is some interaction when flux and clips are both loaded.
Debug Logs
Other
System
(other notable info: system and comfy runs on NVME disk whereas the models are symlinked at a SATA SSD.)
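Since the models live on a separate disk behind symlinks, the checksum check from the Model files section can be repeated from Python to confirm the files read back intact through the same paths ComfyUI uses. The following is a minimal sketch using only hashlib. The function names are made up, and the expected digests have to come from the model publisher, none are given here.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte models are never loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:  # open() follows symlinks to the real file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(expected):
    """Given {path: expected hex digest}, return the paths whose hash differs."""
    return [path for path, want in expected.items() if sha256_of(path) != want.lower()]
```

An empty list from mismatches means every file read back with the expected hash.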