comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Flux - CPU stall, Kernel panic, Computer crash, ComfyUI killed #4198

Closed geroldmeisinger closed 3 weeks ago

geroldmeisinger commented 1 month ago

Expected Behavior

I'm able to generate images with flux-dev and flux-schnell sometimes, but usually the whole computer crashes or Comfy gets killed. I tried the flux-dev and flux-schnell default workflows from ComfyUI with t5 fp8 (instead of fp16), everything else on default. ComfyUI was started with --lowvram --disable-all-custom-nodes. The crash usually happens when ComfyUI visually executes ClipTextEncode, but when running that node on its own it doesn't seem to be the issue.

Actual Behavior

  1. ComfyUI executes ClipTextEncode; after a while the computer hangs for ~3 seconds, then automatically reboots (with --lowvram)
  2. ComfyUI terminal: Killed (with --normalvram)
  3. Computer stutters for a while but the image gets generated

Steps to Reproduce

Guesses

  1. Out of memory
  2. Faulty hardware
  3. CPU or GPU Temperature
  4. Driver issue
  5. Comfy out of date
  6. Model files corrupt

Diagnostics

Kernel panic

This is the most useful log. I opened a root terminal on Ctrl+Alt+1, a user terminal on Ctrl+Alt+2, and the desktop on Ctrl+Alt+7.

  1. On the user terminal, start ComfyUI
  2. On the desktop, queue the prompt, then close the browser
  3. On the root terminal, disable the desktop manager with systemctl disable lightdm (saves VRAM and SRAM)
  4. Go back to the user terminal and watch ComfyUI (a minimal capture sketch follows below)
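
A minimal sketch of the same capture setup, assuming systemd/journalctl and lightdm as the display manager (the steps above used systemctl disable, which only takes effect on the next boot; stop frees the memory immediately):

# on the root terminal: stop the display manager to free VRAM/RAM right away
systemctl stop lightdm

# then follow kernel messages live, so the RCU stall output stays visible after
# the desktop is gone (a hard panic may still cut this off, but the preceding
# stall messages usually make it through)
journalctl -k -f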

Unfortunately I wasn't able to make good screenshots, but here are some still frames (sorry for the quality, I had to take a video with my smartphone):


I think this is the most relevant log

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     16-...0: (8 ticks this GP) idle=108c/1/0x40000000000000 softirq=4161/4165 fqs=2396
mce: CPUs not responding to MCE broadcast (may include false positives): 16
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler


...and then it all happened so fast!

Temperature

I tried sensors and i7z, but the temperature is around 40-60°C.

Example (idle):

Cpu speed from cpuinfo 2111.00Mhz
cpuinfo might be wrong if cpufreq is enabled. To guess correctly try estimating via tsc
Linux's inbuilt cpu_khz code emulated now
True Frequency (without accounting Turbo) 2111 MHz
  CPU Multiplier 21x || Bus clock frequency (BCLK) 100.52 MHz

Socket [0] - [physical cores=16, logical cores=24, max online cores ever=16]
  TURBO ENABLED on 16 Cores, Hyper Threading ON
  Max Frequency without considering Turbo 2211.52 MHz (100.52 x [22])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 Cores is  52x/52x/51x/51x/51x/51x
  Real Current Frequency 1654.28 MHz [100.52 x 16.46] (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %  Temp      VCore
        Core 1 [0]:       1208.93 (12.03x)         1    95.5       0    4.12    34      0.7916
        Core 2 [2]:       1008.66 (10.03x)         1     100       0       0    30      0.7916
        Core 3 [4]:       1062.96 (10.57x)         1    98.9       0       1    33      0.7916
        Core 4 [6]:       1103.47 (10.98x)         1    99.9       0       0    34      0.7916
        Core 5 [8]:       1315.81 (13.09x)      8.07    92.1       0    2.85    34      0.7866
        Core 6 [10]:      1654.28 (16.46x)      8.72    84.5       0    8.67    32      0.7866
        Core 7 [12]:      1151.81 (11.46x)      7.15    90.8       0    5.32    33      0.7863
        Core 8 [14]:      1081.30 (10.76x)      1.38    97.3       0    2.02    34      0.7913
        Core 9 [16]:      1049.93 (10.44x)         1    5.68       0    94.3    36      0.7863
        Core 10 [17]:     1195.70 (11.89x)         1    1.75       0    98.1    36      0.7863
        Core 11 [18]:     1080.99 (10.75x)      3.28    12.7       0    85.6    36      0.7863
        Core 12 [19]:     1124.96 (11.19x)         1    6.19       0    93.8    36      0.7863
        Core 13 [20]:     1148.47 (11.42x)         1    2.13       0    97.4    34      0.7863
        Core 14 [21]:     1200.57 (11.94x)      2.89    3.21       0    95.1    34      0.7863
        Core 15 [22]:     892.17 (8.88x)           1    2.27       0    97.7    34      0.7863
        Core 16 [23]:     1101.13 (10.95x)         1    1.58       0    98.3    34      0.7863

(i7z screenshots taken in the seconds before disaster strikes)

(Please note that I took a screenshot every other second, so there is still the possibility of a huge spike just before the crash that I didn't catch... but only one core is utilized, so this is unlikely. For the same reason I don't expect the power supply to be the issue.)
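
Instead of screenshots every other second, a simple logging loop would catch a short spike; a sketch, assuming lm-sensors is installed (the log path is arbitrary):

# append a timestamped sensors snapshot once per second until Ctrl+C
while true; do
    date '+%F %T' >> /tmp/temps.log
    sensors >> /tmp/temps.log
    sleep 1
done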

Out of memory

Maybe the system runs into OOM. I tried increasing swap from 1GB to 32GB and 64GB but it didn't help:

sudo dd if=/dev/zero of=/swapfile bs=1M count=32768   # 32 GB swap file
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapoff /dev/dm-2                                # turn off the old 1 GB swap
sudo swapon /swapfile

# make it permanent by adding this line to /etc/fstab:
nano /etc/fstab
/swapfile swap swap defaults 0 0

What I noticed, however, is that with --lowvram the VRAM doesn't get utilized at all (per nvidia-smi) before the crash.
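
To pin that down, RAM and VRAM can be sampled side by side while the prompt runs; a sketch (log paths and the one-second interval are arbitrary):

# log system RAM and swap usage every second (background job)
free -m -s 1 >> /tmp/ram.log &

# log GPU memory usage every second (background job)
nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
           --format=csv -l 1 >> /tmp/vram.log &

# ...queue the ComfyUI prompt, then stop both loggers
kill %1 %2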

Memtest86+

Maybe the SRAM is faulty? I ran memtest and it passed.


Maybe the disk is faulty? TODO
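
One way to fill in that TODO, assuming smartmontools is available; the device names are guesses for this NVMe-plus-SATA setup and need to be adjusted:

sudo apt install smartmontools
# NVMe system disk (assumed /dev/nvme0n1) and SATA SSD holding the models (assumed /dev/sda)
sudo smartctl -a /dev/nvme0n1
sudo smartctl -a /dev/sda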

Dmesg and other logs

I looked in various logs but found no useful info => omitted. (A filtering sketch follows after the list.)

/var/log/syslog
/var/log/messages
/var/log/kern.log
$ journalctl -b -1
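
A sketch for narrowing those logs down to kernel errors from the boot that crashed:

# only kernel messages of priority "err" or worse from the previous boot
sudo journalctl -b -1 -k -p err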

Drivers

  1. sudo apt-get remove --purge nvidia-* libnvidia-* libxnvctrl* cuda*
  2. Reboot
  3. Install cuda
  4. Install driver: apt-get install -y nvidia-open
  5. Reboot
  6. nvidia-smi
Sun Aug  4 10:42:21 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
...

ComfyUI

git pull
# f7a5107784cded39f92a4bb7553507575e78edbe
rm -fR venv
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
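
A quick sanity check after the reinstall, before launching ComfyUI (a sketch):

# confirm the fresh venv actually sees the GPU
source venv/bin/activate
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"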

Model files

  1. sha256sum ... are all correct.
  2. It works sometimes... so they should be fine.

ClipTextEncode

I tried running only ClipTextEncode and hooked it up to the ComfyUI-essentials Debug Tensor node, and this usually works(!)... which suggests the visual "executing ClipTextEncode" indicator in ComfyUI might be misleading (execution is actually somewhere else already), or there is some interaction when Flux and the CLIP models are both loaded.

Debug Logs

python main.py --lowvram --disable-all-custom-nodes
Total VRAM 15971 MB, total RAM 31923 MB
pytorch version: 2.4.0+cu121
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4060 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: /home/meisi/dev/ComfyUI/web
Adding extra search path checkpoints ./models/Stable-diffusion
Adding extra search path configs ./models/Stable-diffusion
Adding extra search path vae ./models/VAE
Adding extra search path loras ./models/Lora
Adding extra search path loras ./models/LyCORIS
Adding extra search path upscale_models ./models/ESRGAN
Adding extra search path upscale_models ./models/RealESRGAN
Adding extra search path upscale_models ./models/SwinIR
Adding extra search path embeddings ./embeddings
Adding extra search path hypernetworks ./models/hypernetworks
Adding extra search path controlnet ./models/ControlNet
/home/meisi/dev/ComfyUI/venv/lib/python3.11/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
Skipping loading of custom nodes
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model_type FLOW
model weight dtype torch.bfloat16, manual cast: None
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model

Killed (or crash)

Other

System

nvidia-smi
cat /proc/version
cat /etc/*-release
System Info
OS: Debian Linux 12.2 (bookworm)
CPU: 13th Gen Intel Core i7-13700F x 16
GPU: NVIDIA GeForce RTX 4060 Ti
VRAM: 16GB
CUDA version: 12.6
Driver version: 560.28.03
SRAM: 32GB

(Other notable info: the system and ComfyUI run on an NVMe disk, whereas the models are symlinked from a SATA SSD.)

geroldmeisinger commented 1 month ago

Additional guesses

I can generate somewhat consistently when I increase swap and run in normal VRAM mode, which suggests a system OOM condition:

  1. In lowvram mode, even if swap is increased to 64GB, the VRAM is not utilized at all and maybe swap is too slow(?), so the whole system runs into a kernel panic.
  2. In normal VRAM mode, if I keep swap at 1GB, the VRAM is utilized, but with 1GB swap it might just be at the edge: if some other program requires more RAM, ComfyUI gets killed. Another thing I notice is that the whole system begins to stutter when I reopen the browser, which suggests heavy swapping. If I increase swap to just 4GB I can reopen the browser normally (this is also my solution for now). See the OOM-killer check below.
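
One way to tell the two failure modes apart is to check whether the kernel's OOM killer actually fired before the kill or the panic; a sketch (the grep pattern is a common one, not exhaustive):

# OOM-killer events from the current and the previous boot
sudo journalctl -k -b 0  | grep -iE 'out of memory|oom-killer'
sudo journalctl -k -b -1 | grep -iE 'out of memory|oom-killer'
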
kanttouchthis commented 1 month ago

Have you tried running the model in fp8? At fp16 (the default) it's definitely too big for your VRAM, and 32 GB of SRAM isn't going to save you either, as it has to hold the text encoders, VAE and diffusion model. Using swap to extend your SRAM to extend your VRAM is not a good idea imho. If it runs in fp8, then it's likely just a matter of not enough VRAM/SRAM and you might need to upgrade to run in fp16. I doubt there's much to be done on the software side, aside from general memory efficiency improvements.

geroldmeisinger commented 1 month ago

Thanks for the suggestion. I tried --lowvram flux-dev fp8_e4m3fn and --lowvram flux-dev-fp8. Both crash(!), which invalidates the OOM hypothesis and the faulty-swap hypothesis. I'm on 0a6b0081176c6233015ec00d004c534c088ddcb0 now.

(Screenshots of the fp8 run: enough SRAM and VRAM left, seconds before the crash.)

geroldmeisinger commented 1 month ago

Upgraded to 64GB SRAM now; same problem with --lowvram. But --normalvram seems to be more stable now, and Comfy does not get killed.

geroldmeisinger commented 1 month ago

still happens on 413322645e713bdda69836620a97d4c9ca66b230 with --lowvram

hartmark commented 1 month ago

I have the same issue. I have an AMD Radeon 7800 XT with 16GB VRAM. I have 32GB system RAM and it starts using some swap, but there's lots free.

I have tried flux-dev-fp8.

I often get a hard lock where I need to hold the power button to turn the machine off; sometimes I can recover using altgr-rei.

I found this for doing some PyTorch tracing; it should work on NVIDIA as well: https://rocmdocs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#pytorch-profiler

hartmark commented 1 month ago

I have dug through some error logs and it seems that zram was failing when I was low on system RAM. Removing zram and just having a swap file solved some crashes for me.

The only issue left now is that I get a soft lock with the kernel spamming this line: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!

Then I found this page: https://bugzilla.kernel.org/show_bug.cgi?id=209163#c4

I guess a failing VRAM malloc call is not handled correctly somewhere. How to do that in Python is outside my code-fu :/

Edit 1: I think I solved it using this kernel parameter: https://www.phoronix.com/forums/forum/software/linux-gaming/1304169-dxvk-1-9-4-released-with-better-support-for-god-of-war?p=1304262#post1304262

Edit 2: Nope, it still happens, but I was able to do quite a lot of image generations before I got this error.

geroldmeisinger commented 1 month ago

still happens on 2622c55aff9433d425a62e5f6c379cf22a42139e with --lowvram

comfyanonymous commented 1 month ago

Kernel panic means there's something wrong with your driver/OS config or hardware.

I always get OOMs (both VRAM and RAM), sometimes on purpose when testing various things, and have never had my kernel panic because of it.

geroldmeisinger commented 1 month ago

It only happens in lowvram mode; without it, it works. With normalvram I too get OOMs all the time without a kernel panic. I just tried reverting to CUDA 12.3.2 and driver version 545.23.08 (the oldest version that supports Debian 12). Same issue :/ I also checked the whole hardware for any issues (see first post), unless there is something missing.

  1. sudo apt-get remove --purge nvidia-* libnvidia-* libxnvctrl* cuda*
  2. Reboot
  3. Install cuda 12.3.2
  4. Install driver: sudo apt-get install -y cuda-drivers
  5. Reboot
  6. nvidia-smi
Sun Aug 18 10:44:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |

git pull
rm -fR venv
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python main.py --lowvram --disable-all-custom-nodes

geroldmeisinger commented 1 month ago

Intel 13th generation CPU bug

BIOS updated, Robeytech test (Windows): everything passed => same issue

Power consumption

I also ran Prime95 + 3DMark (Windows) for 5 minutes => stable

hsinyu-chen commented 1 month ago

Did you find something? I use Forge, and Flux crashes my PC or the video card driver every second generation. I also need to limit the VRAM weight to 20G to generate the first image at all, otherwise Forge crashes itself. It works fine if I use the fp8 model.

13900K (I tested everything and it seems fine), 4090, 64G

geroldmeisinger commented 1 month ago

Did you find something? I use Forge, and Flux crashes my PC or the video card driver every second generation. I also need to limit the VRAM weight to 20G to generate the first image at all, otherwise Forge crashes itself. It works fine if I use the fp8 model.

No, except that it works with --normalvram. For me, fp8 doesn't work with --lowvram either. So there is something in the lowvram handling of (any) Flux model which causes the kernel to crash.

geroldmeisinger commented 4 weeks ago

@rsl8 thanks for the hint btw, apparently I had my linux-image set to manual

Set up a fresh Debian 12.6 system

cat /proc/version
Linux version 6.1.0-23-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15)

(this is the latest stable Linux kernel in Debian)

same issue

geroldmeisinger commented 3 weeks ago

echo "deb http://deb.debian.org/debian bookworm-backports main contrib non-free non-free-firmware" | sudo tee -a /etc/apt/sources.list
sudo apt update
sudo apt install -t bookworm-backports linux-image-amd64
# reboot
cat /proc/version
# Linux version 6.9.7+bpo-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.9.7-1~bpo12+1 (2024-07-03)

same issue

geroldmeisinger commented 3 weeks ago

I think I found the problem: I may be affected by the Intel Raptor Lake instability and degradation issue (elevated operating voltage) after all.

Before you do anything, UPDATE YOUR BIOS OR YOU MAY DAMAGE YOUR CPU!

Update your BIOS before you do this and make sure the update notes include something like "Update microcode 0x129 to address the sporadic Vcore elevation behavior announced by Intel".
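
A sketch for verifying which microcode revision is actually loaded after the BIOS update (the exact wording of the dmesg line varies by kernel version):

# microcode revision as reported by the running kernel
grep -m1 microcode /proc/cpuinfo
sudo dmesg | grep -i microcode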


The following models are affected:

13th gen:
i9-13900KS
i9-13900K
i9-13900KF
i9-13900F
i9-13900
i7-13700K
i7-13700KF
i7-13790F
i7-13700F
i7-13700
i5-13600K
i5-13600KF

14th gen:
i9-14900KS
i9-14900K
i9-14900KF
i9-14900F
i9-14900
i7-14700K
i7-14700KF
i7-14790F
i7-14700F
i7-14700
i5-14600K
i5-14600KF

Solution

Load a low-voltage profile in UEFI (I never tried this before because I assumed the BIOS defaults were fine). With "E-core disable" I am able to run Flux with --lowvram. The profile may be called differently by your mainboard vendor.
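
A quick way to confirm the setting took effect after rebooting into Linux (a sketch; on an i7-13700F with 8 P-cores + 8 E-cores and 24 threads, the count should drop from 24 to 16 logical CPUs once the E-cores are off):

# the kernel should report fewer logical CPUs with E-cores disabled
nproc
lscpu | grep -E 'Core|Thread|^CPU\(s\)'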

hsinyu-chen commented 3 weeks ago

@geroldmeisinger I can pass all of the tests

geroldmeisinger commented 3 weeks ago

Can you describe your symptoms in more detail, please?

ltdrdata commented 3 weeks ago

Recently, when upgrading my computer, I avoided Intel because of that issue. I see that's how the symptoms manifest... You must have gone through a lot of trouble to identify the cause.

geroldmeisinger commented 3 weeks ago

You must have gone through a lot of trouble to identify the cause.

Thanks for the empathy! Yes.

geroldmeisinger commented 3 weeks ago

I tried multiple CPU benchmarks on Windows 10 with "E-Core disable" and "Spec Enhance" performance profiles:


GeekBench 6 errors out and crashes on "Spec Enhance" (blue screen of death: "Clock_Watchdog_Timeout"). The performance loss is about 20-25%(!), which is...


I called Intel support and they will exchange my processor.