comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Flux Controlnet crashes GPU #4676

Closed: jussitus closed this issue 3 months ago

jussitus commented 3 months ago

Expected Behavior

Generating with the InstantX Canny ControlNet should complete without exhausting VRAM or freezing the system.

Actual Behavior

Using the InstantX Canny ControlNet fills up my VRAM at the KSampler step and freezes my system (Fedora, running ComfyUI in a podman container). I can prevent the crash by using --reserve-vram 4.0 (the InstantX ControlNet is ~3 GB).
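
For reference, the flag is passed at launch (python main.py --reserve-vram 4.0) and holds that many GB of VRAM back from ComfyUI's model-load budget. Below is a minimal sketch of the idea, illustrative only and not ComfyUI's actual accounting; it assumes only torch.cuda.mem_get_info, which maps to HIP on ROCm builds:

import torch

def load_budget_bytes(reserve_gb: float, device: int = 0) -> int:
    # mem_get_info returns (free, total) in bytes for the device
    free, _total = torch.cuda.mem_get_info(device)
    reserve = int(reserve_gb * 1024**3)
    # whatever is left after the reserve is what model weights may use
    return max(free - reserve, 0)

# With --reserve-vram 4.0 on a 16 GB card, roughly 12 GB remains for
# the Flux weights plus the ~3 GB InstantX controlnet.
print(load_budget_bytes(4.0) / 1024**3, "GB available for models")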

Steps to Reproduce

Run the attached workflow (cnet.json, posted below) with the InstantX Canny ControlNet and a Flux model, then queue a generation.

Debug Logs

This happens without custom nodes too, but this is the only log I could get:

## ComfyUI-Manager: installing dependencies done.
[2024-08-29 10:30] ** ComfyUI startup time: 2024-08-29 10:30:46.045919
[2024-08-29 10:30] ** Platform: Linux
[2024-08-29 10:30] ** Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0]
[2024-08-29 10:30] ** Python executable: /home/runner/ComfyUI/venv/bin/python
[2024-08-29 10:30] ** ComfyUI Path: /home/runner/ComfyUI
[2024-08-29 10:30] ** Log path: /home/runner/comfyui.log
[2024-08-29 10:30]
Prestartup times for custom nodes:
[2024-08-29 10:30]    0.6 seconds: /home/runner/ComfyUI/custom_nodes/ComfyUI-Manager
[2024-08-29 10:30]
Total VRAM 16368 MB, total RAM 63244 MB
[2024-08-29 10:30] pytorch version: 2.5.0.dev20240826+rocm6.1
[2024-08-29 10:30] Set vram state to: LOW_VRAM
[2024-08-29 10:30] Device: cuda:0 AMD Radeon RX 7800 XT : native
[2024-08-29 10:30] Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --use-split-cross-attention
[2024-08-29 10:30] [Prompt Server] web root: /home/runner/ComfyUI/web
[2024-08-29 10:30] /home/runner/ComfyUI/venv/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
[2024-08-29 10:30] ### Loading: ComfyUI-Impact-Pack (V7.3.2)
[2024-08-29 10:30] ### Loading: ComfyUI-Impact-Pack (Subpack: V0.6)
[2024-08-29 10:30] [Impact Pack] Wildcards loading done.
[2024-08-29 10:30] [comfyui_controlnet_aux] | INFO -> Using ckpts path: /home/runner/ComfyUI/custom_nodes/comfyui_controlnet_aux/ckpts
[2024-08-29 10:30] [comfyui_controlnet_aux] | INFO -> Using symlinks: False
[2024-08-29 10:30] [comfyui_controlnet_aux] | INFO -> Using ort providers: ['CUDAExecutionProvider', 'DirectMLExecutionProvider', 'OpenVINOExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider', 'CoreMLExecutionProvider']
[2024-08-29 10:30] /home/runner/ComfyUI/custom_nodes/comfyui_controlnet_aux/node_wrappers/dwpose.py:26: UserWarning: DWPose: Onnxruntime not found or doesn't come with acceleration providers, switch to OpenCV with CPU device. DWPose might run very slowly
  warnings.warn("DWPose: Onnxruntime not found or doesn't come with acceleration providers, switch to OpenCV with CPU device. DWPose might run very slowly")
[2024-08-29 10:30] ### Loading: ComfyUI-Manager (V2.50.2)
[2024-08-29 10:30] ### ComfyUI Revision: 2626 [b33cd610] | Released on '2024-08-28'
[2024-08-29 10:30] ------------------------------------------
[2024-08-29 10:30] Comfyroll Studio v1.76 :  175 Nodes Loaded
[2024-08-29 10:30] ------------------------------------------
[2024-08-29 10:30] ** For changes, please see patch notes at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/blob/main/Patch_Notes.md
[2024-08-29 10:30] ** For help, please see the wiki at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/wiki
[2024-08-29 10:30] ------------------------------------------
[2024-08-29 10:30]
Import times for custom nodes:
[2024-08-29 10:30]    0.0 seconds: /home/runner/ComfyUI/custom_nodes/websocket_image_save.py
[2024-08-29 10:30]    0.0 seconds: /home/runner/ComfyUI/custom_nodes/sd-dynamic-thresholding
[2024-08-29 10:30]    0.0 seconds: /home/runner/ComfyUI/custom_nodes/ComfyUI-Custom-Scripts
[2024-08-29 10:30]    0.0 seconds: /home/runner/ComfyUI/custom_nodes/ComfyUI_essentials
[2024-08-29 10:30]    0.0 seconds: /home/runner/ComfyUI/custom_nodes/comfyui_controlnet_aux
[2024-08-29 10:30]    0.0 seconds: /home/runner/ComfyUI/custom_nodes/ComfyUI_Comfyroll_CustomNodes
[2024-08-29 10:30]    0.0 seconds: /home/runner/ComfyUI/custom_nodes/ComfyUI-Manager
[2024-08-29 10:30]    0.2 seconds: /home/runner/ComfyUI/custom_nodes/ComfyUI-Impact-Pack
[2024-08-29 10:30]    0.3 seconds: /home/runner/ComfyUI/custom_nodes/ComfyUI-Florence2
[2024-08-29 10:30]
[2024-08-29 10:30] Starting server

[2024-08-29 10:30] To see the GUI go to: http://0.0.0.0:8080
[2024-08-29 10:30] [ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[2024-08-29 10:30] [ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[2024-08-29 10:30] [ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
[2024-08-29 10:30] [ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[2024-08-29 10:30] [ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
[2024-08-29 10:31] FETCH DATA from: /home/runner/ComfyUI/custom_nodes/ComfyUI-Manager/extension-node-map.json [DONE]
[2024-08-29 10:31] got prompt
[2024-08-29 10:31] Using split attention in VAE
[2024-08-29 10:31] Using split attention in VAE
[2024-08-29 10:31] model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
[2024-08-29 10:31] model_type FLUX
[2024-08-29 10:31] /home/runner/ComfyUI/venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
[2024-08-29 10:31] Requested to load FluxClipModel_
[2024-08-29 10:31] Loading 1 new model
[2024-08-29 10:31] loaded completely 0.0 9319.23095703125 True
[2024-08-29 10:31] clip missing: ['text_projection.weight']
[2024-08-29 10:31] Warning torch.load doesn't support weights_only on this pytorch version, loading unsafely.
[2024-08-29 10:31] Requested to load ControlNetFlux
[2024-08-29 10:31] Requested to load Flux
[2024-08-29 10:31] Loading 2 new models
[2024-08-29 10:31] loaded completely 0.0 6117.85546875 True
[2024-08-29 10:31] loaded partially 3953.8910156250004 3953.7802734375 0
[2024-08-29 10:31] Requested to load AutoencodingEngine
[2024-08-29 10:31] Loading 1 new model
[2024-08-29 10:31] loaded completely 0.0 319.7467155456543 True
[2024-08-29 10:31] loaded partially 8164.859375 8160.196350097656 0
[2024-08-29 10:31] ran out of memory while running softmax in  _get_attention_scores_no_kv_chunking, trying slower in place softmax instead
[2024-08-29 10:31] ran out of memory while running softmax in  _get_attention_scores_no_kv_chunking, trying slower in place softmax instead
[2024-08-29 10:31] ran out of memory while running softmax in  _get_attention_scores_no_kv_chunking, trying slower in place softmax instead
[2024-08-29 10:31] ran out of memory while running softmax in  _get_attention_scores_no_kv_chunking, trying slower in place softmax instead
[2024-08-29 10:31] ran out of memory while running softmax in  _get_attention_scores_no_kv_chunking, trying slower in place softmax instead
[2024-08-29 10:31]
[2024-08-29 10:31] !!! Exception during processing !!! HIP out of memory. Tried to allocate 52.00 MiB. GPU 0 has a total capacity of 15.98 GiB of which 164.00 MiB is free. Of the allocated memory 15.22 GiB is allocated by PyTorch, and 64.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-08-29 10:31] Traceback (most recent call last):
  File "/home/runner/ComfyUI/execution.py", line 317, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/execution.py", line 192, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/home/runner/ComfyUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 612, in sample
    samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 716, in sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 695, in inner_sample
    samples = sampler.sample(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 600, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/k_diffusion/sampling.py", line 144, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 299, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 682, in __call__
    return self.predict_noise(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 685, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 279, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/samplers.py", line 228, in calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/model_base.py", line 142, in apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/ldm/flux/model.py", line 159, in forward
    out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/ldm/flux/model.py", line 118, in forward_orig
    img, txt = block(img=img, txt=txt, vec=vec, pe=pe)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/ldm/flux/layers.py", line 164, in forward
    attn = attention(torch.cat((txt_q, img_q), dim=2),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/ldm/flux/math.py", line 8, in attention
    q, k = apply_rope(q, k, pe)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/ComfyUI/comfy/ldm/flux/math.py", line 32, in apply_rope
    xk_ = xk.float().reshape(*xk.shape[:-1], -1, 1, 2)
          ^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 52.00 MiB. GPU 0 has a total capacity of 15.98 GiB of which 164.00 MiB is free. Of the allocated memory 15.22 GiB is allocated by PyTorch, and 64.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

[2024-08-29 10:31] Got an OOM, unloading all loaded models.
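
Worth noting: the allocator hint in the traceback (PYTORCH_HIP_ALLOC_CONF=expandable_segments:True) must be set before PyTorch initializes its caching allocator. A minimal sketch; the variable name and value are taken verbatim from the error message above:

import os

# Must be set before torch initializes the HIP caching allocator;
# equivalently, export it in the shell before launching ComfyUI.
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"

import torch  # the allocator reads the variable on first CUDA/HIP use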

Other

No response

JorgeR81 commented 3 months ago

I can't use it either.

I don't OOM, but at the KSampler stage, GPU activity spikes to 100%, even before the first step, and it just gets stuck.

I tried with --reserve-vram 1.2, but it still gets stuck.

ltdrdata commented 3 months ago

Which workflow and model files are you using?

jussitus commented 3 months ago

Which workflow and model files are you using?

cnet.json

ltdrdata commented 3 months ago

This issue is confirmed. Just applying InstantX Canny + flux-dev-fp8 causes an OOM (with CFG Guider used instead of BasicGuider). https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny/tree/main

JorgeR81 commented 3 months ago

I tried the Depth one from the latest commit (https://github.com/comfyanonymous/ComfyUI/commit/ea3f39bd6906dd455c867198d4d94152e76ad074), and it works with GGUF models: https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Depth

JorgeR81 commented 3 months ago

By the way, the Depth ControlNet shows a strange behavior that is only visible in the console: it starts to generate, stops before the first step, loads some more models, and then finally completes all the steps (see the partial-loading sketch after the log below).
But it seems to work fine anyway.

got prompt
Requested to load ControlNetFlux
Loading 1 new model
loaded partially 2311.0572191467286 2310.966796875 0
loaded partially 3378.4771410217286 3378.38671875 0
  0%|                                                                                           | 0/12 [00:00<?, ?it/s]Requested to load AutoencodingEngine
Loading 1 new model
loaded completely 0.0 319.7467155456543 True
loaded partially 5566.4527809143065 5565.73828125 0
loaded partially 342.82582778930646 342.4921875 0
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [01:35<00:00,  7.93s/it]
Requested to load AutoencodingEngine
Loading 1 new model
loaded completely 0.0 319.7467155456543 True
Prompt executed in 111.03 seconds
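
The "loaded partially" lines above appear to be ComfyUI's low-VRAM path moving only as much of a model onto the GPU as the current budget allows, which would explain the pause-and-resume pattern. A rough sketch of the general technique, assuming a plain torch.nn.Module; this is illustrative only, not ComfyUI's actual loader:

import torch

def load_partially(model: torch.nn.Module, budget_bytes: int, device: str = "cuda") -> int:
    # Move top-level submodules to the GPU until the byte budget is
    # exhausted; everything after that stays on the CPU and would be
    # swapped in on demand during sampling.
    used = 0
    for module in model.children():
        size = sum(p.numel() * p.element_size() for p in module.parameters())
        if used + size > budget_bytes:
            break  # remaining modules stay on the CPU
        module.to(device)
        used += size
    return used  # bytes actually placed on the GPU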

JorgeR81 commented 3 months ago

With more testing, I noticed other issues with ControlNet Depth:

comfyanonymous commented 3 months ago

The flux controlnet OOM should be fixed for most people now.

JorgeR81 commented 3 months ago

For me, the Canny one improved somewhat, but it still does not work. Now I can generate the first step successfully, but then GPU activity goes to 100% and it slows down. I have 8 GB of VRAM and used --reserve-vram 1.2, testing with a GGUF model (Q4_K_S).


EDIT: after a few more tests, it works!

--reserve-vram 1.6 - still not enough, but better; I can get 3 or 4 steps done. VRAM usage was at 7.8 GB.
--reserve-vram 1.8 - It works. VRAM usage reached 7.8 GB (at about step 14 of 20), but it was fast all the way.
--reserve-vram 2.0 - It works. VRAM usage is only at 7.1 GB.
--reserve-vram 2.4 - It works, with 2 LoRAs.
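
VRAM figures like those above can be checked with a small probe while sampling runs; a sketch, assuming any CUDA or ROCm PyTorch build (torch.cuda maps to HIP on AMD):

import torch

def vram_used_gb(device: int = 0) -> float:
    # free/total come from the driver, so this also counts memory held
    # by other processes, matching what a system monitor would show
    free, total = torch.cuda.mem_get_info(device)
    return (total - free) / 1024**3

print(f"{vram_used_gb():.1f} GB in use")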

jussitus commented 3 months ago

The OOM issue was fixed by https://github.com/comfyanonymous/ComfyUI/commit/b643eae08b7f0c8eb69b77bd61e31009bfb325b9