I am also getting the same issue, seemingly after updating to the latest version from ComfyUI-Manager. I have been at this for almost the entire day trying to figure out why it was closing the websocket, until I tested the install locally and found it was crashing with absolutely no error in the log shortly after starting the workflow:
The log shows exactly what the console shows with --verbose:
I have tried downgrading to previous commits, so maybe it wasn't an update that caused it unless there are files that were updated outside of the custom nodes folder.
https://github.com/city96/ComfyUI-GGUF/tree/35007f2659478359e4386befe31519e0c067e1a1
Are you sure it's the problem caused by the update of ComfyUI-Manager? Then it would be simple. Just revert to a few previous versions
EDIT: This problem returned when I went to load up another previous workflow. I kept deleting the GGUF nodes, resaving the workflow, and even did the same trick outlined below, and it kept crashing or giving me an error about a NoneType. I right-clicked the nodes, recreated them, and disconnected all the links. Some combination of doing this over and over fixed it again.
I thought this was the case. I even tried a brand new install of ComfyUI portable, with a brand new install of ComfyUI-GGUF and ComfyUI-Manager to download the missing nodes for my existing saved workflow. I have done this multiple times; the only constant was the same workflow file.
I even tried deleting the Unet Loader (GGUF) node, then creating it again from within ComfyUI in my workflow and re-saving, with the same crashing issue.
I did find something that worked, but I can't be sure of what fixed it completely. I created an entirely new workflow and put a single Unet Loader (GGUF) node in it and saved. That resulted in the following code for that node:
{
  "id": 1,
  "type": "UnetLoaderGGUF",
  "pos": {
    "0": 817,
    "1": 397
  },
  "size": {
    "0": 315,
    "1": 58
  },
  "flags": {},
  "order": 0,
  "mode": 0,
  "inputs": [],
  "outputs": [
    {
      "name": "MODEL",
      "type": "MODEL",
      "links": null,
      "shape": 3
    }
  ],
  "properties": {
    "Node name for S&R": "UnetLoaderGGUF"
  },
  "widgets_values": [
    "FLUX\\flux1-dev-Q2_K.gguf"
  ]
}
I then compared it to my saved workflow node:
I replaced the outputs section of the node in my workflow with the one from the single node in the test workflow, and it works now. I really don't understand why, because now I can't reproduce the issue even when I revert the changes and test again. Maybe saving a new workflow fixed it, or modifying that code refreshed another node it was connected to later; I have no clue.
{
  "id": 259,
  "type": "UnetLoaderGGUF",
  "pos": {
    "0": 2250,
    "1": 30
  },
  "size": {
    "0": 390,
    "1": 60
  },
  "flags": {},
  "order": 10,
  "mode": 0,
  "inputs": [],
  "outputs": [
    {
      "name": "MODEL",
      "type": "MODEL",
      "links": [
        607
      ],
      "slot_index": 0,
      "shape": 3
    }
  ],
  "properties": {
    "Node name for S&R": "UnetLoaderGGUF"
  },
  "widgets_values": [
    "FLUX\\flux1-dev-Q4_K_S.gguf"
  ]
},
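For anyone who wants to try the same edit without doing it by hand, here is a minimal sketch that automates it; the file names are just placeholders, and it only reproduces the manual workaround described above (which, going by the rest of the thread, was probably not the real fix). It loads the saved workflow, finds every UnetLoaderGGUF node, and resets its outputs to the freshly created form ("links": null, no "slot_index"):

import json

# Hypothetical paths; point these at your own workflow file.
WORKFLOW_PATH = "my_workflow.json"
OUTPUT_PATH = "my_workflow.fixed.json"

with open(WORKFLOW_PATH, "r", encoding="utf-8") as f:
    workflow = json.load(f)

for node in workflow.get("nodes", []):
    if node.get("type") == "UnetLoaderGGUF":
        for output in node.get("outputs", []):
            # Match the freshly created node: no links, no slot_index.
            output["links"] = None
            output.pop("slot_index", None)

with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    json.dump(workflow, f, indent=2)

Since this clears the output links, the MODEL connection has to be re-made in the editor afterwards, which matches the delete/reconnect dance described in the EDIT above.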
I think this issue and https://github.com/city96/ComfyUI-GGUF/issues/95 are the same - crash to console without error.
The part I don't get is that it's supposed to print these when the model loading happens:
ggml_sd_loader:
0 471
14 304
1 5
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
The top print happens after the gguf file was successfully loaded, and the bottom one happens when ComfyUI identifies the model based on the state dict. I can't think of anything between the two that would cause a hard crash (I don't think anything tries to write to the mmap tensor here?).
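For context, the loading step in question boils down to roughly the sketch below (simplified; the real code in nodes.py also keeps the quantization type per tensor). It uses the gguf package's GGUFReader: the tensors come back as read-only numpy views over the memory-mapped file, and torch.from_numpy wraps them without copying, which is why nothing should be writing to them at this point.

import torch
from gguf import GGUFReader

def load_gguf_state_dict(path: str) -> dict:
    """Rough sketch: expose the mmap-backed gguf tensors as torch tensors."""
    reader = GGUFReader(path)  # memory-maps the .gguf file
    sd = {}
    for tensor in reader.tensors:
        # tensor.data is a read-only numpy view over the mmap; from_numpy
        # wraps it without a copy, hence the "not writable" UserWarning.
        sd[tensor.name] = torch.from_numpy(tensor.data)
    return sd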
There's nothing related to the frontend/workflow that's special about these nodes other than the "title" being set in the __init__ file, but the node title/name displays correctly in your screenshot, and I think if it were an issue with the frontend it would crash a lot sooner?
On the linked issue above @kovern mentions getting it with FP16/FP8 too if I read it right https://github.com/city96/ComfyUI-GGUF/issues/95#issuecomment-2332808424
What is further interesting to me is that Comfy theoretically supports split loading of the flux model, so I wouldn't actually even need the gguf model, but when I let Comfy handle the split model loading (by choosing the fp8 or fp16 version) it crashes the same way: silently, without any message.
I guess one thing you could check is Windows Event Viewer under "Windows Logs -> Application"; there could be entries with the level "Error" and a possible error code/description (near the top it should say "Faulting application name: python.exe").
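If it's easier than clicking through Event Viewer, the same entries can be pulled from a terminal; a rough equivalent using the built-in wevtutil CLI, called from Python, would be:

import subprocess

# Query the 5 most recent Error-level (Level=2) events from the Application log.
cmd = [
    "wevtutil", "qe", "Application",
    "/q:*[System[Level=2]]",
    "/c:5", "/rd:true", "/f:text",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)

Crashes from python.exe should show up there with the faulting module and exception code (0xc0000005 being an access violation).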
I'll apologize in advance; this may not be an issue with gguf. I'm here to seek help, and I hope there won't be any misunderstandings. My problem may be the same as issue #95, but I checked that post and didn't see a final solution.
Log Name: Application
Source: Application Error
Date: 2024/9/8 6:31:16
Event ID: 1000
Task Category: Application Crash Event
Level: Error
Description:
Faulting application name: python.exe, version: 3.10.11150.1013, time stamp: 0x642cc427
Faulting module name: c10.dll, version: 0.0.0.0, time stamp: 0x66145942
Exception code: 0xc0000005
Fault offset: 0x0000000000063064
Faulting process ID: 0x44A4
Faulting application start time: 0x1DB017575172A42
Faulting application path: E:\ComfyUI-aki-v1.3\python\python.exe
Faulting module path: E:\ComfyUI-aki-v1.3\python\lib\site-packages\torch\lib\c10.dll
Report ID: 1ec8b37d-7ecb-4297-ac18-279aa9e8362a
Faulting package full name:
Faulting package-relative application ID:
Event XML:
@skimy2023 Even if it's not directly caused by gguf, the issue does affect it, so opening the issue here is fine. Also, thanks for the log.
I'm able to reproduce it on torch 2.4.0 if I completely disable pagefile, so it's definitely memory related somehow. Strangely, with pytorch 2.0.1 (very old version) I get a proper error instead of a crash:
But with torch 2.4.0 I get the same as everyone else, even if I have enough memory to load the model. For some reason it seems to rely on/is trying to allocate pagefile...? Possibly due to windows not over-committing memory. This means we'd have to not only optimize the actual "used" memory but also the "committed" memory (visible in task manager) somehow.
So yes, adding (more) pagefile in windows would "fix" it but the issue is definitely strange.
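If anyone wants to see the "committed" side of this for themselves (what Task Manager shows as the commit charge), here is a small sketch, assuming psutil is installed; on Windows, the vms figure psutil reports corresponds to the process's pagefile-backed commit, and it grows as soon as the placeholder weights are constructed, even though they are never read.

import psutil
import torch

proc = psutil.Process()

def committed_mb() -> float:
    # On Windows, vms reflects the process's pagefile commit charge.
    return proc.memory_info().vms / 1024**2

print(f"before: {committed_mb():.0f} MiB committed")

# Construct some large fp16 layers the normal way; their storage is
# committed immediately, even though the values will just be overwritten
# by the real checkpoint weights later.
layers = [torch.nn.Linear(4096, 4096, dtype=torch.float16) for _ in range(32)]

print(f"after:  {committed_mb():.0f} MiB committed")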
Thank you for your feedback! I am currently using PyTorch 2.3, and your mention of memory issues has raised some concerns for me. Is there any advice you can offer regarding my situation? For example, should I stick with the current version, or consider upgrading to 2.4.0 or a higher version? I am also looking into other memory optimization methods.
At the same time, I encountered an error related to the Desktop Window Manager (dwm.exe), and the error message is as follows:
Faulting application name: dwm.exe
Faulting module name: dwmcore.dll
Exception code: 0xc00001ad
This might be related to insufficient memory or graphics drivers. I plan to try updating the graphics drivers and checking the memory. If you have any additional suggestions regarding these issues, please let me know. Thank you very much!
I think you can try adding more pagefile / setting the amount manually and seeing if it helps, possibly on a second drive. Keep in mind that this can cause wear on your SSD and will take up some disk space.
Also, just to verify I tested it, and it crashes with the safetensors one as well.
I have 32GB for my pagefile and 32GB RAM on my Desktop PC which is also experiencing the same issue as my laptop that has 16 and 16.
I do run into Out of Memory allocation errors on both while loading either the model or the CLIPs with the GGUF loaders, and I need to run ComfyUI with the --disable-cuda-malloc launch parameter.
The issue seems to keep returning just when I think I may have fixed it by recreating the nodes, but maybe it is just working coincidentally now and then as I try things.
It is still the same problem, but surprisingly, the non-GGUF models are available, though they are just very slow.
Starting server
To see the GUI go to: http://127.0.0.1:8188
FETCH DATA from: E:\ComfyUI-aki-v1.3\custom_nodes\ComfyUI-Manager\extension-node-map.json [DONE]
got prompt
Using xformers attention in VAE
Using xformers attention in VAE
E:\ComfyUI-aki-v1.3\custom_nodes\ComfyUI-GGUF\nodes.py:79: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:212.)
torch_tensor = torch.from_numpy(tensor.data) # mmap
ggml_sd_loader:
GGMLQuantizationType.F16 476
GGMLQuantizationType.Q5_1 304
[Program crashed, exit code 3221225477 (0xC0000005)] Below is an analysis of the exit code. It may not be accurate, please use it as a reference only!
System exit code name: ACCESS_VIOLATION
System exit code description: The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.
Crash stack trace:
Windows fatal exception: access violation
Stack (most recent call first):
File "E:\ComfyUI-aki-v1.3\python\lib\site-packages\torch\nn\modules\linear.py", line 98 in init
File "E:\ComfyUI-aki-v1.3\comfy\ldm\flux\layers.py", line 210 in init
File "E:\ComfyUI-aki-v1.3\comfy\ldm\flux\model.py", line 81 in
Just an update of what I have tried today.
I set up testing installations of multiple different ComfyUI versions and ensured these did not auto-update on launch by removing --windows-standalone-build launch param.
ComfyUI-Manager was installed, however, and it did update some dependencies. I will try a new workflow without ComfyUI-Manager installed, using my backup of the different ComfyUI versions.
I have tried ComfyUI-GGUF commits 69f0daf, c8923a4, and 0342f0a.
I have tried many variations of these commands, which I put in a bat file to install these packages locally to the ComfyUI installs:
.\python_embeded\python.exe -s -m pip install "gguf>=0.9.1" --force-reinstall --no-cache-dir --upgrade --no-deps
.\python_embeded\python.exe -s -m pip install "numba>=0.60.0" --force-reinstall --no-cache-dir --upgrade --no-deps
.\python_embeded\python.exe -s -m pip install "numpy<2.0.0" --force-reinstall --no-cache-dir --upgrade --no-deps
I have especially tried higher and lower versions of numpy, and 0.6.0 and 0.9.1 of gguf. numpy>2.0.0 requires some of the scripts to be modified to support the new methods, and I didn't get that far.
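To confirm what each embedded install actually ended up with after all the reinstalls, a quick check like the one below (run with the same python_embeded interpreter) prints the resolved versions; the package list is just an example.

from importlib.metadata import version, PackageNotFoundError

for pkg in ("gguf", "numpy", "numba", "torch"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")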
I can't even get the right combo to where I can get it to load a GGUF model/clip anymore using my previous workflow.
UPDATED: An even simpler workflow with just load and save still crashes out:
This simple workflow on the most up-to-date versions of ComfyUI and ComfyUI-GGUF still results in the crash without error below:
Same issue occurs with ComfyUI v0.1.3 both with and without --disable-cuda-malloc launch param:
I have tested these variations:
@Dayuppy The last ComfyUI-GGUF commit that'll likely work is 7f3ced6, with the old, more fragile logic, since that one doesn't initialize the layers using pytorch and by extension reserves less memory.
UPDATE: Commit 454955e fixed this issue for me. No need to download the previous commit, you can update instead.
@Dayuppy Could you test the latest version real quick? I pushed a change that might work.
That commit worked on latest ComfyUI! @skimy2023 See if this helps you out?
I don't think this fix breaks anything since after the initial loading part it should be the same as the old one. Do report if anything is borked though lol.
For a breakdown of the issue:
- When comfy detects the model type, it initializes an "empty" model to load the weights into.
- The weights in this model are not used, but pytorch reserves memory for the full FP16(?) weights on windows.
- Windows tries to commit pagefile for the full 24GB FP16 model, and runs out of memory / hits the max pagefile limit / runs out of disk space. On Windows, even the "reserved" memory needs somewhere to go, so it throws an error (see the sketch below).
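A minimal sketch of the idea (only an illustration of the mechanism, not the actual change in the commit): constructing layers on pytorch's "meta" device creates shape-only tensors that are never backed by RAM or pagefile, whereas default construction commits real storage for placeholder weights that will be thrown away anyway.

import torch

# Default construction: storage for weight/bias is allocated (and, on
# Windows, committed against the pagefile) even though the values are
# immediately replaced by the real checkpoint weights later.
normal = torch.nn.Linear(4096, 4096, dtype=torch.float16)

# "meta" construction: tensors carry only shape/dtype metadata, so nothing
# is allocated or committed for the placeholder weights.
with torch.device("meta"):
    empty = torch.nn.Linear(4096, 4096, dtype=torch.float16)

print(normal.weight.device)  # cpu  -> real, committed storage
print(empty.weight.device)   # meta -> shape-only, no storage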
(@Dayuppy also, assuming no further crashes, could you put a note about the fix working in your comment above with the instruction? Just so people reading the thread sequentially don't revert to that old version for nothing lol)
After updating to the latest version of GGUF, my issue has been resolved. Thank you to the original poster for addressing the problem; you are very responsible and technically proficient!
Others or @Dayuppy 1) Where do you find workflows? 2) Where do you put the t5 xxl encoder? In unet or clip?
1) Join the ComfyUI discord for questions like these ( https://www.comfy.org/discord ). This isn't really the appropriate place for questions not pertaining to this specific issue.
2) Still not really the appropriate place, but T5 XXL Encoder is a clip and can be loaded with the Dual Clip Loader this mod provides along with the Flux clip model.
Ok, thanks. (I was asking about the folders under comfyui\models.) But I figured it out. It was confusing because sometimes flux goes under unet, sometimes under checkpoints, and then with GGUF I thought it would be different again, but it does not seem to change the nature of things: the gguf clip stays in the clip folder, while the unet of nf4 etc. changes directory, hence the question. As for workflows, I was wondering because there is no example folder here. Anyway, thanks. Will check the Discord.
Why does ComfyUI disconnect and show "reconnecting" when I load workflows related to gguf flux, stopping at around 15%? I am sure my workflow setup is correct, and the selected UNet, CLIP, and VAE are definitely accurate. I'm hoping someone can answer!