city96 / ComfyUI-GGUF

GGUF Quantization support for native ComfyUI models
Apache License 2.0
901 stars · 56 forks

Add GGUF versions for SD3.5 #136

Open JorgeR81 opened 2 days ago

JorgeR81 commented 2 days ago

SD3.5 Large was just released!

https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo
https://huggingface.co/stabilityai/stable-diffusion-3.5-large

wolfden commented 2 days ago

I got it converted to GGUF F16 by modifying the convert code, but the problem is that llama-quantize doesn't know the sd3 arch. I also had to make an SD3 GGUF loader node for ComfyUI and it works great. Stuck now at making the K-quant files.

SD3 convert https://huggingface.co/ND911/model/blob/main/convert_sd3.py

Nodes https://huggingface.co/ND911/model/blob/main/nodes.py

GGUF https://huggingface.co/ND911/stable-diffusion-3.5-large-GGUF/blob/main/sd3.5_large-F16.gguf

Probably need a patch with directions for llama.cpp.

city96 commented 2 days ago

I've tried with SD3 before; idk what the hell to do about this specific weight, because the first dimension can't be 1 in any of the C++ code, so it just gets stripped and converted to [36864, 2432], which then fails to load when the comfy SD3-specific code hits it.

[screenshot]

The diffusers-format weights don't have that, but those have the q/k/v split, so it'll just fail on the reshape IIRC.

@wolfden I don't think running it in FP16 helps much, because that never involves any of the custom code anyway. Also, if you have to pass in the model type, doesn't that mean that the auto model detection in comfy has failed and that your checkpoint is in an invalid format? (Also, we can pass the model type for SD3 in the default node as well, as a kwarg, without having to have an extra node, since we can just add a switch based on the model arch returned from the loader.)
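
(Roughly the kind of switch I mean, as a toy sketch rather than the actual node code; the kwarg name is hypothetical:)

# Toy sketch: derive a model-type hint from the arch string the loader reports,
# instead of requiring a separate SD3-specific loader node.
def model_type_kwargs(arch: str) -> dict:
    if arch in ("sd3", "sd35"):
        return {"model_type": "sd3"}  # hypothetical kwarg name
    return {}  # everything else keeps relying on comfy's auto-detection

print(model_type_kwargs("sd3"))   # {'model_type': 'sd3'}
print(model_type_kwargs("flux"))  # {}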

cchance27 commented 2 days ago

I was just trying to get it working myself. I added

("sd35", (
    ("joint_blocks.0.context_block.adaLN_modulation.1.bias",),
)),

to the model detection in convert.py so it would detect and label it. I also added "embedder." to the blacklist, since I think those are the layers for the embeddings in the new model :S

For the GGUF node, I added "sd35" at line 66, I believe, to make the UNet loader support the new model label.
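
(Something along these lines, purely illustrative; the real nodes.py check is structured differently and the exact labels are whatever convert.py writes:)

# Illustrative sketch of the whitelist idea, not the actual nodes.py.
SUPPORTED_ARCHS = {"flux", "sd1", "sdxl", "sd3", "sd35"}  # "sd35" being the new label

def check_arch(arch: str) -> str:
    # arch comes from the architecture field the convert script writes into the GGUF
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(f"Unsupported GGUF architecture: {arch}")
    return arch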

The lcpp patch I got to work, but I think I did it wrong, because I likely needed to add better layer exceptions for FP16/FP32; for my test I only added the exceptions for final_layer as FP16 and final_layer.linear as FP32...

Here are some examples, all based on the example workflow from comfy (euler):

Baseline safetensors (16.46 GB): [screenshot]
FP16 GGUF (16.29 GB): [screenshot]
Q8_0 GGUF (8.75 GB): [screenshot]
Q5_K_M GGUF (5.92 GB): something's wrong, it blew up to a black image, so I definitely broke something LOL

wolfden commented 2 days ago

@wolfden I don't think it running in FP16 helps much, because that never involves any of the custom code anyway. [...]

I agree that at F16 it doesn't help much; that wasn't the goal. I'm not a coder, but I can generally figure things out and hack it together. Things work, but again, back to the llama.cpp issue.

city96 commented 2 days ago

@wolfden All good, happy to hear if anyone figures out anything lol.

@cchance27 Actually, it looks like you are on an older version of the convert script. The current master one does pick up SD3 correctly for the python part at least.

[screenshot]

cchance27 commented 2 days ago

Well shit, I'll update over to that version. Have you already figured out which layers need to be kept at F16/F32 in the lcpp patch?

city96 commented 2 days ago

I have a rewritten version of the c++ logic part from a while ago that works with SD3 but I haven't figured out the issue with that pos_embed key being reshaped yet.

city96 commented 2 days ago

Well, I don't think the Python convert + comfy loader code needs any extensive changes? Using this file as a base, it does pick up the format correctly and seems to work fine. I've added SD3 as a valid arch to the loader: https://github.com/city96/ComfyUI-GGUF/commit/0ef158d79348ba725a2ac76376d0273945386b43

[screenshot]

I wonder how janky it would be to just patch that single key in post after the llama.cpp convert step for now just to have some weights for people to download lol.
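
(Or, as an even jankier load-time band-aid instead of touching the file, just a sketch:)

import torch

# Sketch only: re-add the dim the C++ side stripped, after the GGUF state dict is loaded.
sd = {"pos_embed": torch.zeros(36864, 2432)}  # stand-in for the loaded state dict
if sd["pos_embed"].ndim == 2:
    # back to [1, 36864, 2432], the shape the original safetensors checkpoint has
    sd["pos_embed"] = sd["pos_embed"].unsqueeze(0)
print(sd["pos_embed"].shape)  # torch.Size([1, 36864, 2432])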

wolfden commented 2 days ago

Not sure why the git pull didn't pull the latest convert script; tested the new nodes.py and it's working fine here also.

JorgeR81 commented 2 days ago

I think just having gguf FP16 weights could be better than safetensors.

There is an issue with the safetensors library on windows. ( https://github.com/comfyanonymous/ComfyUI/issues/4480#issuecomment-2325350291 )

While loading, it needs about double the safetensors FP16 file size in RAM. I have this issue with Flux: I only have 32 GB of RAM, so it spills into my page file. But this does not happen with the Flux GGUF versions.

cchance27 commented 2 days ago

I think just having gguf FP16 weights could be better than safetensors.

See above, he confirmed that convert.py already works for FP16 GGUF on the latest version; it's the quantizing via the llama.cpp patch that's currently being looked into, as the pos_embed seems to be getting messed up when quantizing.

city96 commented 2 days ago

Some more info about the issue at hand, in case anyone has any better ideas.

The SAI reference checkpoint has a weight shaped [1, 36864, 2432]. In the GGUF file, the order is reversed, and the key is saved as [2432, 36864, 1] - all good so far. The issue is that when the C++ conversion runs, the weights are padded with 1s for the unused dimensions, so this particular one becomes [2432, 36864, 1, 1] and during saving becomes [2432, 36864] with the first (last) dim stripped even if no conversion happens (i.e. kept as FP16).

[screenshot]

In comfy, this just manifests as a shape mismatch while loading, as expected:

size mismatch for pos_embed: copying a param with shape torch.Size([36864, 2432]) from checkpoint, the shape in current model is torch.Size([1, 2432, 2432]).

@cchance27 How did you get your version to bypass that specific issue, assuming you even ran into it lol

Also, here's the current WIP patch, with the issue above still present, but some logic for SD3 added in and the main parts rewritten: lcpp_new.patch. It's for the newest llama.cpp version, since I was curious whether it'd solve the issue (it didn't).

cchance27 commented 2 days ago

I stupidly wiped the stash I had when I saw others were already working on it. The lcpp patch you shared is from a different version than the one I was playing with; it had the checks for leaving some things at FP32 and some at FP16. My first hack attempt that worked wasn't excluding the proper layers from being quantized, except for final_layer.

Is it possible there's code elsewhere in the original quantizer for pos_embed? Pretty sure that's a layer name they use in some LLMs as well, isn't it?

city96 commented 2 days ago

Doesn't seem to be name-related, good call though (I think the LLM-specific ones are called token_embd.weight).

[screenshot]

cchance27 commented 2 days ago

Haha, here's something funny: that Q5 model I was having black screens with and erased because I figured I had messed it up might not have been messed up after all, because for some reason I just got a black render halfway through the steps on the standard safetensors generation :S

city96 commented 2 days ago

Odd, wonder if it needs BF16 like flux does for the inference. The weights are in FP16 though so that'd be odd. The VAE hash is different from SD3 but they seem close enough at least.

BTW, this is the part that's causing the dimension to be stripped: it just checks whether each dim is >1 to decide whether to keep it, and due to the reversed order the leading 1 gets stripped, I think.

(Can't manually set it, because it gets called directly from gguf_add_tensor, I'm pretty sure.)


int ggml_n_dims(const struct ggml_tensor * tensor) {
    // ne[] is in reversed (GGUF) order; dims of size 1 at the high end are not
    // counted, so ne = [2432, 36864, 1, 1] reports as 2 dims and the leading 1 is lost.
    for (int i = GGML_MAX_DIMS - 1; i >= 1; --i) {
        if (tensor->ne[i] > 1) {
            return i + 1;
        }
    }
    return 1;
}
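
For the pos_embed key that plays out like this (a quick Python mirror of the same loop, just to illustrate, not the actual C++):

GGML_MAX_DIMS = 4

def n_dims(ne):
    # same idea as ggml_n_dims(): dims of size 1 at the high end are not counted
    for i in range(GGML_MAX_DIMS - 1, 0, -1):
        if ne[i] > 1:
            return i + 1
    return 1

torch_shape = (1, 36864, 2432)  # pos_embed as stored in the safetensors checkpoint
ne = list(reversed(torch_shape)) + [1] * (GGML_MAX_DIMS - len(torch_shape))
print(ne)          # [2432, 36864, 1, 1]
print(n_dims(ne))  # 2 -> saved as [2432, 36864], the leading 1 is gone
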
city96 commented 2 days ago

Hacky as hell but it works as a proof of concept lol. It's definitely the dimension detection logic. Setting it manually for just that key makes it work.

Edit: wrong code lol:

[screenshot]

[screenshot]

cchance27 commented 2 days ago

@city96 What tag of llama.cpp are you applying the new patch to? For some reason mine isn't lining up the line numbers correctly.

city96 commented 2 days ago

@cchance27 I pushed the patch + the actual tag I was targeting. Should be tags/b3962. Currently testing on SD3 medium lol, small models seem to have lower bitrate quants the same way LLMs do.

Apparently SD3.5 also uses a different shape for the keys, which is why K quants don't work. We could use the reshape logic that's also used for SDXL but that needs possible future logic changes.
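
(If the blocker is the usual 256-element super-block size for K quants, QK_K in llama.cpp, which is an assumption on my part, here's a quick way to check which tensors don't fit; the file path is just a placeholder:)

from safetensors import safe_open

QK_K = 256  # K-quant super-block size in llama.cpp (default build)

# Placeholder path: point it at the SD3.5 large checkpoint the GGUF was converted from.
with safe_open("sd3.5_large.safetensors", framework="pt") as f:
    for name in f.keys():
        shape = f.get_slice(name).get_shape()
        if shape and shape[-1] % QK_K != 0:
            print(name, shape)  # e.g. 2432 % 256 == 128, so those rows can't be K-quantized as-is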

cchance27 commented 2 days ago

I tried a Q8_0 and a Q4_0, and for both, when it loads I get:

Traceback (most recent call last):
  File "/Users/user/AI/ComfyUI/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/Users/user/AI/ComfyUI/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/Users/user/AI/ComfyUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/Users/user/AI/ComfyUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
  File "/Users/user/AI/ComfyUI/custom_nodes/ComfyUI-GGUF/nodes.py", line 258, in load_unet
    sd = gguf_sd_loader(unet_path)
  File "/Users/user/AI/ComfyUI/custom_nodes/ComfyUI-GGUF/nodes.py", line 39, in gguf_sd_loader
    reader = gguf.GGUFReader(path)
  File "/Users/user/AI/ComfyUI/venv/lib/python3.10/site-packages/gguf/gguf_reader.py", line 130, in __init__
    self._build_tensors(offs, tensors_fields)
  File "/Users/user/AI/ComfyUI/venv/lib/python3.10/site-packages/gguf/gguf_reader.py", line 314, in _build_tensors
    data = self._get(data_offs, item_type, item_count).reshape(np_dims),
ValueError: cannot reshape array of size 5914608 into shape (2432,2432)
city96 commented 2 days ago

Huh, strange. It works with sd3 medium, but with large it only works with the (old) gguf-py from the llama.cpp repo but not the package one.

cchance27 commented 2 days ago

I added a print line into that gguf-py where it's crashing, and it seems to be on y_embedder.mlp.2.weight [2432, 2432] 5914624, which is odd because... that's an F16, so not even one of the Q8 layers?

Also it's weird because it didn't crash on the t_embedder.mlp.2.weight that it processed before that, which looks to be the same size and np_dims.

Part of the log I added in the gguf-py site-package:

GGMLQuantizationType.Q8_0 25137152 (2432, 10336) joint_blocks.9.x_block.mlp.fc2.weight
GGMLQuantizationType.F16 89653248 (1, 36864, 2432) pos_embed
GGMLQuantizationType.F16 2432 (2432,) t_embedder.mlp.0.bias
GGMLQuantizationType.F16 622592 (2432, 256) t_embedder.mlp.0.weight
GGMLQuantizationType.F16 2432 (2432,) t_embedder.mlp.2.bias
GGMLQuantizationType.F16 5914624 (2432, 2432) t_embedder.mlp.2.weight
GGMLQuantizationType.F16 2432 (2432,) x_embedder.proj.bias
GGMLQuantizationType.F16 155648 (2432, 16, 2, 2) x_embedder.proj.weight
GGMLQuantizationType.F16 2432 (2432,) y_embedder.mlp.0.bias
GGMLQuantizationType.F16 4980736 (2432, 2048) y_embedder.mlp.0.weight
GGMLQuantizationType.F16 2432 (2432,) y_embedder.mlp.2.bias
GGMLQuantizationType.F16 5914624 (2432, 2432) y_embedder.mlp.2.weight
ERROR:root:!!! Exception during processing !!! cannot reshape array of size 5914608 into shape (2432,2432)

Why would the y_embedder crash but not the t_embedder step?

city96 commented 2 days ago

Okay, I think I got it. I think modifying the metadata after the initial bit where it adds it might be corrupting the data parts, possibly due to the data length changing. That might be because I'm passing it a uint32_t which probably isn't what it should be setting it to, and there's no boundary check there, i.e. it messes up the tensor right after it? Lemme try and fix that lol.
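
(The numbers in the traceback fit that too; just arithmetic, assuming 2 bytes per FP16 element:)

expected = 2432 * 2432  # elements y_embedder.mlp.2.weight should have
got = 5914608           # what the reader ended up with, per the error above
print(expected, expected - got)  # 5914624, 16 -> 16 elements / 32 bytes short, the kind of
                                 # small fixed offset a botched metadata write could cause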

city96 commented 2 days ago

@cchance27 Could you re-test with the new patch?

cchance27 commented 2 days ago

Looking good. Updated it and rebuilt with the new patch, just did a Q8_0 and got past the loader; diffusing now (I'm on a Mac, it takes a bit to diffuse XD).

cchance27 commented 2 days ago

Gonna test all the major quants and then upload them to civit in a couple

city96 commented 2 days ago

Sounds good, I'll put some up on huggingface in that case and then we should be all covered I think.

cchance27 commented 2 days ago

Sounds good, thanks for all the help figuring it out. I'd never have tracked down that uint issue lol, it's been too long since I fought with C++ XD

city96 commented 1 day ago

I guess I'll look at the K quants separately, sometime on the weekend maybe; it also depends on how much community adoption it gets. I think a lot of people prefer the legacy _0 and _1 quants for speed alone lol.

Also, in case you're lazy like me:

name = "sd3.5_large"
path = "~/ComfyUI/models/unet"

# K quants left out for now: "Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q3_K_S", "Q2_K"
for q in ["Q8_0", "Q5_1", "Q5_0", "Q4_1", "Q4_0"]:
    print(f"./llama-quantize {path}/{name}-F16.gguf {path}/{name}-{q}.gguf {q}")

city96 commented 1 day ago

Here's the basic quants:

doogyhatts commented 1 day ago

Got it working on 8gb vram! Thanks for the quantised models!

Amazon90 commented 1 day ago

Here's the basic quants:

Thanks, legend! You’re my hero.

wolfden commented 1 day ago

Thanks All! Awesome!