JorgeR81 opened 2 days ago
I got it converted to GGUF F16 by modifying the convert code, but the problem is that llama-quantize doesn't know the sd3 arch. I also had to make an SD3 GGUF loader node for ComfyUI, and it works great. I'm stuck now at making the K-quant files.
SD3 convert https://huggingface.co/ND911/model/blob/main/convert_sd3.py
Nodes https://huggingface.co/ND911/model/blob/main/nodes.py
GGUF https://huggingface.co/ND911/stable-diffusion-3.5-large-GGUF/blob/main/sd3.5_large-F16.gguf
Probably needs a patch with directions for llama.cpp.
I've tried with SD3 before. I don't know what the hell to do about this specific weight, because the first dimension can't be 1 in any of the C++ code, so it just gets stripped and converted to [36864, 2432],
which then fails to load when the comfy SD3-specific code hits it.
The diffusers-format weights don't have that key, but those ones have the q/k/v split, so they'll just fail on the reshape IIRC.
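For illustration, here's a minimal sketch of what fusing diffusers-style split q/k/v weights back into a single qkv tensor could look like. The key names and the `fuse_qkv` helper are hypothetical, not the actual convert code:

```python
import numpy as np

def fuse_qkv(sd, prefix):
    # Concatenate the separate q/k/v projection weights along the output
    # dim to mimic the reference checkpoint's fused qkv tensor.
    q = sd[f"{prefix}.to_q.weight"]
    k = sd[f"{prefix}.to_k.weight"]
    v = sd[f"{prefix}.to_v.weight"]
    return np.concatenate([q, k, v], axis=0)

# Toy state dict: 8x8 weights filled with 0 (q), 1 (k), 2 (v).
dim = 8
sd = {f"attn.to_{n}.weight": np.ones((dim, dim)) * i
      for i, n in enumerate("qkv")}
fused = fuse_qkv(sd, "attn")
print(fused.shape)  # (24, 8)
```

The real converter would of course also have to handle biases and the reversed mapping, but this is the shape-level idea.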
@wolfden I don't think it running in FP16 helps much, because that never involves any of the custom code anyway. Also, if you have to pass in the model type, then doesn't that mean that the auto model detection in comfy has failed, and that your checkpoint is in an invalid format? (Also, we can pass the model type for SD3 in the default node as well as a kwarg without having to have an extra node, since we can just add a switch based on the model arch returned from the loader.)
I was just trying to get it working myself. I added
("sd35", (
    ("joint_blocks.0.context_block.adaLN_modulation.1.bias",),
)),
to the model detection in convert.py so it would detect and label it. I also added "embedder." to the blacklist, since I think those are the layers for the embeddings in the new model :S
For the GGUF node, I added "sd35" around line 66, I believe, so the unet loader supports the new SD model label.
The lcpp patch I got to work, but I think I did it wrong, because I likely needed to add better layer exceptions for FP16/FP32; for my test I only added exceptions for final_layer as FP16 and final_layer.linear as FP32...
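For anyone following along, the detection-by-marker-key approach described above can be sketched like this. The table layout and function name are assumptions for illustration, not the actual convert.py internals:

```python
# Hypothetical sketch: each arch is identified by marker keys that only
# appear in that architecture's state dict (mirroring the tuple above).
MODEL_DETECTION = (
    ("sd35", (
        ("joint_blocks.0.context_block.adaLN_modulation.1.bias",),
    )),
    # ... other architectures would follow here ...
)

def detect_arch(state_dict_keys):
    # Return the first arch whose marker keys are all present.
    for arch, key_groups in MODEL_DETECTION:
        for keys in key_groups:
            if all(k in state_dict_keys for k in keys):
                return arch
    return None

keys = {"joint_blocks.0.context_block.adaLN_modulation.1.bias", "pos_embed"}
print(detect_arch(keys))  # sd35
```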
Here are some examples, all based on the example workflow from comfy, with euler:
Baseline safetensors (16.46 GB):
FP16 GGUF (16.29 GB):
Q8_0 GGUF (8.75 GB):
Q5_K_M GGUF (5.92 GB): something went wrong, it blew up to a black image, so I definitely broke something LOL
I agree that at F16 it doesn't help much, but that wasn't the goal. I'm not a coder, but I can generally figure things out and hack it together. Things work, but again, back to the llama.cpp issue.
@wolfden All good, happy to hear if anyone figures out anything lol.
@cchance27 Actually, it looks like you are on an older version of the convert script. The current master one does pick up SD3 correctly for the python part at least.
Well shit, I'll update to that version. Have you already figured out which layers need to be reserved as F16/F32 in the lcpp patch?
I have a rewritten version of the c++ logic part from a while ago that works with SD3 but I haven't figured out the issue with that pos_embed key being reshaped yet.
Well, I don't think the python convert + comfy loader code needs any extensive changes. Using this file as a base, it does pick up the format correctly and seems to work fine. I've added SD3 as a valid arch to the loader: https://github.com/city96/ComfyUI-GGUF/commit/0ef158d79348ba725a2ac76376d0273945386b43
I wonder how janky it would be to just patch that single key in post after the llama.cpp convert step for now just to have some weights for people to download lol.
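As a sketch of that post-patching idea, the stripped leading dim could also be restored on the Python side at load time for the known key. The `FIXUP_SHAPES` table and `fixup` helper are hypothetical; the shape comes from the checkpoint discussed in this thread:

```python
import numpy as np

# Known keys whose dims get stripped by the C++ conversion, mapped to
# the shape the model actually expects (assumed, per this thread).
FIXUP_SHAPES = {"pos_embed": (1, 36864, 2432)}

def fixup(name, arr):
    # Restore the expected shape for known-bad keys, pass others through.
    want = FIXUP_SHAPES.get(name)
    if want is not None and arr.shape != want:
        arr = arr.reshape(want)
    return arr

arr = np.zeros((36864, 2432), dtype=np.float16)
print(fixup("pos_embed", arr).shape)  # (1, 36864, 2432)
```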
Not sure why the git pull didn't pull the latest convert. Tested the new nodes.py and it's working fine here also.
I think just having GGUF FP16 weights could be better than safetensors.
There is an issue with the safetensors library on windows. ( https://github.com/comfyanonymous/ComfyUI/issues/4480#issuecomment-2325350291 )
It needs about double the RAM of the safetensors FP16 file size while loading. I have this issue with Flux: I only have 32 GB RAM, so it uses my page file. But this does not happen with the Flux GGUF versions.
See above, he confirmed the convert.py already works for FP16 GGUF on the latest version. It's the quantizing via the llama.cpp patch that's currently being looked into, as the pos_embed seems to be getting messed up when quantizing.
Some more info about the issue at hand, in case anyone has any better ideas.
The SAI reference checkpoint has a weight shaped [1, 36864, 2432]. In the GGUF file, the order is reversed, and the key is saved as [2432, 36864, 1] - all good so far. The issue is that when the C++ conversion runs, the weights are padded with 1s for the unused dimensions, so this particular one becomes [2432, 36864, 1, 1], and during saving it becomes [2432, 36864], with the first (last) dim stripped even if no conversion happens (i.e. it's kept as FP16).
In comfy, this just manifests as a shape mismatch while loading, as expected:
size mismatch for pos_embed: copying a param with shape torch.Size([36864, 2432]) from checkpoint, the shape in current model is torch.Size([1, 2432, 2432]).
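The stripping can be reproduced in a few lines of Python. This mirrors the trailing-dim check in llama.cpp's `ggml_n_dims` (a sketch, assuming GGUF stores dims in reverse order vs. PyTorch and pads to 4 dims):

```python
def n_dims(ne):
    # Mimic ggml_n_dims: highest index with size > 1 wins, minimum 1 dim.
    for i in range(len(ne) - 1, 0, -1):
        if ne[i] > 1:
            return i + 1
    return 1

torch_shape = (1, 36864, 2432)
ggml_ne = tuple(reversed(torch_shape)) + (1,)  # padded to GGML_MAX_DIMS=4
print(ggml_ne)          # (2432, 36864, 1, 1)
print(n_dims(ggml_ne))  # 2 -> saved as [2432, 36864], the leading 1 is lost
```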
@cchance27 How did you get your version to bypass that specific issue, assuming you even ran into it lol
Also, here's the current WIP patch, with the issue above still present, but with some logic for SD3 added in and the main parts rewritten: lcpp_new.patch. It's for the newest llama.cpp version, since I was curious if it'd solve the issue (it didn't).
I stupidly wiped the stash I had when I saw others were already working on it. The lcpp patch you shared is from a different version than the one I was playing with; it had the checks for leaving some things at FP32 and some at FP16. My first hack attempt that worked wasn't excluding the proper layers from being quantized, except for final_layer.
Is it possible there's code elsewhere in the original quantizer for pos_embed? Pretty sure that's a layer name used in some LLMs as well, isn't it?
Doesn't seem to be name related, good call though (I think the LLM-specific ones are called token_embd.weight).
Haha, here's something funny: that Q5 model I was having black screens with and erased because I figured I had messed it up might not have been messed up, because for some reason I just got a black render halfway through the steps on the standard safetensors generation :S
Odd, I wonder if it needs BF16 like Flux does for inference. The weights are in FP16 though, so that'd be odd. The VAE hash is different from SD3, but they seem close enough at least.
BTW, this is the part that's causing the dimension to be stripped: it just checks if a dim is >1 to decide whether to keep it, and due to the reversed order the leading dim gets stripped, I think.
(Can't manually set it because it gets called directly from gguf_add_tensor, I'm pretty sure.)
int ggml_n_dims(const struct ggml_tensor * tensor) {
for (int i = GGML_MAX_DIMS - 1; i >= 1; --i) {
if (tensor->ne[i] > 1) {
return i + 1;
}
}
return 1;
}
Hacky as hell but it works as a proof of concept lol. It's definitely the dimension detection logic. Setting it manually for just that key makes it work.
Edit: wrong code lol:
@city96 What tag of llama.cpp are you applying the new patch to? For some reason mine isn't lining up the line numbers correctly.
@cchance27 I pushed the patch + the actual tag I was targeting. Should be tags/b3962. Currently testing on SD3 medium lol, small models seem to have lower bitrate quants the same way LLMs do.
Apparently SD3.5 also uses a different shape for the keys, which is why K quants don't work. We could use the reshape logic that's also used for SDXL but that needs possible future logic changes.
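If I understand the k-quant layout right, a quick sanity check on why K quants would choke here: k-quants pack each row into super-blocks of 256 elements (QK_K in llama.cpp), so the row size has to be divisible by 256, and SD3.5 Large's 2432 inner dim is not:

```python
# QK_K is the k-quant super-block size in llama.cpp.
QK_K = 256

# 2432 is SD3.5 Large's inner dim (from the shapes in this thread);
# 2048 and 4096 are typical LLM dims for comparison.
for dim in (2432, 2048, 4096):
    print(dim, "k-quantizable:", dim % QK_K == 0)
```

2432 % 256 leaves 128, so those rows would need the kind of reshape/padding logic mentioned above before k-quants could apply.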
I tried a Q8_0 and a Q4_0, and with both, when it loads I get:
Traceback (most recent call last):
File "/Users/user/AI/ComfyUI/execution.py", line 323, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
File "/Users/user/AI/ComfyUI/execution.py", line 198, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
File "/Users/user/AI/ComfyUI/execution.py", line 169, in _map_node_over_list
process_inputs(input_dict, i)
File "/Users/user/AI/ComfyUI/execution.py", line 158, in process_inputs
results.append(getattr(obj, func)(**inputs))
File "/Users/user/AI/ComfyUI/custom_nodes/ComfyUI-GGUF/nodes.py", line 258, in load_unet
sd = gguf_sd_loader(unet_path)
File "/Users/user/AI/ComfyUI/custom_nodes/ComfyUI-GGUF/nodes.py", line 39, in gguf_sd_loader
reader = gguf.GGUFReader(path)
File "/Users/user/AI/ComfyUI/venv/lib/python3.10/site-packages/gguf/gguf_reader.py", line 130, in __init__
self._build_tensors(offs, tensors_fields)
File "/Users/user/AI/ComfyUI/venv/lib/python3.10/site-packages/gguf/gguf_reader.py", line 314, in _build_tensors
data = self._get(data_offs, item_type, item_count).reshape(np_dims),
ValueError: cannot reshape array of size 5914608 into shape (2432,2432)
Huh, strange. It works with sd3 medium, but with large it only works with the (old) gguf-py from the llama.cpp repo but not the package one.
I added a print line into that gguf-py where it's crashing, and it seems to be on y_embedder.mlp.2.weight [2432 2432] 5914624, which is odd because... that's an F16, so not even one of the Q8 layers?
It's also weird because it didn't crash on the t_embedder.mlp.2.weight it processed before that, which looks to be the same size and np_dims.
Part of the log I added in the gguf-py site-package:
GGMLQuantizationType.Q8_0 25137152 (2432, 10336) joint_blocks.9.x_block.mlp.fc2.weight
GGMLQuantizationType.F16 89653248 (1, 36864, 2432) pos_embed
GGMLQuantizationType.F16 2432 (2432,) t_embedder.mlp.0.bias
GGMLQuantizationType.F16 622592 (2432, 256) t_embedder.mlp.0.weight
GGMLQuantizationType.F16 2432 (2432,) t_embedder.mlp.2.bias
GGMLQuantizationType.F16 5914624 (2432, 2432) t_embedder.mlp.2.weight
GGMLQuantizationType.F16 2432 (2432,) x_embedder.proj.bias
GGMLQuantizationType.F16 155648 (2432, 16, 2, 2) x_embedder.proj.weight
GGMLQuantizationType.F16 2432 (2432,) y_embedder.mlp.0.bias
GGMLQuantizationType.F16 4980736 (2432, 2048) y_embedder.mlp.0.weight
GGMLQuantizationType.F16 2432 (2432,) y_embedder.mlp.2.bias
GGMLQuantizationType.F16 5914624 (2432, 2432) y_embedder.mlp.2.weight
ERROR:root:!!! Exception during processing !!! cannot reshape array of size 5914608 into shape (2432,2432)
Why would the y_embedder crash but not the t_embedder step?
Okay, I think I got it. I think modifying the metadata after the initial bit where it adds it might be corrupting the data parts, possibly due to the data length changing. That might be because I'm passing it a uint32_t, which probably isn't what it should be set to, and there's no boundary check there, i.e. it messes up the tensor right after it? Lemme try and fix that lol.
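For what it's worth, the numbers in the traceback line up with that theory. The reader came up 16 F16 elements (32 bytes) short of what the [2432, 2432] reshape needs, which fits a data offset drifting by a fixed amount:

```python
# Shape and size taken from the error and log above.
expected = 2432 * 2432   # elements the reshape needs
got = 5914608            # elements the reader actually produced

print(expected)               # 5914624
print(expected - got)         # 16 elements missing
print((expected - got) * 2)   # 32 bytes, at 2 bytes per F16 element
```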
@cchance27 Could you re-test with the new patch?
Looking good. Updated and rebuilt with the new patch, just did a Q8_0 and got past the loader, diffusing now (I'm on a Mac, it takes a bit to diffuse XD)
Gonna test all the major quants and then upload them to civit in a couple
Sounds good, I'll put some up on huggingface in that case and then we should be all covered I think.
Sounds good, thanks for all the help figuring it out. I'd never have tracked down that uint issue lol, it's been too long since I fought with CPP XD
I guess I'll look at the K quants separately, sometime on the weekend maybe; it also depends on how much community adoption it gets. I think a lot of people prefer the legacy _0 and _1 quants for speed alone lol.
Also, in case you're lazy like me:
name = "sd3.5_large"
path = "~/ComfyUI/models/unet"
# K quants left out for now: "Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q3_K_S", "Q2_K"
for q in ["Q8_0", "Q5_1", "Q5_0", "Q4_1", "Q4_0"]:
    print(f"./llama-quantize {path}/{name}-F16.gguf {path}/{name}-{q}.gguf {q}")
Here's the basic quants:
Got it working on 8gb vram! Thanks for the quantised models!
Thanks, legend! You’re my hero.
Thanks All! Awesome!
SD3.5 large turbo was just released!
https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo https://huggingface.co/stabilityai/stable-diffusion-3.5-large