city96 / ComfyUI-GGUF

GGUF Quantization support for native ComfyUI models
Apache License 2.0
709 stars, 40 forks

support stable-diffusion.cpp gguf models for SD1/SDXL #53

Closed: al-swaiti closed this issue 1 week ago

al-swaiti commented 3 weeks ago

I made an SDXL GGUF model; unfortunately the node does not support it!

city96 commented 3 weeks ago

You most likely disabled all the key checks and converted the entire checkpoint including CLIP and VAE as well instead of just using the UNET. Extract just the SDXL UNET in the diffusers format and save it to safetensors if you really want to convert it, though you most likely won't gain any real benefit from it with a non-transformer model.

```python
import diffusers
from safetensors.torch import save_file

model = diffusers.UNet2DConditionModel.from_single_file("some_model_path.safetensors")
save_file(model.state_dict(), "some_unet_path.safetensors")
```

al-swaiti commented 2 weeks ago

Here, I already converted it to GGUF after separating the UNET. I used https://github.com/leejet/stable-diffusion.cpp/blob/master/docs/quantization_and_gguf.md to do the conversion, but the resulting model is not supported by the node. @city96

city96 commented 2 weeks ago

Ah yeah, they're using a different format from what we are for the conv2d stuff. We have some initial SDXL support going, but we're actually storing the original shape as a separate key. We'll have to look at how they're doing it and see if we can support it, though it's not a massive priority atm or anything. Reopening and changing the title.
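
For illustration, a minimal sketch of that scheme, with a plain dict standing in for the GGUF writer's metadata; the flatten-to-2D step is an assumption, but the `comfy.gguf.orig_shape.*` keys it produces are visible in the quantize log further down:

```python
import numpy as np

def store_tensor(key, tensor, metadata):
    # conv (4D) weights are reduced to 2D so row-based quantization applies;
    # the original shape is recorded under a separate metadata key
    if tensor.ndim > 2:
        metadata[f"comfy.gguf.orig_shape.{key}"] = list(tensor.shape)
        tensor = tensor.reshape(tensor.shape[0], -1)
    return tensor

metadata = {}
w = np.zeros((320, 4, 3, 3), dtype=np.float16)  # e.g. input_blocks.0.0.weight
w2d = store_tensor("input_blocks.0.0.weight", w, metadata)
print(w2d.shape, metadata)  # (320, 36) {'comfy.gguf.orig_shape.input_blocks.0.0.weight': [320, 4, 3, 3]}
```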

city96 commented 2 weeks ago

Pinging @blepping since he worked on our SDXL implementation here https://github.com/city96/ComfyUI-GGUF/pull/63 in case this is something he wants to look into.

blepping commented 2 weeks ago

> Pinging @blepping since he worked on our SDXL implementation here #63 in case this is something he wants to look into.

i actually looked at stable-diffusion.cpp and was all set to say "hey, let's use this for converting and skip the having to patch llama.cpp stuff" but it seemed like they did some stuff differently (including key names).

definitely would be good to be compatible though, maybe it's as simple as having a key conversion table. i'll take a closer look when i get a chance.
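
a sketch of what such a table could look like; the `model.diffusion_model.` prefix shows up in the patch quoted later in this thread, everything else here is a placeholder:

```python
# hypothetical key remap between the two GGUF layouts; if the tensor layout
# matches, stripping/renaming the prefix could be all that's needed
SDCPP_PREFIX = "model.diffusion_model."

def remap_keys(state_dict):
    return {
        (k[len(SDCPP_PREFIX):] if k.startswith(SDCPP_PREFIX) else k): v
        for k, v in state_dict.items()
    }
```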

blepping commented 2 weeks ago

progress in #80 - that implementation seems to work. more testing would be helpful.

theoretically this should work for Flux too (stable-diffusion.cpp claims to support it). i didn't test that; it may require adding more ops if they quantize layer types that ComfyUI-GGUF currently doesn't, which was the case with SD15 at least.
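
one way to check which layer types a given file actually quantizes is to dump the tensor types with the `gguf` python package (the path is a placeholder):

```python
from gguf import GGUFReader

reader = GGUFReader("some_model.gguf")
for tensor in reader.tensors:
    # tensor_type distinguishes F32/F16 tensors from quantized ones (Q4_0, Q8_0, ...)
    print(tensor.name, tensor.tensor_type)
```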

al-swaiti commented 2 weeks ago

https://github.com/user-attachments/assets/be6ef0b0-06a6-48eb-bf67-2edfe6c6548d

I extracted the UNET from my SDXL model, then applied convert.py to convert it to BF16; it works perfectly.

@city96 @blepping, I tried to quantize it using llama.cpp and I get this error:

```sh
/home/abdallah/Desktop/webui/ComfyUI/custom_nodes/ComfyUI-GGUF/tools/llama.cpp/build/bin/llama-quantize /home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/sdxl-lcm.gguf q4kss.gguf Q4_0
main: build = 3600 (2fb92678)
main: built with cc (GCC) 14.2.1 20240805 for x86_64-pc-linux-gnu
main: quantizing '/home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/sdxl-lcm.gguf' to 'q4kss.gguf' as Q4_0
llama_model_loader: loaded meta data with 135 key-value pairs and 1680 tensors from /home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/sdxl-lcm.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = sdxl
llama_model_loader: - kv   1: general.quantization_version u32 = 2
llama_model_loader: - kv   2: general.file_type u32 = 1
llama_model_loader: - kv   3: comfy.gguf.orig_shape.input_blocks.0.0.weight arr[i32,4] = [320, 4, 3, 3]
llama_model_loader: - kv   4: comfy.gguf.orig_shape.input_blocks.1.0.in_layers.2.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv   5: comfy.gguf.orig_shape.input_blocks.1.0.out_layers.3.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv   6: comfy.gguf.orig_shape.input_blocks.2.0.in_layers.2.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv   7: comfy.gguf.orig_shape.input_blocks.2.0.out_layers.3.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv   8: comfy.gguf.orig_shape.input_blocks.3.0.op.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv   9: comfy.gguf.orig_shape.input_blocks.4.0.in_layers.2.weight arr[i32,4] = [640, 320, 3, 3]
llama_model_loader: - kv  10: comfy.gguf.orig_shape.input_blocks.4.0.out_layers.3.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv  11: comfy.gguf.orig_shape.input_blocks.4.0.skip_connection.weight arr[i32,4] = [640, 320, 1, 1]
llama_model_loader: - kv  12: comfy.gguf.orig_shape.input_blocks.4.1.proj_in.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  13: comfy.gguf.orig_shape.input_blocks.4.1.proj_out.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  14: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.0.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  15: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.0.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  16: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.0.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  17: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.0.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  18: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.0.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  19: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.0.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  20: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.0.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv  21: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.1.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  22: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.1.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  23: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.1.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  24: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.1.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  25: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.1.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  26: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.1.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  27: comfy.gguf.orig_shape.input_blocks.4.1.transformer_blocks.1.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv  28: comfy.gguf.orig_shape.input_blocks.5.0.in_layers.2.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv  29: comfy.gguf.orig_shape.input_blocks.5.0.out_layers.3.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv  30: comfy.gguf.orig_shape.input_blocks.5.1.proj_in.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  31: comfy.gguf.orig_shape.input_blocks.5.1.proj_out.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  32: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.0.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  33: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.0.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  34: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.0.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  35: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.0.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  36: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.0.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  37: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.0.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  38: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.0.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv  39: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.1.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  40: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.1.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  41: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.1.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  42: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.1.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  43: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.1.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  44: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.1.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  45: comfy.gguf.orig_shape.input_blocks.5.1.transformer_blocks.1.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv  46: comfy.gguf.orig_shape.input_blocks.6.0.op.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv  47: comfy.gguf.orig_shape.input_blocks.7.0.in_layers.2.weight arr[i32,4] = [1280, 640, 3, 3]
llama_model_loader: - kv  48: comfy.gguf.orig_shape.input_blocks.7.0.out_layers.3.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  49: comfy.gguf.orig_shape.input_blocks.7.0.skip_connection.weight arr[i32,4] = [1280, 640, 1, 1]
llama_model_loader: - kv  50: comfy.gguf.orig_shape.input_blocks.8.0.in_layers.2.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  51: comfy.gguf.orig_shape.input_blocks.8.0.out_layers.3.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  52: comfy.gguf.orig_shape.middle_block.0.in_layers.2.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  53: comfy.gguf.orig_shape.middle_block.0.out_layers.3.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  54: comfy.gguf.orig_shape.middle_block.2.in_layers.2.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  55: comfy.gguf.orig_shape.middle_block.2.out_layers.3.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  56: comfy.gguf.orig_shape.out.2.weight arr[i32,4] = [4, 320, 3, 3]
llama_model_loader: - kv  57: comfy.gguf.orig_shape.output_blocks.0.0.in_layers.2.weight arr[i32,4] = [1280, 2560, 3, 3]
llama_model_loader: - kv  58: comfy.gguf.orig_shape.output_blocks.0.0.out_layers.3.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  59: comfy.gguf.orig_shape.output_blocks.0.0.skip_connection.weight arr[i32,4] = [1280, 2560, 1, 1]
llama_model_loader: - kv  60: comfy.gguf.orig_shape.output_blocks.1.0.in_layers.2.weight arr[i32,4] = [1280, 2560, 3, 3]
llama_model_loader: - kv  61: comfy.gguf.orig_shape.output_blocks.1.0.out_layers.3.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  62: comfy.gguf.orig_shape.output_blocks.1.0.skip_connection.weight arr[i32,4] = [1280, 2560, 1, 1]
llama_model_loader: - kv  63: comfy.gguf.orig_shape.output_blocks.2.0.in_layers.2.weight arr[i32,4] = [1280, 1920, 3, 3]
llama_model_loader: - kv  64: comfy.gguf.orig_shape.output_blocks.2.0.out_layers.3.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  65: comfy.gguf.orig_shape.output_blocks.2.0.skip_connection.weight arr[i32,4] = [1280, 1920, 1, 1]
llama_model_loader: - kv  66: comfy.gguf.orig_shape.output_blocks.2.2.conv.weight arr[i32,4] = [1280, 1280, 3, 3]
llama_model_loader: - kv  67: comfy.gguf.orig_shape.output_blocks.3.0.in_layers.2.weight arr[i32,4] = [640, 1920, 3, 3]
llama_model_loader: - kv  68: comfy.gguf.orig_shape.output_blocks.3.0.out_layers.3.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv  69: comfy.gguf.orig_shape.output_blocks.3.0.skip_connection.weight arr[i32,4] = [640, 1920, 1, 1]
llama_model_loader: - kv  70: comfy.gguf.orig_shape.output_blocks.3.1.proj_in.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  71: comfy.gguf.orig_shape.output_blocks.3.1.proj_out.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  72: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.0.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  73: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.0.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  74: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.0.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  75: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.0.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  76: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.0.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  77: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.0.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  78: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.0.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv  79: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.1.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  80: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.1.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  81: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.1.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  82: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.1.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  83: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.1.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  84: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.1.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  85: comfy.gguf.orig_shape.output_blocks.3.1.transformer_blocks.1.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv  86: comfy.gguf.orig_shape.output_blocks.4.0.in_layers.2.weight arr[i32,4] = [640, 1280, 3, 3]
llama_model_loader: - kv  87: comfy.gguf.orig_shape.output_blocks.4.0.out_layers.3.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv  88: comfy.gguf.orig_shape.output_blocks.4.0.skip_connection.weight arr[i32,4] = [640, 1280, 1, 1]
llama_model_loader: - kv  89: comfy.gguf.orig_shape.output_blocks.4.1.proj_in.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  90: comfy.gguf.orig_shape.output_blocks.4.1.proj_out.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  91: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.0.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  92: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.0.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  93: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.0.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  94: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.0.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  95: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.0.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  96: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.0.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  97: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.0.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv  98: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.1.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv  99: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.1.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 100: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.1.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 101: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.1.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 102: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.1.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 103: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.1.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 104: comfy.gguf.orig_shape.output_blocks.4.1.transformer_blocks.1.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv 105: comfy.gguf.orig_shape.output_blocks.5.0.in_layers.2.weight arr[i32,4] = [640, 960, 3, 3]
llama_model_loader: - kv 106: comfy.gguf.orig_shape.output_blocks.5.0.out_layers.3.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv 107: comfy.gguf.orig_shape.output_blocks.5.0.skip_connection.weight arr[i32,4] = [640, 960, 1, 1]
llama_model_loader: - kv 108: comfy.gguf.orig_shape.output_blocks.5.1.proj_in.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 109: comfy.gguf.orig_shape.output_blocks.5.1.proj_out.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 110: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.0.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 111: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.0.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 112: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.0.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 113: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.0.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 114: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.0.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 115: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.0.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 116: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.0.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv 117: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.1.attn1.to_k.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 118: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.1.attn1.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 119: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.1.attn1.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 120: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.1.attn1.to_v.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 121: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.1.attn2.to_out.0.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 122: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.1.attn2.to_q.weight arr[i32,2] = [640, 640]
llama_model_loader: - kv 123: comfy.gguf.orig_shape.output_blocks.5.1.transformer_blocks.1.ff.net.0.proj.weight arr[i32,2] = [5120, 640]
llama_model_loader: - kv 124: comfy.gguf.orig_shape.output_blocks.5.2.conv.weight arr[i32,4] = [640, 640, 3, 3]
llama_model_loader: - kv 125: comfy.gguf.orig_shape.output_blocks.6.0.in_layers.2.weight arr[i32,4] = [320, 960, 3, 3]
llama_model_loader: - kv 126: comfy.gguf.orig_shape.output_blocks.6.0.out_layers.3.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv 127: comfy.gguf.orig_shape.output_blocks.6.0.skip_connection.weight arr[i32,4] = [320, 960, 1, 1]
llama_model_loader: - kv 128: comfy.gguf.orig_shape.output_blocks.7.0.in_layers.2.weight arr[i32,4] = [320, 640, 3, 3]
llama_model_loader: - kv 129: comfy.gguf.orig_shape.output_blocks.7.0.out_layers.3.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv 130: comfy.gguf.orig_shape.output_blocks.7.0.skip_connection.weight arr[i32,4] = [320, 640, 1, 1]
llama_model_loader: - kv 131: comfy.gguf.orig_shape.output_blocks.8.0.in_layers.2.weight arr[i32,4] = [320, 640, 3, 3]
llama_model_loader: - kv 132: comfy.gguf.orig_shape.output_blocks.8.0.out_layers.3.weight arr[i32,4] = [320, 320, 3, 3]
llama_model_loader: - kv 133: comfy.gguf.orig_shape.output_blocks.8.0.skip_connection.weight arr[i32,4] = [320, 640, 1, 1]
llama_model_loader: - kv 134: comfy.gguf.orig_shape.time_embed.0.weight arr[i32,2] = [1280, 320]
llama_model_loader: - type f16: 1680 tensors
llama_model_quantize: failed to quantize: unknown model architecture: 'sdxl'
main: failed to quantize model from '/home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/sdxl-lcm.gguf'
```
al-swaiti commented 2 weeks ago

I converted the same UNET model successfully to Q8_0 using https://github.com/leejet/stable-diffusion.cpp/blob/master/docs/quantization_and_gguf.md, but the GGUF loader gives me this error: Error occurred when executing UnetLoaderGGUF:

```
'conv_in.weight'
  File "/home/abdallah/Desktop/webui/ComfyUI/execution.py", line 317, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/home/abdallah/Desktop/webui/ComfyUI/execution.py", line 192, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/home/abdallah/Desktop/webui/ComfyUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/home/abdallah/Desktop/webui/ComfyUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
  File "/home/abdallah/Desktop/webui/ComfyUI/custom_nodes/ComfyUI-GGUF/nodes.py", line 196, in load_unet
    model = comfy.sd.load_diffusion_model_state_dict(
  File "/home/abdallah/Desktop/webui/ComfyUI/comfy/sd.py", line 629, in load_diffusion_model_state_dict
    model_config = model_detection.model_config_from_diffusers_unet(sd)
  File "/home/abdallah/Desktop/webui/ComfyUI/comfy/model_detection.py", line 496, in model_config_from_diffusers_unet
    unet_config = unet_config_from_diffusers_unet(state_dict)
  File "/home/abdallah/Desktop/webui/ComfyUI/comfy/model_detection.py", line 372, in unet_config_from_diffusers_unet
    match["model_channels"] = state_dict["conv_in.weight"].shape[0]
```
al-swaiti commented 2 weeks ago

The result of using the model I quantized with your method (after the llama.cpp patch), run through https://github.com/leejet/stable-diffusion.cpp/blob/master/docs/flux.md; generation took about 8 minutes:

```sh
/home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/sd --diffusion-model /home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/Q5_K_M.gguf --vae /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors --clip_l /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors --t5xxl /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4
Option:
    n_threads: 6
    mode: txt2img
    model_path:
    wtype: unspecified
    clip_l_path: /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors
    t5xxl_path: /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors
    diffusion_model_path: /home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/Q5_K_M.gguf
    vae_path: /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio: 20.00
    normalize input image: false
    output_path: output.png
    init_img:
    control_image:
    clip on cpu: false
    controlnet cpu: false
    vae decoder on cpu: false
    strength(control): 0.90
    prompt: a lovely cat holding a sign says 'flux.cpp'
    negative_prompt:
    min_cfg: 1.00
    cfg_scale: 1.00
    guidance: 3.50
    clip_skip: -1
    width: 512
    height: 512
    sample_method: euler
    schedule: default
    sample_steps: 4
    strength(img2img): 0.75
    rng: cuda
    seed: 42
    batch_count: 1
    vae_tiling: false
    upscale_repeats: 1
System Info: BLAS = 0 SSE3 = 1 AVX = 1 AVX2 = 1 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 FMA = 1 NEON = 0 ARM_FMA = 0 F16C = 1 FP16_VA = 0 WASM_SIMD = 0 VSX = 0
[DEBUG] stable-diffusion.cpp:180 - Using CPU backend
[INFO ] stable-diffusion.cpp:202 - loading clip_l from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors'
[INFO ] model.cpp:793 - load /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors'
[INFO ] stable-diffusion.cpp:209 - loading t5xxl from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors'
[INFO ] model.cpp:793 - load /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors'
[INFO ] stable-diffusion.cpp:216 - loading diffusion model from '/home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/Q5_K_M.gguf'
[INFO ] model.cpp:790 - load /home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/Q5_K_M.gguf using gguf format
[DEBUG] model.cpp:807 - init from '/home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/Q5_K_M.gguf'
[INFO ] stable-diffusion.cpp:223 - loading vae from '/home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors'
[INFO ] model.cpp:793 - load /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from '/home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors'
[INFO ] stable-diffusion.cpp:235 - Version: Flux Schnell
[INFO ] stable-diffusion.cpp:266 - Weight type: f16
[INFO ] stable-diffusion.cpp:267 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:268 - Diffusion model weight type: q5_K
[INFO ] stable-diffusion.cpp:269 - VAE weight type: f32
[DEBUG] stable-diffusion.cpp:271 - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1046 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1046 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1046 - flux params backend buffer size = 7806.81 MB(RAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1046 - vae params backend buffer size = 94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:398 - loading weights
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors
[INFO ] model.cpp:1685 - unknown tensor 'text_encoders.t5xxl.encoder.embed_tokens.weight | f8_e4m3 | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/ComfyUI/models/diffusion_models/Q5_K_M.gguf
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors
[INFO ] stable-diffusion.cpp:482 - total params memory size = 17220.22MB (VRAM 0.00MB, RAM 17220.22MB): clip 9318.83MB(RAM), unet 7806.81MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:501 - loading model from '' completed, taking 129.95s
[INFO ] stable-diffusion.cpp:518 - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:572 - finished loaded file
[DEBUG] stable-diffusion.cpp:1378 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1127 - prompt after extract and remove lora: "a lovely cat holding a sign says 'flux.cpp'"
[INFO ] stable-diffusion.cpp:655 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1132 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1036 - parse 'a lovely cat holding a sign says 'flux.cpp'' to [['a lovely cat holding a sign says 'flux.cpp'', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 256
[DEBUG] ggml_extend.hpp:998 - t5 compute buffer size: 68.25 MB(RAM)
[DEBUG] conditioner.hpp:1155 - computing condition graph completed, taking 11132 ms
[INFO ] stable-diffusion.cpp:1256 - get_learned_condition completed, taking 11137 ms
[INFO ] stable-diffusion.cpp:1279 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1283 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:998 - flux compute buffer size: 397.27 MB(RAM)
|============>                                     | 1/4 - 76.66s/it
|==================================================| 4/4 - 99.05s/it
[INFO ] stable-diffusion.cpp:1315 - sampling completed, taking 368.69s
[INFO ] stable-diffusion.cpp:1323 - generating 1 latent images completed, taking 369.10s
[INFO ] stable-diffusion.cpp:1326 - decoding 1 latents
[DEBUG] ggml_extend.hpp:998 - vae compute buffer size: 1664.00 MB(RAM)
[DEBUG] stable-diffusion.cpp:987 - computing vae [mode: DECODE] graph completed, taking 19.89s
[INFO ] stable-diffusion.cpp:1336 - latent 1 decoded, taking 19.89s
[INFO ] stable-diffusion.cpp:1340 - decode_first_stage completed, taking 19.89s
[INFO ] stable-diffusion.cpp:1449 - txt2img completed in 400.13s
save result image to 'output.png'
```
al-swaiti commented 2 weeks ago

Same result using their way of quantization:

image

```sh
/home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/sd --diffusion-model /home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/flux1-dev-q8_0.gguf --vae /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors --clip_l /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors --t5xxl /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" -o ./x.png --cfg-scale 1.0 --sampling-method euler -v --steps 4
Option:
    n_threads: 6
    mode: txt2img
    model_path:
    wtype: unspecified
    clip_l_path: /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors
    t5xxl_path: /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors
    diffusion_model_path: /home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/flux1-dev-q8_0.gguf
    vae_path: /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio: 20.00
    normalize input image: false
    output_path: ./x.png
    init_img:
    control_image:
    clip on cpu: false
    controlnet cpu: false
    vae decoder on cpu: false
    strength(control): 0.90
    prompt: a lovely cat holding a sign says 'flux.cpp'
    negative_prompt:
    min_cfg: 1.00
    cfg_scale: 1.00
    guidance: 3.50
    clip_skip: -1
    width: 512
    height: 512
    sample_method: euler
    schedule: default
    sample_steps: 4
    strength(img2img): 0.75
    rng: cuda
    seed: 42
    batch_count: 1
    vae_tiling: false
    upscale_repeats: 1
System Info: BLAS = 0 SSE3 = 1 AVX = 1 AVX2 = 1 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 FMA = 1 NEON = 0 ARM_FMA = 0 F16C = 1 FP16_VA = 0 WASM_SIMD = 0 VSX = 0
[DEBUG] stable-diffusion.cpp:180 - Using CPU backend
[INFO ] stable-diffusion.cpp:202 - loading clip_l from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors'
[INFO ] model.cpp:793 - load /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors'
[INFO ] stable-diffusion.cpp:209 - loading t5xxl from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors'
[INFO ] model.cpp:793 - load /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from '/home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors'
[INFO ] stable-diffusion.cpp:216 - loading diffusion model from '/home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/flux1-dev-q8_0.gguf'
[INFO ] model.cpp:790 - load /home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/flux1-dev-q8_0.gguf using gguf format
[DEBUG] model.cpp:807 - init from '/home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/flux1-dev-q8_0.gguf'
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
[INFO ] stable-diffusion.cpp:223 - loading vae from '/home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors'
[INFO ] model.cpp:793 - load /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors using safetensors format
[DEBUG] model.cpp:861 - init from '/home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors'
[INFO ] stable-diffusion.cpp:235 - Version: Flux Schnell
[INFO ] stable-diffusion.cpp:266 - Weight type: f16
[INFO ] stable-diffusion.cpp:267 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:268 - Diffusion model weight type: q8_0
[INFO ] stable-diffusion.cpp:269 - VAE weight type: f32
[DEBUG] stable-diffusion.cpp:271 - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1046 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1046 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1046 - flux params backend buffer size = 12057.71 MB(RAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1046 - vae params backend buffer size = 94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:398 - loading weights
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/clip_l.safetensors
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/stable-diffusion-webui/models/text_encoder/fp8.safetensors
[INFO ] model.cpp:1685 - unknown tensor 'text_encoders.t5xxl.encoder.embed_tokens.weight | f8_e4m3 | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/stable-diffusion.cpp/build/bin/flux1-dev-q8_0.gguf
[DEBUG] model.cpp:1530 - loading tensors from /home/abdallah/Desktop/webui/ComfyUI/models/vae/flow.safetensors
[INFO ] stable-diffusion.cpp:482 - total params memory size = 21471.11MB (VRAM 0.00MB, RAM 21471.11MB): clip 9318.83MB(RAM), unet 12057.71MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:501 - loading model from '' completed, taking 46.82s
[INFO ] stable-diffusion.cpp:518 - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:572 - finished loaded file
[DEBUG] stable-diffusion.cpp:1378 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1127 - prompt after extract and remove lora: "a lovely cat holding a sign says 'flux.cpp'"
[INFO ] stable-diffusion.cpp:655 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1132 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1036 - parse 'a lovely cat holding a sign says 'flux.cpp'' to [['a lovely cat holding a sign says 'flux.cpp'', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 256
[DEBUG] ggml_extend.hpp:998 - t5 compute buffer size: 68.25 MB(RAM)
[DEBUG] conditioner.hpp:1155 - computing condition graph completed, taking 12783 ms
[INFO ] stable-diffusion.cpp:1256 - get_learned_condition completed, taking 12787 ms
[INFO ] stable-diffusion.cpp:1279 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1283 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:998 - flux compute buffer size: 397.27 MB(RAM)
|==================================================| 4/4 - 86.85s/it
[INFO ] stable-diffusion.cpp:1315 - sampling completed, taking 323.77s
[INFO ] stable-diffusion.cpp:1323 - generating 1 latent images completed, taking 324.37s
[INFO ] stable-diffusion.cpp:1326 - decoding 1 latents
[DEBUG] ggml_extend.hpp:998 - vae compute buffer size: 1664.00 MB(RAM)
[DEBUG] stable-diffusion.cpp:987 - computing vae [mode: DECODE] graph completed, taking 19.74s
[INFO ] stable-diffusion.cpp:1336 - latent 1 decoded, taking 19.74s
[INFO ] stable-diffusion.cpp:1340 - decode_first_stage completed, taking 19.74s
[INFO ] stable-diffusion.cpp:1449 - txt2img completed in 356.91s
save result image to './x.png'
```
al-swaiti commented 2 weeks ago

Comparison of Q8 quantization between their method and yours (both supported by the GGUF loader). 2 steps take 10 s on ComfyUI (using my special merge):

https://github.com/user-attachments/assets/4e31c42f-0d90-4dfa-9054-38ba3909d647 image

al-swaiti commented 2 weeks ago

The closest solution is to apply a patch to llama.cpp so it supports Stable Diffusion and other UNET-type models (ComfyUI is very fast). The long-term solution is to build an application that supports the GGUF types, mixing the stable-diffusion.cpp technique with the speed of ComfyUI.

city96 commented 2 weeks ago

We don't have bf16 dequantization kernels, which is the reason you're seeing those times.
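
For reference, bf16 is just the top half of an fp32, so the dequantization logic itself is a 16-bit shift; a minimal CPU-side sketch in numpy (the missing piece is doing this efficiently as a GPU kernel):

```python
import numpy as np

def dequantize_bf16(raw: bytes) -> np.ndarray:
    # each bf16 value is the upper 16 bits of an IEEE-754 fp32, so widening
    # to uint32 and shifting left by 16 restores the fp32 value exactly
    u16 = np.frombuffer(raw, dtype=np.uint16)
    return (u16.astype(np.uint32) << 16).view(np.float32)

sample = (np.array([1.0, -2.5], dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)
print(dequantize_bf16(sample.tobytes()))  # [ 1.  -2.5]
```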

Please post logs as a collapsible markdown (details) block to avoid spam in discussions/issues. I recommend you edit your posts above.
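
For example:

```markdown
<details>
<summary>llama-quantize log</summary>

long log contents here...

</details>
```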

al-swaiti commented 2 weeks ago

https://github.com/user-attachments/assets/17ac8866-fdc6-478b-8093-90a9fc8b508c

https://github.com/user-attachments/assets/b8e0fdce-1cb5-4104-b6d4-607673bad869

Success after patching and quantizing! I will try other versions now. @city96 @blepping

al-swaiti commented 2 weeks ago

A problem with SD3: image

blepping commented 2 weeks ago

> a problem with sd3

as far as i know, SD3 was never supported. so far there's only support for Flux, SD 1.5 and SDXL. i think SD3 is similar to Flux in terms of architecture so maybe it would be easy to add.

al-swaiti commented 2 weeks ago

Any line of code for the cpp patch? @blepping

al-swaiti commented 2 weeks ago

> a problem with sd3

I bypassed this with the change shown in the image. Now I have a GGUF BF16, but it's not supported by the cpp, so I tried to re-edit the patch (the last time I used C was 17 years ago) and reached this code with an error:

al-swaiti commented 2 weeks ago

The code (ignore the spelling and any forgotten replacements):

```diff
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 1d2a3540..b1a9ee96 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -230,7 +230,7 @@
 #define GGML_MAX_CONTEXTS 64
 #define GGML_MAX_SRC 10
 #ifndef GGML_MAX_NAME
-#define GGML_MAX_NAME 64
+#define GGML_MAX_NAME 128
 #endif
 #define GGML_MAX_OP_PARAMS 64
 #define GGML_DEFAULT_N_THREADS 4
diff --git a/src/llama.cpp b/src/llama.cpp
index 5ab65ea9..35580d9d 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -212,6 +212,10 @@ enum llm_arch {
     LLM_ARCH_JAIS,
     LLM_ARCH_NEMOTRON,
     LLM_ARCH_EXAONE,
+    LLM_ARCH_FLUX,
+    LLM_ARCH_SD1,
+    LLM_ARCH_SDXL,
+    LLM_ARCH_SD3,
     LLM_ARCH_UNKNOWN,
 };
@@ -259,6 +263,10 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
     { LLM_ARCH_JAIS,     "jais"     },
     { LLM_ARCH_NEMOTRON, "nemotron" },
     { LLM_ARCH_EXAONE,   "exaone"   },
+    { LLM_ARCH_FLUX,     "flux"     },
+    { LLM_ARCH_SD1,      "sd1"      },
+    { LLM_ARCH_SDXL,     "sdxl"     },
+    { LLM_ARCH_SD3,      "sd3"      },
     { LLM_ARCH_UNKNOWN,  "(unknown)" },
 };
@@ -1337,6 +1345,10 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES = {
         { LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
     },
 },
+    { LLM_ARCH_FLUX, {}},
+    { LLM_ARCH_SD1,  {}},
+    { LLM_ARCH_SDXL, {}},
+    { LLM_ARCH_SD3,  {}},
     { LLM_ARCH_UNKNOWN, {
@@ -4629,6 +4641,12 @@ static void llm_load_hparams(
     // get general kv
     ml.get_key(LLM_KV_GENERAL_NAME, model.name, false);
+    // Disable LLM metadata for image models
+    if (model.arch == LLM_ARCH_FLUX || model.arch == LLM_ARCH_SD1 || model.arch == LLM_ARCH_SDXL || model.arch == LLM_ARCH_SD3) {
+        model.ftype = ml.ftype;
+        return;
+    }
+
     // get hparams kv
     ml.get_key(LLM_KV_VOCAB_SIZE, hparams.n_vocab, false) || ml.get_arr_n(LLM_KV_TOKENIZER_LIST, hparams.n_vocab);
@@ -15827,11 +15845,163 @@ static void llama_tensor_dequantize_internal(
     workers.clear();
 }
+static ggml_type img_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
+    // Special function for quantizing image model tensors
+    const std::string name = ggml_get_name(tensor);
+    const llm_arch arch = qs.model.arch;
+
+    // Sanity check
+    if (
+        (name.find("model.diffusion_model.") != std::string::npos) ||
+        (name.find("first_stage_model.") != std::string::npos) ||
+        (name.find("single_transformer_blocks.") != std::string::npos)
+    ) {
+        throw std::runtime_error("Invalid input GGUF file. This is not a supported UNET model");
+    }
+
+    // Unsupported quant types - exclude all IQ quants for now
+    if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
+        ftype == LLAMA_FTYPE_MOSTLY_IQ2_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M  ||
+        ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S  ||
+        ftype == LLAMA_FTYPE_MOSTLY_IQ1_M   || ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL ||
+        ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS  || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S  ||
+        ftype == LLAMA_FTYPE_MOSTLY_IQ3_M   || ftype == LLAMA_FTYPE_MOSTLY_Q4_0_4_4 ||
+        ftype == LLAMA_FTYPE_MOSTLY_Q4_0_4_8 || ftype == LLAMA_FTYPE_MOSTLY_Q4_0_8_8) {
+        throw std::runtime_error("Invalid quantization type for image model (Not supported)");
+    }
+
+    if ( // Tensors to keep in FP32 precision
+        (arch == LLM_ARCH_FLUX) && (
+            (name.find("img_in.") != std::string::npos) ||
+            (name.find("time_in.in_layer.") != std::string::npos) ||
+            (name.find("vector_in.in_layer.") != std::string::npos) ||
+            (name.find("guidance_in.in_layer.") != std::string::npos) ||
+            (name.find("final_layer.linear.") != std::string::npos)
+        ) || (arch == LLM_ARCH_SD1 || arch == LLM_ARCH_SDXL || arch == LLM_ARCH_SD3) && (
+            (name.find("conv_in.") != std::string::npos) ||
+            (name.find("conv_out.") != std::string::npos) ||
+            (name == "input_blocks.0.0.weight") ||
+            (name == "out.2.weight")
+        )) {
+        new_type = GGML_TYPE_F32;
+    } else if ( // Tensors to keep in FP16 precision
+        (arch == LLM_ARCH_FLUX) && (
+            (name.find("txt_in.") != std::string::npos) ||
+            (name.find("time_in.") != std::string::npos) ||
+            (name.find("vector_in.") != std::string::npos) ||
+            (name.find("guidance_in.") != std::string::npos) ||
+            (name.find("final_layer.") != std::string::npos)
+        ) || (arch == LLM_ARCH_SD1 || arch == LLM_ARCH_SDXL || arch == LLM_ARCH_SD3) && (
+            (name.find("class_embedding.") != std::string::npos) ||
+            (name.find("time_embedding.") != std::string::npos) ||
+            (name.find("add_embedding.") != std::string::npos) ||
+            (name.find("time_embed.") != std::string::npos) ||
+            (name.find("label_emb.") != std::string::npos) ||
+            (name.find("proj_in.") != std::string::npos) ||
+            (name.find("proj_out.") != std::string::npos)
+            // (name.find("conv_shortcut.") != std::string::npos) // marginal improvement
+        )) {
+        new_type = GGML_TYPE_F16;
+    } else if ( // Rules for to_v attention
+        (name.find("attn_v.weight") != std::string::npos) ||
+        (name.find(".to_v.weight") != std::string::npos)
+    ) {
+        if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
+            new_type = GGML_TYPE_Q3_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
+            new_type = qs.i_attention_wv < 2 ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
+            new_type = GGML_TYPE_Q5_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
+            new_type = GGML_TYPE_Q6_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && qs.i_attention_wv < 4) {
+            new_type = GGML_TYPE_Q5_K;
+        }
+        ++qs.i_attention_wv;
+    } else if ( // Rules for fused qkv attention
+        (name.find("attn_qkv.weight") != std::string::npos) ||
+        (name.find("attn.qkv.weight") != std::string::npos)
+    ) {
+        if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
+            new_type = GGML_TYPE_Q4_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
+            new_type = GGML_TYPE_Q5_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
+            new_type = GGML_TYPE_Q6_K;
+        }
+    } else if ( // Rules for ffn
+        (name.find("ffn_down") != std::string::npos) ||
+        (name.find("DenseReluDense.wo") != std::string::npos)
+    ) {
+        // TODO: add back `layer_info` with some model specific logic + logic further down
+        if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
+            new_type = GGML_TYPE_Q4_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
+            new_type = GGML_TYPE_Q5_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S) {
+            new_type = GGML_TYPE_Q5_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
+            new_type = GGML_TYPE_Q6_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
+            new_type = GGML_TYPE_Q6_K;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
+            new_type = GGML_TYPE_Q4_1;
+        }
+        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_0) {
+            new_type = GGML_TYPE_Q5_1;
+        }
+        ++qs.i_ffn_down;
+    }
+
+    // Sanity check for row shape
+    bool convert_incompatible_tensor = false;
+    if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
+        new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K) {
+        int nx = tensor->ne[0];
+        int ny = tensor->ne[1];
+        if (nx % QK_K != 0) {
+            LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
+            convert_incompatible_tensor = true;
+        } else {
+            ++qs.n_k_quantized;
+        }
+    }
+    if (convert_incompatible_tensor) {
+        // TODO: Possibly reenable this in the future
+        // switch (new_type) {
+        //     case GGML_TYPE_Q2_K:
+        //     case GGML_TYPE_Q3_K:
+        //     case GGML_TYPE_Q4_K: new_type = GGML_TYPE_Q5_0; break;
+        //     case GGML_TYPE_Q5_K: new_type = GGML_TYPE_Q5_1; break;
+        //     case GGML_TYPE_Q6_K: new_type = GGML_TYPE_Q8_0; break;
+        //     default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
+        // }
+        new_type = GGML_TYPE_F16;
+        LLAMA_LOG_
```
al-swaiti commented 2 weeks ago

https://github.com/user-attachments/assets/7c5172f5-1a79-45b9-bd59-3f4842e9a99c

Why is SD3 important? I patched it before to create images in 4 steps. Half of the community is interested in video creation, and they almost all use SD 1.5; it would be amazing to use SD3 for that.

city96 commented 2 weeks ago

I have SD3 and Cascade support working, but I will have to add an exception to keep 4D tensors as actual 4D without reshaping, since there are very few of them and adding our key logic on top seems pointless in those cases (better to keep it more standard).
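
Roughly something like this on the conversion side (a sketch; the architecture names in the set are placeholders, not the project's actual list):

```python
KEEP_4D_ARCHES = {"sd3", "cascade"}  # assumed names, for illustration only

def prepare_tensor(arch, key, tensor, metadata):
    # most conv weights get flattened to 2D with their shape saved to metadata,
    # but architectures with only a handful of 4D tensors keep the real 4D
    # shape and skip the extra key entirely
    if tensor.ndim == 4 and arch not in KEEP_4D_ARCHES:
        metadata[f"comfy.gguf.orig_shape.{key}"] = list(tensor.shape)
        tensor = tensor.reshape(tensor.shape[0], -1)
    return tensor
```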

al-swaiti commented 2 weeks ago

There's a big list of models that need to be converted to GGUF: SVD, Kolors, ControlNet, IP-Adapter! This is like translating English into another language (GGUF).

city96 commented 2 weeks ago

> svd

Mediocre 3 GB model, probably pointless. (Also, pretty sure most of the VRAM requirements with video models are from inference, not the model weights).

> kolors

Not supported in ComfyUI natively

> controlnet

This one would make more sense; it depends on how it's applied internally.

> ipadapter

Relatively small, overhead makes it pointless

al-swaiti commented 2 weeks ago

It's calculated by the sum of all the models inside a workflow. Like this one, where I used more than one model (base + IP-Adapter + ControlNet): image

blepping commented 2 weeks ago

> It's calculated by the sum of all the models inside a workflow. Like this one

you don't have to use all GGUF models. it basically only makes sense to quantize large models, so quantizing (or in other words, using the GGUF format for) a controlnet is going to reduce quality without actually providing a benefit.
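
rough size arithmetic behind that (Q8_0 packs 32 weights plus one fp16 scale into 34 bytes, so ~8.5 bits/weight vs 16 for fp16; parameter counts here are approximate):

```python
for name, params_b in [("flux-dev", 12.0), ("sd1.5 unet", 0.86), ("sd1.5 controlnet", 0.36)]:
    fp16 = params_b * 2            # GB at 16 bits/weight
    q8   = params_b * 34 / 32      # GB at ~8.5 bits/weight (Q8_0)
    print(f"{name}: {fp16:.1f} GB fp16 -> {q8:.1f} GB Q8_0")
# the absolute savings only matter for the big model; for the small ones
# the saved space is tiny while the quality cost is the same
```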

al-swaiti commented 2 weeks ago

If I were you, I would look at GGUF more than Flux; Flux is a temporary model! From my experience using ControlNet and LoRAs, the quality is not affected that much, and from my experience with community feedback, they prefer smaller size over quality. If you measure the percentage of people who can't use LoRAs or ControlNet with the Flux model because of its size, it will be 99%, because of out-of-memory! One day I uploaded a 360 MB VAE model, and some users asked me to upload a 150 MB version!!! I think users think of MB the way they think of money! 😂

city96 commented 2 weeks ago

Absolutely not. We're not going to create that amount of massive overhead for other developers by allowing people to quantize literally anything.

Think about your VAE example for even a second. It's a 360MB static cost for VRAM due to the model weights. You've reduced it to 150MB. The runtime/inference VRAM cost didn't change; VAE decoding a 1920x1080 image will still use about 6.2 GB of VRAM on top of the weights.
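
Back-of-envelope for why decoding dwarfs the weights, assuming fp32 activations and an illustrative 128-channel final decoder stage:

```python
h, w = 1080, 1920
ch = 128                  # illustrative channel width of the full-resolution decoder blocks
one_map = h * w * ch * 4  # bytes for a single fp32 activation tensor at full resolution
print(one_map / 2**30)    # ~0.99 GiB; several such buffers are live at once during
                          # decode, which is how it reaches multiple GB regardless of
                          # how small the weight file is
```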

LoRAs also make no sense in our case since we're storing them in system RAM and loading them in weight by weight. The only thing quantizing them would do is slow down inference speed even further.

Flux may be temporary, but large transformer-based models are here to stay, and they will continue to be the main focus of this project.

al-swaiti commented 2 weeks ago

But the RAM holds that size, and it plays a critical role in this process. In ComfyUI they distribute the model processing between the CPU and VRAM, and big models need big RAM; I missed this! Because of out-of-memory (RAM)! Whatever, I'll go back to teaching physics in school, I haven't found any job in this domain... bad luck!

city96 commented 2 weeks ago

The RAM usage can still be cut in half because currently ComfyUI makes a copy of the quantized weights when a LoRA is applied even when there's no need for it, so a lot of that is just wasted atm. Ideally, a 300MB LoRA would take 300MB of system RAM and very little extra VRAM to use.
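
The ideal case, sketched at weight-use time (`dequantized` stands in for whatever the dequant step returns; this is illustrative, not the node's actual code path):

```python
import torch

def lora_patched_weight(dequantized: torch.Tensor,
                        lora_down: torch.Tensor,
                        lora_up: torch.Tensor,
                        alpha: float) -> torch.Tensor:
    # the quantized weight stays untouched in system RAM; only the small
    # low-rank (up @ down) product is added after dequantization, so no
    # full-size patched copy of the model has to be kept around
    return dequantized + alpha * (lora_up @ lora_down)
```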

blepping commented 1 week ago

should be able to close this now - support for stable-diffusion.cpp SD15, SDXL and Flux models was merged.

al-swaiti commented 1 week ago

๐Ÿ‘๐Ÿ‘thanks for support see u
