city96 / ComfyUI-GGUF

GGUF Quantization support for native ComfyUI models
Apache License 2.0

How to convert a finetuned flux model to GGUF? Maybe to Q2 if possible #11

Closed — Meshwa428 closed this issue 1 month ago

Meshwa428 commented 1 month ago

I recently got my hands on an H100 VM for 10 days and tried to finetune flux on it, and I got pretty good results. I want to run it on my tiny GPU with only 4GB of VRAM, and I don't want to use CPU offloading because I want the best speeds. That's why I'm asking about Q2 conversion. Do you have any custom scripts, or did you use llama.cpp's convert_hf_to_gguf script?

If so, please guide me on how to convert it to at least Q4_0 GGUF.

city96 commented 1 month ago

The convert script is in tools, but you will have to somehow get the checkpoint into the original reference format instead of the diffusers format. Not sure if there's a conversion script, since comfy does it on the fly.

As for Q2 see the comment: https://github.com/city96/ComfyUI-GGUF/issues/15#issuecomment-2292163218 - K quants will need a lot of dev effort and I'm not sure how imatrix would work to make under 3 bpw quants feasible.

compilade commented 1 month ago

K quants will need a lot of dev effort

Yes, and Numpy quantization would be extremely slow for k-quants too (around 10x slower than the C implementations). gguf-py doesn't provide PyTorch implementations because dependencies of that library are kept minimal, and torch is relatively big.

BTW @city96 it's very nice to see a good use of the new (de)quantization support in gguf-py which I did not think of. It turns out it's useful for a lot of things! I hope it wasn't too hard to port to torch :)
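
For reference, here's a minimal sketch of reading a GGUF file and dequantizing its tensors with gguf-py's NumPy implementation (assuming a recent gguf-py that ships the quants module; the file path is a placeholder):

    from gguf import GGUFReader
    from gguf.quants import dequantize

    # open a quantized file and dequantize each tensor back to float32 on the CPU
    reader = GGUFReader("flux1-dev-Q8_0.gguf")  # placeholder path
    for tensor in reader.tensors:
        # tensor.data holds the raw quantized blocks, tensor.tensor_type the GGML quant type
        weights = dequantize(tensor.data, tensor.tensor_type)
        print(tensor.name, weights.shape, weights.dtype)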

city96 commented 1 month ago

@compilade The man himself. Didn't expect this to get around this fast, didn't even have time to fix the weight loading logic yet lol.

And yeah, I imagine K quants in numpy wouldn't be all that great, although at least image models are a lot smaller and there are a lot fewer of them, so it might be passable.

PyTorch dequant performance is surprisingly passable when running on the GPU lol, and porting it was simple enough. There are some differences between forcing FP32 for those calculations (seems to be a VRAM leak there atm) vs. doing them directly in FP16 or BF16 (possible precision loss), though.

compilade commented 1 month ago

@city96

PyTorch dequant performance is surprisingly passable when running on the GPU lol, and porting it was simple enough.

That's good to know.

There are some differences between forcing FP32 for those calculations (seems to be a VRAM leak there atm) vs. doing them directly in FP16 or BF16 (possible precision loss), though.

You can do it all in FP16 or BF16 if you want. The dequantization functions in gguf-py output FP32 to allow verifying the exactness of their result compared to their reference C implementation, but if it's for inference it doesn't have to be bit-identical (especially if the upstream models are not themselves in FP32 and/or if the GPU supports 16-bit floats).

city96 commented 1 month ago

I specifically meant parts such as this:

    # Q8_0-style block: a per-block FP16 scale followed by the int8 quantized values
    d = blocks[:, :2].view(torch.float16).to(dtype)  # per-block scale
    x = blocks[:, 2:].view(torch.int8)               # quantized values
    return (d * x)

In the original, both are cast to FP32 before doing the multiplication, while with PyTorch it's possible to just keep d in FP16 after the view and let x be promoted automatically in the multiply. I guess the precision benefit of keeping them in FP32 probably isn't worth as much as the speed improvement from being able to use FP16 for the mult.
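
For illustration, here's a minimal sketch of the two precision paths being discussed, with hypothetical helper names (this is not the exact code in the repo):

    import torch

    def dequant_fp32_path(blocks: torch.Tensor, dtype=torch.float16) -> torch.Tensor:
        # reference-style: upcast the scale and values to FP32 before multiplying,
        # then cast the result back down to the target dtype
        d = blocks[:, :2].view(torch.float16).to(torch.float32)
        x = blocks[:, 2:].view(torch.int8).to(torch.float32)
        return (d * x).to(dtype)

    def dequant_fp16_path(blocks: torch.Tensor, dtype=torch.float16) -> torch.Tensor:
        # faster path: keep the scale in 16-bit and let the int8 values be
        # promoted to the scale's dtype during the multiply (possible precision loss)
        d = blocks[:, :2].view(torch.float16).to(dtype)
        x = blocks[:, 2:].view(torch.int8)
        return d * x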

kakachiex2 commented 1 month ago

It would be a good idea to implement a ComfyUI node for quantization, so that a user who wants to quantize their own custom model only has to load the model and queue it for quantization.

city96 commented 1 month ago

@kakachiex2 I think that is not feasible for a few reasons, namely _K quants needing C++ code to quantize properly while also taking way longer. It also brings up the issue of people creating invalid gguf files in all kinds of formats for models that don't even benefit from this method (i.e. SD1.x / SDXL).

@compilade I got _K quants working for the most part, but when testing I noticed something. Key names longer than 64 characters can be added via the Python code no problem, but the C++ code throws an error about this. Any idea if there's a hard limit of 64 on the tensor names, or if it's just a varchar overflowing somewhere or sth?

    main: build = 3600 (2fb92678)
    main: built with MSVC 19.36.32535.0 for x64
    main: quantizing 'E:\models\unet\EruTest_unet_F16.gguf' to 'E:\models\unet\ggml-model-Q4_K_M.gguf' as Q4_K_M
    llama_model_quantize: failed to quantize: tensor 'down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weig' data is not within the file bounds, model is corrupted or incomplete
    main: failed to quantize model from 'E:\models\unet\EruTest_unet_F16.gguf'

^ Key is cut off at 64 chars exactly. The F16 gguf file is perfectly readable with the python library/online metadata tools as far as I can tell. (This doesn't affect flux, only the test model I was using).

Meshwa428 commented 1 month ago

Are there any possible solutions?

Like increasing the string size?

compilade commented 1 month ago

@city96

Found the cause: the name field of ggml_tensor is limited to GGML_MAX_NAME and it is 64:

https://github.com/ggerganov/llama.cpp/blob/2fb9267887d24a431892ce4dccc75c7095b0d54d/ggml/include/ggml.h#L613

https://github.com/ggerganov/llama.cpp/blob/2fb9267887d24a431892ce4dccc75c7095b0d54d/ggml/include/ggml.h#L233

So either shorten the tensor names, or build llama.cpp with -DGGML_MAX_NAME=128 to make that buffer bigger.
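
For anyone hitting the same error, here's a rough sketch of how you could scan a GGUF file for over-long tensor names with gguf-py before quantizing (the file path is a placeholder, and the check assumes the name buffer must also hold a trailing null byte):

    from gguf import GGUFReader

    GGML_MAX_NAME = 64  # default name buffer size in ggml.h

    reader = GGUFReader("EruTest_unet_F16.gguf")  # placeholder path
    for tensor in reader.tensors:
        # names that don't fit in the 64-byte buffer (including the null terminator)
        # will be truncated by the C++ side and break quantization
        if len(tensor.name) >= GGML_MAX_NAME:
            print(f"{len(tensor.name)} chars: {tensor.name}")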

Meshwa428 commented 1 month ago

@city96 can you please tell me how you got _K quants to work?

The above fix by @compilade seems doable, so I might test it out for you 🙂

city96 commented 1 month ago

@compilade Thank you, that makes testing a lot simpler.

@Meshwa428 The key issue doesn't affect flux, only the model I was testing on. I have made a stripped-down version of llama.cpp with the LLM-specific parts removed to quantize K quants, and I'm currently testing what mix of keys makes sense. Because to_q/to_k/to_v is fused into a single qkv key, the default method of keeping those in different precisions may not work, so I think a homogeneous Q5_K instead of separate Q5_K_M / Q5_K_S makes more sense for now. I also need to test the size/effect of the layers below and what the best quant for them would be (not all apply to flux); see the illustrative sketch after the list.

"time_embedding.",
"add_embedding.",
"time_in.",
"txt_in.",
"vector_in.",
"img_in.",
"guidance_in.",
"final_layer.",
Meshwa428 commented 1 month ago

@city96 Okay, that makes sense now lol, I was thinking of flux as being the same as the previous diffusion models. 🤣🤣🤦🏼

But I just saw that it's a DiT: a diffusion transformer.

Looking forward to your results. 😊

Thank you

city96 commented 1 month ago

Okay, the code to quantize them is a mess so I didn't push that but the K quant support is merged. Q3_K_S being coherent without an imatrix is surprising lol. Even Q2_K seems to be usable for the most part.

Flux1-dev-Q3_K_S

Meshwa428 commented 1 month ago

Hey, I just checked out your Hugging Face page; the Q3_K_S model is not there.

Have you released it or was it just for testing?

Really excited to run flux in q3 quants

city96 commented 1 month ago

It's there for dev, schnell is still converting.

city96 commented 1 month ago

Uploaded. Here's a simple comparison of K quants on 4 steps schnell.

Meshwa428 commented 1 month ago

Okay, so I just tried to run the models on CPU, but the resulting images are completely black.

Why?

Is there any precision issue or something?

This is just the same as running FP16 models on unsupported CPUs.

RandomGitUser321 commented 1 month ago

Uploaded. Here's a simple comparison of K quants on 4 steps schnell.

That's really wild that Q3_K_S and Q4_K_S are pretty much identical, considering it's ~1.5GB smaller.

city96 commented 1 month ago

@Meshwa428 I have uploaded the code patch + instructions for creating K quants. All the logic for which keys to keep has been moved to the C++ side, so convert.py only creates the base FP16 or BF16 model. Please read the updated readme in the tools folder for more info. As for black images on CPU - I've not tested CPU inference since I imagine it would take ages with the current simple pytorch dequant kernels.

@RandomGitUser321 I just realized they're 1:1 the same image. I was a bit tired when I made that lol.

Meshwa428 commented 1 month ago

🤣 yeah it took me 30 mins for 1 image