wardensc2 opened 1 month ago
Hi @wailovet, can you upload a Q8 GGUF version?
Thanks in advance.
I don't have a Q8 GGUF version. You can download the model file directly from https://huggingface.co/alibaba-pai/CogVideoX-Fun-5b-InP/tree/main/transformer, put it in the unet folder, and select the fp8_e4m3 type to get 8-bit quantized inference.
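Roughly, the idea behind the fp8_e4m3 type is to store the weights as torch.float8_e4m3fn and upcast them to the compute dtype right before each matmul. The snippet below is only an illustration of that idea, not this node's actual code, and the helper names are made up:

```python
import torch

def to_fp8_e4m3(state_dict):
    """Cast floating-point weights to float8_e4m3fn to roughly halve their memory.

    Requires a PyTorch build with float8 support (>= 2.1).
    """
    out = {}
    for name, tensor in state_dict.items():
        out[name] = tensor.to(torch.float8_e4m3fn) if tensor.is_floating_point() else tensor
    return out

def fp8_linear(x, w_fp8, bias=None):
    """Upcast the fp8 weight to the activation dtype just before the matmul."""
    return torch.nn.functional.linear(x, w_fp8.to(x.dtype), bias)
```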
How can we quantize it ourselves? I want a Q5, maybe even a Q6, since Q4 is a bit more quantization than I'd like. I'm happy to generate it myself, but I haven't figured out how to convert the safetensors to GGUF. I've been trying to use the llama.cpp tools, but they say they can't open the safetensors (or recognize the config.json file).
I only borrowed the GGUF quantization method: some layers are quantized and then re-saved as safetensors, so it is not strictly GGUF. The quantization method can be found here: https://github.com/Nexesenex/croco.cpp/blob/32d7ed1b6e6e2a9be4e9777b331373b198b3dac3/gguf-py/gguf/quants.py#L220
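A rough sketch of that workflow (load the safetensors, quantize only the large linear weights, re-save) could look like this. The file names and the layer-selection rule are just assumptions, and quantize_blocks_Q4_0 is the torch port posted later in this thread:

```python
from safetensors.torch import load_file, save_file

# Assumed input/output file names; a matching loader also has to know which
# tensors were quantized (e.g. by name or metadata), which this sketch omits.
state_dict = load_file("diffusion_pytorch_model.safetensors")

quantized = {}
for name, tensor in state_dict.items():
    # Only the large 2-D linear weights are worth quantizing; norms, biases
    # and embeddings stay in their original precision.
    if name.endswith(".weight") and tensor.ndim == 2 and tensor.shape[-1] % 32 == 0:
        quantized[name] = quantize_blocks_Q4_0(tensor.float())
    else:
        quantized[name] = tensor

save_file(quantized, "cogvideox_fun_q4_0.safetensors")
```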
I see, so it's the Python gguf module and its quants methods that were used.
However, I want to make a Q5_K_M version of CogVideoX-Fun. I'm wondering what steps were used to make the Q4_0 GGUF files, so I can do the same for a Q5_K_M version.
EDIT: Looks like this was hardcoded to only support Q4_0, if I'm not mistaken...
I tried to find the torch quantization code for Q5_K_M, but it seems that it doesn't exist. This may be beyond my ability.
Can you make a Q4 GGUF for the 5B-I2V model?
Looks like it is just referred to as "K" and not "K_M" in the source files. You want to use the "K" methods over the "_0" methods; they're newer and considered better at the same size. I'd really love a Q6_K quant of the CogVideoX-Fun model, and here is the code for that:
Looking at it, there is no quantize_blocks for Q6_K there. I suggest just using the fp8 type. Even Q4 only quantizes part of the layers, and compared with fp8 it reduces VRAM usage by about 30%; with Q6 the effect may not be very noticeable.
You are correct. How about Q5_1 then? It has a quantize_blocks, provides more precision than Q4_0, while still being marginally smaller than fp8:
https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py#L337
Also, if you don't care to do it, I'd be more than happy to do the quant myself if you shared the steps involved.
You would need to rewrite Q5_1's quantize_blocks in torch and save a new model file after quantizing the weights of the original model. In addition, you would need to modify https://github.com/MinusZoneAI/ComfyUI-CogVideoX-MZ/blob/main/mz_gguf_loader.py#L19 so it recognizes the Q5_1 type and runs the matching dequantize_blocks during inference.
This is quite cumbersome; at least, it is not as easy as it might seem.
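For reference, here is a rough, untested torch sketch of Q5_1's quantize_blocks, translated from the numpy reference in gguf-py (each block of 32 values becomes 24 bytes: fp16 scale d, fp16 minimum m, 4 bytes of high bits, 16 bytes of packed low nibbles). This only covers the quantize side; the matching dequantize path in mz_gguf_loader.py would still have to be written.

```python
import torch

def quantize_blocks_Q5_1(data):
    block_size, type_size = 32, 24
    original_shape = data.shape
    n_blocks = data.numel() // block_size
    blocks = data.reshape(n_blocks, block_size).float()

    vmax = blocks.max(dim=-1, keepdim=True).values
    vmin = blocks.min(dim=-1, keepdim=True).values
    d = (vmax - vmin) / 31
    id = torch.where(d == 0, torch.zeros_like(d), 1.0 / d)

    # 5-bit quantized values in [0, 31]
    q = torch.trunc((blocks - vmin) * id + 0.5).clamp(0, 31).to(torch.uint8)

    # Low 4 bits: value j goes in the low nibble of byte j, value j+16 in the high nibble.
    qs = q.reshape(n_blocks, 2, block_size // 2)
    qs = (qs[:, 0, :] & 0x0F) | ((qs[:, 1, :] & 0x0F) << 4)

    # 5th bit of each of the 32 values, packed little-endian into 4 bytes.
    bits = ((q >> 4) & 1).to(torch.int32).reshape(n_blocks, 4, 8)
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128],
                           dtype=torch.int32, device=data.device)
    qh = (bits * weights).sum(dim=-1).to(torch.uint8)

    # Store fp16 scale and minimum as raw bytes, then high bits, then nibbles.
    d = d.to(torch.float16).view(torch.uint8)
    m = vmin.to(torch.float16).view(torch.uint8)

    out = torch.cat([d, m, qh, qs], dim=-1)
    return out.reshape(*original_shape[:-1], original_shape[-1] // block_size * type_size)
```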
Do you have the script you used to rewrite the blocks into Q4_0 available somewhere?
There is no such script. I hacked it together quickly by rewriting my local ComfyUI code directly, so the messy code is mixed in with my local ComfyUI; after all, I originally only meant to run it once.
However, I can provide the snippet rewritten in torch:
```python
import torch

# (block_size, type_size): Q4_0 packs 32 weights into a 2-byte fp16 scale
# plus 16 bytes of packed 4-bit values = 18 bytes per block.
GGML_QUANT_SIZES = {"Q4_0": (32, 18)}


def split_block_dims(blocks, *args):
    # Split the per-block byte layout, e.g. (2, rest) -> scale bytes, packed nibbles.
    n_max = blocks.shape[1]
    dims = list(args) + [n_max - sum(args)]
    return torch.split(blocks, dims, dim=1)


def quant_shape_to_byte_shape(shape, qtype) -> tuple[int, ...]:
    block_size, type_size = GGML_QUANT_SIZES[qtype]
    if shape[-1] % block_size != 0:
        raise ValueError(
            f"Quantized tensor row size ({shape[-1]}) is not a multiple of Q4_0 block size ({block_size})")
    return (*shape[:-1], shape[-1] // block_size * type_size)


def quantize_blocks_Q4_0(data):
    block_size, type_size = GGML_QUANT_SIZES["Q4_0"]
    original_shape = data.shape
    n_blocks = data.numel() // block_size
    data = data.reshape((n_blocks, block_size))

    # Per-block scale: map the largest magnitude to -8 so values land in [0, 15].
    max = torch.max(torch.abs(data), dim=-1, keepdim=True).values
    d = max / -8
    id = torch.where(d == 0, 0, 1 / d)

    qs = torch.trunc((data * id) + 8.5).to(torch.uint8).clamp(0, 15)

    # Pack two 4-bit values per byte: value j in the low nibble, value j+16 in the high nibble.
    qs = qs.reshape((n_blocks, 2, block_size // 2))
    qs = qs[..., 0, :] | (qs[..., 1, :] << 4)

    # Store the fp16 scale as raw bytes followed by the packed nibbles (18 bytes per block).
    d = d.to(torch.float16).view(torch.uint8)
    out = torch.cat([d, qs], dim=-1)
    out = out.reshape(quant_shape_to_byte_shape(original_shape, qtype="Q4_0"))
    return out
```
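For completeness, here is a sketch of the inverse, consistent with the packing above and reusing the helpers from the block above; it is what the loader's dequantize_blocks does conceptually and may differ in detail from the actual mz_gguf_loader.py implementation:

```python
def dequantize_blocks_Q4_0(data, dtype=torch.float16):
    # Inverse of the packing above: 18 bytes per block back to 32 values.
    block_size, type_size = GGML_QUANT_SIZES["Q4_0"]
    original_shape = data.shape
    blocks = data.reshape(-1, type_size)

    d, qs = split_block_dims(blocks, 2)   # 2 scale bytes, 16 packed bytes
    d = d.view(torch.float16).to(dtype)   # per-block fp16 scale, shape (n_blocks, 1)

    low = qs & 0x0F                       # low nibbles  -> elements 0..15
    high = qs >> 4                        # high nibbles -> elements 16..31
    q = torch.cat([low, high], dim=-1).to(torch.int8) - 8

    out = d * q.to(dtype)
    return out.reshape(*original_shape[:-1], original_shape[-1] // type_size * block_size)
```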
Thank you for your work on quantizing the models! This is all new to me, but thanks to this discussion and the provided snippet I think I managed to quant the I2V model the same way; at least it works:
https://huggingface.co/Kijai/CogVideoX_GGUF/blob/main/CogVideoX_5b_I2V_GGUF_Q4_0.safetensors
@kijai With my available VRAM (you've seen me asking in your own repo), how should I load GGUF quants so they don't OOM? Is it technically possible to use enable_sequential_cpu_offload (the main optimization for low VRAM) with GGUFs? The main advantage of GGUF, at least with llama.cpp, is splitting inference memory between VRAM and CPU RAM, but I don't know how you and Illyasviel at Forge could implement that. If I assume the diffusers implementation takes as much VRAM as SAT, then Q4 should be 1/4 of that, and I could split it between 4.5 GB of VRAM and 2 GB of CPU RAM. (Remember to add that model to this node too.)
I believe MinusZone AI has done that in this repo, I haven't tried that though.
I only call pipe.enable_sequential_cpu_offload() in the loader; I think it should be effective.
Most of the time, the OOM I encounter happens in the VAE encode.
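For anyone unfamiliar with that option, a minimal diffusers-style example is below. The pipeline class and model id are only illustrative (this node wires the call up inside its own loader, and the Fun variant uses its own pipeline), and VAE tiling/slicing can help with the VAE-encode OOMs mentioned above:

```python
import torch
from diffusers import CogVideoXPipeline

# Illustrative model id; the thread is about CogVideoX-Fun, which ships its own pipeline.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Streams each submodule to the GPU only while it is executing,
# trading speed for a much lower peak VRAM footprint.
pipe.enable_sequential_cpu_offload()

# Tiled/sliced VAE keeps the encode/decode passes from spiking VRAM.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
```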
With the setting in the image (main_device) it is much slower than using the transformer models, because it spills into shared memory (RAM). Wailovet probably means VRAM minimization, but what I and probably many others with 8-12 GB cards prefer is optimization the way WebUI Forge or llama.cpp does it: fitting the model to the available VRAM, or even letting us pick how much it consumes. I tried bringing this up in the original implementation repo, and they implied that's not planned (by telling me to keep using the enable_sequential_cpu_offload option).
Hi @minuszoneAI
Please upload the GGUF model to Hugging Face; the link from the China server is very slow to download.
Thank you