MinusZoneAI / ComfyUI-CogVideoX-MZ

CogVideoX-5B 4-bit quantization model
GNU General Public License v3.0

Please upload the GGUF model to Hugging Face #7

Open wardensc2 opened 1 month ago

wardensc2 commented 1 month ago

Hi @minuszoneAI

Please upload the GGUF model to Hugging Face. The link from the China server is very slow to download.

Thank you

wailovet commented 1 month ago

https://huggingface.co/MinusZoneAI/ComfyUI-CogVideoX-MZ

wardensc2 commented 1 month ago

Thank you so much

wardensc2 commented 1 month ago

Hi @wailovet, can you upload a Q8 GGUF version?

Thanks in advance.

wailovet commented 1 month ago

I don't have a Q8 GGUF version. You can directly download the model file from https://huggingface.co/alibaba-pai/CogVideoX-Fun-5b-InP/tree/main/transformer and put it in the unet folder, then select the fp8_e4m3 type to get 8-bit quantized inference.
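
For readers wondering what the fp8_e4m3 option amounts to: conceptually, the transformer weights are stored in torch's float8_e4m3fn format and upcast back to a compute dtype when they are used. A minimal illustrative sketch (assuming PyTorch 2.1+ with float8 support; this is not the repo's actual loader code):

import torch

w = torch.randn(4096, 4096, dtype=torch.bfloat16)   # example weight tensor
w_fp8 = w.to(torch.float8_e4m3fn)                    # 1 byte per weight instead of 2
w_compute = w_fp8.to(torch.bfloat16)                 # upcast again right before the matmul
print(w.element_size(), w_fp8.element_size())        # 2 vs 1 bytes per element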

phr00t commented 1 month ago

How can we quantize it ourselves? I want a Q5 and maybe even a Q6, as I think Q4 is a bit more quantization than I want. I'm happy to generate it myself, but I haven't figured out how to convert the safetensors to GGUF. I've been trying to use the llama.cpp tools, but they say they can't open the safetensors (or recognize the config.json file).

wailovet commented 1 month ago

I only referred to GGUF's quantization method. After quantizing some layers, the result is re-saved as safetensors, so it is not strictly GGUF. The quantization method can be referenced here: https://github.com/Nexesenex/croco.cpp/blob/32d7ed1b6e6e2a9be4e9777b331373b198b3dac3/gguf-py/gguf/quants.py#L220
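
To make that workflow concrete, here is a rough sketch of the idea: load the original safetensors, block-quantize the large 2-D Linear weights with a GGUF-style function, and re-save the result as safetensors. The key filter and the quantize_fn argument are illustrative assumptions, not this repo's actual code; a concrete torch Q4_0 block quantizer is shared later in this thread:

from safetensors.torch import load_file, save_file

def quantize_checkpoint(in_path, out_path, quantize_fn, block_size=32):
    # quantize_fn: a GGUF-style block quantizer, e.g. the torch Q4_0 function shared below in this thread.
    state_dict = load_file(in_path)
    out = {}
    for name, tensor in state_dict.items():
        # Only quantize large 2-D (Linear) weights whose rows divide evenly into blocks;
        # norms, biases and embeddings keep their original precision.
        if tensor.ndim == 2 and name.endswith("weight") and tensor.shape[-1] % block_size == 0:
            out[name] = quantize_fn(tensor)   # packed uint8 blocks
        else:
            out[name] = tensor
    # The result is re-saved as plain safetensors: "GGUF-style" quantization, not an actual .gguf file.
    save_file(out, out_path, metadata={"quant_type": "Q4_0"})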

phr00t commented 1 month ago

I see, so that is the Python gguf module and its quants methods being used.

However, I want to make a Q5_K_M version of CogVideoX-Fun. I'm wondering what steps were used to make the Q4_0 GGUF files, so I can do the same for a Q5_K_M version.

EDIT: Looks like this was hardcoded to only support Q4_0, if I'm not mistaken...

wailovet commented 1 month ago

I tried to find the torch quantization code for Q5_K_M, but it seems that it doesn't exist. This may be beyond my ability.

realisticdreamer114514 commented 1 month ago

Can you make a Q4 GGUF for the 5B-I2V model?

phr00t commented 1 month ago

Looks like it is just referred to as "K" and not "K_M" in the source files. You want to use the "K" methods over the "_0" methods; the "K" methods are newer and considered better at the same size. I'd really love a Q6_K quant of the CogVideoX-Fun model, and here is the code for that:

https://github.com/Nexesenex/croco.cpp/blob/32d7ed1b6e6e2a9be4e9777b331373b198b3dac3/gguf-py/gguf/quants.py#L554

wailovet commented 1 month ago

Look, there is no quantize_blocks for Q6_K in that file. I suggest directly using the fp8 type. Even Q4 only quantizes part of the layers; compared with fp8, it reduces VRAM usage by about 30%. With Q6, the benefit may not be very noticeable.
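
Rough numbers to put that in context (assuming the standard GGUF block layouts): Q4_0 packs 32 weights into 18 bytes, about 4.5 bits per weight, versus 8 bits per weight for fp8, so a fully quantized layer is roughly 44% smaller than its fp8 version; with only part of the layers quantized, an overall saving of around 30% is consistent with that. Q6_K is about 6.56 bits per weight (256 weights in 210 bytes), which would only shave roughly 18% off the quantized layers compared with fp8.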

phr00t commented 1 month ago

You are correct. How about Q5_1 then? It has a quantize_blocks implementation, provides more precision than Q4_0, and is still marginally smaller than fp8:

https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py#L337

Also, if you don't care to do it, I'd be more than happy to do the quant myself if you shared the steps involved.

wailovet commented 1 month ago

You need to rewrite Q5_1's quantize_blocks in torch and save a new model file after quantizing the weights of the original model. In addition, you need to modify https://github.com/MinusZoneAI/ComfyUI-CogVideoX-MZ/blob/main/mz_gguf_loader.py#L19 so it recognizes the Q5_1 type and runs dequantize_blocks during inference.

This is quite cumbersome; at least, it is not as easy as it might seem.
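
For anyone attempting this, below is a sketch of what the inference-side Q5_1 dequantizer could look like in torch, following the Q5_1 block layout from gguf-py: per block of 32 weights, a 2-byte fp16 scale d, a 2-byte fp16 minimum m, 4 bytes holding the fifth bit of every value, and 16 bytes of packed low nibbles. This is illustrative, not code from this repo; the quantize side would mirror the Q4_0 torch snippet shared later in this thread, but using (max - min) / 31 as the scale and packing the extra bit plane. split_block_dims is the same helper as in that snippet.

import torch

def split_block_dims(blocks, *args):
    n_max = blocks.shape[1]
    dims = list(args) + [n_max - sum(args)]
    return torch.split(blocks, dims, dim=1)

def dequantize_blocks_Q5_1(blocks):
    # blocks: uint8 tensor of shape (n_blocks, 24)
    # layout per block: 2B fp16 scale | 2B fp16 min | 4B high bits | 16B packed low nibbles
    n_blocks = blocks.shape[0]
    d, m, qh, qs = split_block_dims(blocks, 2, 2, 4)

    d = d.view(torch.float16).to(torch.float32)   # (n_blocks, 1) per-block scale
    m = m.view(torch.float16).to(torch.float32)   # (n_blocks, 1) per-block minimum

    # Unpack the fifth bit of each of the 32 values (little-endian bit order, as in gguf-py).
    shifts = torch.arange(8, dtype=torch.uint8, device=blocks.device)
    qh = ((qh.unsqueeze(-1) >> shifts) & 1).reshape(n_blocks, 32)

    # Low 4 bits: the first 16 values sit in the low nibbles, the last 16 in the high nibbles.
    ql = torch.cat([qs & 0x0F, qs >> 4], dim=-1)

    q = (ql | (qh << 4)).to(torch.float32)        # 5-bit indices in [0, 31]
    return d * q + m                              # (n_blocks, 32) dequantized values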

phr00t commented 1 month ago

Do you have the script you used to rewrite the blocks into Q4_0 available somewhere?

wailovet commented 1 month ago

There is no such script. I implemented it quickly by rewriting my local ComfyUI code directly, and that messy code is mixed in with the local ComfyUI install; after all, at the very beginning I only intended it to be used once.

However, I can provide a code snippet rewritten in torch form:


import torch

# Q4_0 block layout: 32 weights per block, stored as 2 bytes (fp16 scale) + 16 bytes (packed 4-bit values)
GGML_QUANT_SIZES = {"Q4_0": (32, 18)}

def split_block_dims(blocks, *args):
    n_max = blocks.shape[1]
    dims = list(args) + [n_max - sum(args)]
    return torch.split(blocks, dims, dim=1)

def quant_shape_to_byte_shape(shape, qtype) -> tuple[int, ...]:
    block_size, type_size = GGML_QUANT_SIZES[qtype]
    if shape[-1] % block_size != 0:
        raise ValueError(
            f"Quantized tensor row size ({shape[-1]}) is not a multiple of {qtype} block size ({block_size})")
    return (*shape[:-1], shape[-1] // block_size * type_size)

def quantize_blocks_Q4_0(data):
    block_size, type_size = GGML_QUANT_SIZES["Q4_0"]

    original_shape = data.shape

    n_blocks = data.numel() // block_size
    data = data.reshape((n_blocks, block_size)).to(torch.float32)

    # Take the signed value with the largest magnitude in each block (as in the reference
    # GGUF Q4_0 code), so the extreme value maps exactly onto one end of the 4-bit range.
    imax = torch.argmax(torch.abs(data), dim=-1, keepdim=True)
    max_vals = torch.gather(data, -1, imax)

    d = max_vals / -8
    inv_d = torch.where(d == 0, torch.zeros_like(d), 1 / d)

    # Map each weight to a 4-bit index in [0, 15]; 8 is the zero point.
    qs = torch.trunc((data * inv_d) + 8.5).clamp(0, 15).to(torch.uint8)
    qs = qs.reshape((n_blocks, 2, block_size // 2))
    qs = qs[..., 0, :] | (qs[..., 1, :] << 4)  # pack two 4-bit values into each byte

    d = d.to(torch.float16).view(torch.uint8)  # per-block fp16 scale stored as 2 raw bytes

    out = torch.cat([d, qs], dim=-1)

    out = out.reshape(quant_shape_to_byte_shape(original_shape, qtype="Q4_0"))

    return out
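
For completeness, the inference-side counterpart of the snippet above would look roughly like this. It is an illustrative sketch of standard Q4_0 dequantization, not the actual code in mz_gguf_loader.py; it reuses split_block_dims from above and returns (n_blocks, 32) values, which the caller would reshape back to the original weight shape.

def dequantize_blocks_Q4_0(blocks):
    # blocks: uint8 tensor of shape (n_blocks, 18): 2B fp16 scale | 16B packed 4-bit values
    d, qs = split_block_dims(blocks, 2)

    d = d.view(torch.float16).to(torch.float32)            # (n_blocks, 1) per-block scale

    # Low nibbles hold the first 16 values of each block, high nibbles the last 16.
    qs = torch.cat([qs & 0x0F, qs >> 4], dim=-1).to(torch.float32)

    return d * (qs - 8)                                    # 8 is the Q4_0 zero point
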
kijai commented 1 month ago

Thank you for your work with quantizing the models! This is all new to me, but thanks to this discussion and the provided snippet I think I managed to quant the I2V model similarly, at least it works:

https://huggingface.co/Kijai/CogVideoX_GGUF/blob/main/CogVideoX_5b_I2V_GGUF_Q4_0.safetensors

realisticdreamer114514 commented 1 month ago

@kijai With my available VRAM (you've seen me asking in your own repo), how should I load the GGUF quants so they don't OOM? [screenshot of the node settings]

Is it technically possible to enable_sequential_cpu_offload (the main VRAM optimization for low VRAM) for GGUFs? The main advantage of GGUF is splitting inference memory between VRAM and CPU RAM, at least with llama.cpp, but I don't know how you and Illyasviel at Forge could implement that. If I assume the diffusers implementation takes as much VRAM as SAT, then Q4 should be 1/4 of that, and I could split it between 4.5GB VRAM and 2GB CPU RAM. (Remember to add that model to this node too.)

kijai commented 1 month ago

I believe MinusZone AI has done that in this repo, though I haven't tried it.

wailovet commented 1 month ago

I only call pipe.enable_sequential_cpu_offload() in the loader; I think it should be effective.

Most of the time, the OOM I encounter occurs in the VAE encode step.
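
For reference, in plain diffusers that option corresponds to something like the sketch below; the model id and dtype are only examples, and the VAE tiling/slicing calls are an extra suggestion for the VAE OOMs, not something this repo necessarily does:

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Keep the weights in CPU RAM and move each submodule to the GPU only for its forward pass.
pipe.enable_sequential_cpu_offload()

# Tiled/sliced VAE processing can help with the VAE OOMs mentioned above.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()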

realisticdreamer114514 commented 1 month ago

With the setting in the screenshot (main_device) it is much slower than using the transformer models, because it spills into shared memory (RAM). wailovet probably means VRAM minimization, but what I, and probably many others with 8-12GB cards, prefer is optimization the way WebUI Forge or llama.cpp does it: fitting the model to the available VRAM, or even letting us pick how much it consumes. I tried bringing this up in the original implementation repo and they implied that's not planned (by telling me to keep using the enable_sequential_cpu_offload option).