Open WainWong opened 1 month ago
Occurs when using glm-4-inference.json. The first time I ran it, I used 8-bit quantization and got the error shown above. After that, I got the same error when I switched to 4-bit quantization. After restarting ComfyUI, it works fine when I use 4-bit quantization directly.
Unfortunately the quantization value is only set during the initial loading in the "Model Loader" node and does not change afterwards, since the reloading happens in the "Inferencing" or "Prompt Enhancer" node. If you change the model, the new quantization value will be registered and the model will reload with the new config settings. Planning to fix this later, when I have a little more time.
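A minimal sketch of what such a fix could look like: cache the model keyed on both model name and quantization value so that changing the quant setting forces a reload. `build_model` here is a hypothetical stand-in for the loader node's actual loading call, not code from this repo:

```python
# Hypothetical sketch, not the node's actual implementation: cache the model
# keyed on (model_name, quantization) so a changed quant value triggers a reload.
_cache = {"key": None, "model": None}

def build_model(model_name, quantization):
    # Stand-in for whatever the "Model Loader" node really does.
    return f"{model_name} loaded with {quantization} quantization"

def get_model(model_name, quantization):
    key = (model_name, quantization)
    if _cache["key"] != key:  # model OR quantization changed -> reload
        _cache["model"] = build_model(model_name, quantization)
        _cache["key"] = key
    return _cache["model"]

# First call loads with 8-bit; switching to 4-bit now reloads
# instead of silently reusing the old 8-bit model.
print(get_model("glm-4v-9b", "8-bit"))
print(get_model("glm-4v-9b", "4-bit"))
```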
The 8-bit version of glm-4v-9b requires at least 16GB of VRAM, and the performance difference from 4-bit is negligible. That's why I changed the default quantization value for glm-4v-9b to 4-bit, which only needs around 11GB of VRAM.
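For context, a hedged sketch of what the two settings roughly correspond to when loading glm-4v-9b directly with transformers and bitsandbytes. The node's exact parameters may differ; the model id and compute dtype here are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "THUDM/glm-4v-9b"  # assumed upstream repo id

# 4-bit NF4: roughly 11GB of VRAM for glm-4v-9b.
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 8-bit: roughly 16GB of VRAM, with little quality gain over 4-bit here.
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_4bit,  # swap in bnb_8bit to compare
    device_map="auto",
    trust_remote_code=True,        # glm-4v-9b ships custom modeling code
)
```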
Also added support for GPTQ 4-bit and 3-bit versions of glm-4v-9b, which are really good. The 4-bit one differs very little from the original model and is my personal favorite right now; it's much faster to load and to run inference with.
Look for alexwww94/glm-4v-9b-gptq-4bit in "Model Loader" and try it out. It only takes around 8.5GB of disk space and VRAM.
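If you want to try that GPTQ checkpoint outside the node, a rough sketch of loading it directly with transformers follows. This assumes a GPTQ backend (e.g. the auto-gptq/optimum stack) is installed, and the exact arguments may differ from what the loader node uses:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "alexwww94/glm-4v-9b-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",     # weights are already 4-bit, ~8.5GB on disk and in VRAM
    trust_remote_code=True,
)
```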
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
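If the model genuinely does not fit in VRAM, the offload path that error points to looks roughly like the sketch below. Note that in current BitsAndBytesConfig the flag is spelled llm_int8_enable_fp32_cpu_offload; the model id and the "auto" device map here are assumptions:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep offloaded modules in fp32 on CPU
)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",           # assumed model id
    quantization_config=quant_config,
    device_map="auto",           # or a custom dict placing some modules on "cpu"
    trust_remote_code=True,
)
```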