Open WainWong opened 1 month ago
Occurs when using glm-4-inference.json. The first time I ran it, I used 8-bit quantization and got the error shown above. After that, I got the same error when I switched to 4-bit quantization. After restarting ComfyUI, it works fine when I use 4-bit quantization directly.
Unfortunately the quantization value is only set during the initial loading in the "Model Loader" node and does not change afterwards, since the reloading happens in the "Inferencing" or "Prompt Enhancer" node. If you change the model, the new quantization value will be registered and the model will reload with the new config settings. Planning to fix this later, when I have a little more time.
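A minimal sketch of what such a fix could look like: cache the model keyed on both model name and quantization value so that changing the quant setting forces a reload. `build_model` here is a hypothetical stand-in for the loader node's actual loading call, not code from this repo:

```python
# Hypothetical sketch, not the node's actual implementation: cache the model
# keyed on (model_name, quantization) so a changed quant value triggers a reload.
_cache = {"key": None, "model": None}

def build_model(model_name, quantization):
    # Stand-in for whatever the "Model Loader" node really does.
    return f"{model_name} loaded with {quantization} quantization"

def get_model(model_name, quantization):
    key = (model_name, quantization)
    if _cache["key"] != key:  # model OR quantization changed -> reload
        _cache["model"] = build_model(model_name, quantization)
        _cache["key"] = key
    return _cache["model"]

# First call loads with 8-bit; switching to 4-bit now reloads
# instead of silently reusing the old 8-bit model.
print(get_model("glm-4v-9b", "8-bit"))
print(get_model("glm-4v-9b", "4-bit"))
```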
The 8-bit version of glm-4v-9b requires at least 16GB of VRAM, and the performance difference from 4-bit is negligible. That's why I changed the default quantization value for glm-4v-9b to 4-bit, which only needs around 11GB of VRAM.
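For context, a hedged sketch of what the two settings roughly correspond to when loading glm-4v-9b directly with transformers and bitsandbytes. The node's exact parameters may differ; the model id and compute dtype here are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "THUDM/glm-4v-9b"  # assumed upstream repo id

# 4-bit NF4: roughly 11GB of VRAM for glm-4v-9b.
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 8-bit: roughly 16GB of VRAM, with little quality gain over 4-bit here.
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_4bit,  # swap in bnb_8bit to compare
    device_map="auto",
    trust_remote_code=True,        # glm-4v-9b ships custom modeling code
)
```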
Also added support for GPTQ 4-bit and 3-bit versions of glm-4v-9b, which are really good. The 4-bit one differs very little from the original model and is my personal favorite right now; it's much faster to load and to run inference with.
Look for alexwww94/glm-4v-9b-gptq-4bit in "Model Loader" and try it out. It only takes around 8.5GB of disk space and VRAM.
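If you want to try that GPTQ checkpoint outside the node, a rough sketch of loading it directly with transformers follows. This assumes a GPTQ backend (e.g. the auto-gptq/optimum stack) is installed, and the exact arguments may differ from what the loader node uses:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "alexwww94/glm-4v-9b-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",     # weights are already 4-bit, ~8.5GB on disk and in VRAM
    trust_remote_code=True,
)
```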
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
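If the model genuinely does not fit in VRAM, the offload path that error points to looks roughly like the sketch below. Note that in current BitsAndBytesConfig the flag is spelled llm_int8_enable_fp32_cpu_offload; the model id and the "auto" device map here are assumptions:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep offloaded modules in fp32 on CPU
)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",           # assumed model id
    quantization_config=quant_config,
    device_map="auto",           # or a custom dict placing some modules on "cpu"
    trust_remote_code=True,
)
```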