THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs

Can glm-4v-9b use INT4 for vLLM inference? #594

Open neblen opened 5 days ago

neblen commented 5 days ago

System Info

CUDA 12, vLLM 0.6.3

Who can help?

@sixsixcoder

Information

Reproduction

Using an A30 GPU for vLLM inference of glm-4v-9b, along the lines of the sketch below.
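A minimal reproduction sketch, modeled on the repository's vLLM vision demo; the image path, prompt, and sampling settings here are placeholders, not details taken from the report:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# The maintainer notes below that glm-4v-9b needs ~28 GB of CUDA memory
# in bf16, so this is expected to OOM on a 24 GB A30.
llm = LLM(
    model="THUDM/glm-4v-9b",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    enforce_eager=True,
)

image = Image.open("example.jpg").convert("RGB")  # placeholder image
outputs = llm.generate(
    {"prompt": "Describe this image.", "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```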

Expected behavior

Run glm-4v-9b with INT4 quantization for vLLM inference.

neblen commented 5 days ago

Running glm-4v-9b inference with vLLM on an A30 fails with insufficient CUDA memory (the A30 has 24 GB).

sixsixcoder commented 5 days ago

Currently, glm-4v-9b only supports the bf16 dtype, and inference takes about 28 GB of CUDA memory.
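Since INT4 under vLLM is not supported for this model per the answer above, one hedged, untested workaround is 4-bit loading through transformers with bitsandbytes instead of vLLM. This assumes the model's custom modeling code tolerates bitsandbytes quantization, which is not confirmed in this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization cuts weight memory roughly 4x vs bf16,
# at some quality and speed cost. Compatibility with glm-4v-9b's
# trust_remote_code model classes is an assumption.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
)
```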