THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs

Can glm-4v-9b use INT4 for vLLM inference? #594

Open neblen opened 5 days ago

neblen commented 5 days ago

System Info

CUDA 12, vLLM 0.6.3

Who can help?

@sixsixcoder

Information

Reproduction

Using an A30 GPU for vLLM inference of glm-4v-9b, along the lines of the sketch below.
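A minimal reproduction sketch, modeled on the repository's vLLM vision demo; the image path, prompt, and sampling settings here are placeholders, not details taken from the report:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# The maintainer notes below that glm-4v-9b needs ~28 GB of CUDA memory
# in bf16, so this is expected to OOM on a 24 GB A30.
llm = LLM(
    model="THUDM/glm-4v-9b",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    enforce_eager=True,
)

image = Image.open("example.jpg").convert("RGB")  # placeholder image
outputs = llm.generate(
    {"prompt": "Describe this image.", "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```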

Expected behavior

Run glm-4v-9b with INT4 quantization for vLLM inference.

neblen commented 5 days ago

Running glm-4v-9b inference with vLLM on an A30 fails with insufficient CUDA memory (the A30 has 24 GB).

sixsixcoder commented 5 days ago

Currently, glm-4v-9b only supports the bf16 dtype, and inference takes about 28 GB of CUDA memory.
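Since INT4 under vLLM is not supported for this model per the answer above, one hedged, untested workaround is 4-bit loading through transformers with bitsandbytes instead of vLLM. This assumes the model's custom modeling code tolerates bitsandbytes quantization, which is not confirmed in this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization cuts weight memory roughly 4x vs bf16,
# at some quality and speed cost. Compatibility with glm-4v-9b's
# trust_remote_code model classes is an assumption.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
)
```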