Open neblen opened 5 days ago
Using an A30 for vLLM inference of glm-4v-9b reports insufficient CUDA memory
Currently glm-4v-9b only supports the bf16 dtype, and inference takes about 28 GB of CUDA memory.
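A rough back-of-envelope check (my own assumptions, not from the thread: ~9e9 weights, 2 bytes per weight in bf16, 0.5 bytes under int4) shows why the A30's 24 GB is not enough — the bf16 weights alone are close to 17 GiB before KV cache and activations, consistent with the ~28 GB figure above:

```python
# Rough CUDA-memory estimate for glm-4v-9b on an A30 (24 GB card).
# Assumptions (not from the thread): ~9e9 weights, 2 bytes each in bf16,
# 0.5 bytes each under int4 quantization; KV cache/activations excluded.
PARAMS = 9e9
BYTES_PER_WEIGHT = {"bf16": 2.0, "int4": 0.5}

def weights_gib(dtype: str) -> float:
    """GiB taken by the model weights alone for the given dtype."""
    return PARAMS * BYTES_PER_WEIGHT[dtype] / 1024**3

print(f"bf16 weights ~ {weights_gib('bf16'):.1f} GiB")  # ~16.8 GiB
print(f"int4 weights ~ {weights_gib('int4'):.1f} GiB")  # ~4.2 GiB
```

This is also why an int4 build would be attractive: the weight footprint drops to roughly a quarter, leaving headroom for the KV cache on a 24 GB card.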
System Info
CUDA 12, vLLM 0.6.3
Who can help?
@sixsixcoder
Information
Reproduction
Using an A30 for vLLM inference of glm-4v-9b.
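The issue does not include the exact launch command; a typical vLLM 0.6.x invocation for this model would look something like the sketch below (the model id `THUDM/glm-4v-9b` and the context length are my assumptions):

```shell
# Hypothetical reproduction on an A30; exact flags from the reporter are unknown.
vllm serve THUDM/glm-4v-9b \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 8192
```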
Expected behavior
Support int4 quantization for glm-4v-9b so it can run vLLM inference on a 24 GB GPU.
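For reference, vLLM can serve GPTQ/AWQ-quantized checkpoints via the `--quantization` flag; whether a working int4 checkpoint of glm-4v-9b exists is not confirmed in this thread, so the model id below is a placeholder:

```shell
# Hypothetical: serving an int4 (GPTQ) checkpoint, if one were published.
# <org>/glm-4v-9b-gptq-int4 is a placeholder, not a real model id.
vllm serve <org>/glm-4v-9b-gptq-int4 \
    --trust-remote-code \
    --quantization gptq
```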