THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs | 开源双语对话语言模型
Apache License 2.0

Quantized loading of chatglm3 fails with: "round_vml_cpu" not implemented for Half #1217

Closed: imempty closed this issue 4 months ago

imempty commented 5 months ago

System Info / 系統信息

Who can help? / 谁可以帮助到您?

Information / 问题信息

Reproduction / 复现过程

1. When loading chatglm3, GPU memory was insufficient, so I tried loading with quantization, using the following statement:
   `model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()`
2. Error log:


```
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 18
     14 tokenizer = AutoTokenizer.from_pretrained("./chatglm3/", trust_remote_code=True)
     15 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).half().cuda()
     16 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).half().cuda()
     17 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).float().cuda()
---> 18 model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()
     19 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).cuda().quantize(4)
     20 model = model.eval()

File ~/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py:1212, in ChatGLMForConditionalGeneration.quantize(self, bits, empty_init, device, **kwargs)
   1208 self.quantized = True
   1210 self.config.quantization_bit = bits
-> 1212 self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
   1213                                     **kwargs)
   1214 return self

File ~/.cache/huggingface/modules/transformers_modules/quantization.py:155, in quantize(model, weight_bit_width, empty_init, device)
    153 """Replace fp16 linear with quantized linear"""
    154 for layer in model.layers:
--> 155     layer.self_attention.query_key_value = QuantizedLinear(
    156         weight_bit_width=weight_bit_width,
    157         #weight=layer.self_attention.query_key_value.weight,
    158         weight=layer.self_attention.query_key_value.weight.to(torch.cuda.current_device()),
    159         bias=layer.self_attention.query_key_value.bias,
    160         dtype=layer.self_attention.query_key_value.weight.dtype,
    161         device=layer.self_attention.query_key_value.weight.device if device is None else device,
    162         empty_init=empty_init
    163     )
    164     layer.self_attention.dense = QuantizedLinear(
    165         weight_bit_width=weight_bit_width,
    166         # weight=layer.self_attention.dense.weight,
   (...)
    171         empty_init=empty_init
    172     )
    173     layer.mlp.dense_h_to_4h = QuantizedLinear(
    174         weight_bit_width=weight_bit_width,
    175         # weight=layer.mlp.dense_h_to_4h.weight,
   (...)
    180         empty_init=empty_init
    181     )

File ~/.cache/huggingface/modules/transformers_modules/quantization.py:137, in QuantizedLinear.__init__(self, weight_bit_width, weight, bias, device, dtype, empty_init)
    135 self.weight_scale = weight.abs().max(dim=-1).values / ((2 ** (weight_bit_width - 1)) - 1)
    136 # self.weight = torch.round(weight / self.weight_scale[:, None]).to(torch.int8)
--> 137 self.weight = torch.round(weight.cpu() / self.weight_scale.cpu()[:, None]).cpu()
    138 if weight_bit_width == 4:
    139     self.weight = compress_int4_weight(self.weight)

RuntimeError: "round_vml_cpu" not implemented for 'Half'
```

3. I searched Google and Baidu but found no workable solution.

Expected behavior / 期待表现

The model loads with low-precision quantization in a normal amount of time.

zRzRzRzRzRzRzR commented 5 months ago

Online quantization can't run on the CPU; it needs the (CUDA) quantization kernels. Take a look at how the latest code loads the model; both the HF and GitHub copies need updating.

imempty commented 5 months ago

> Online quantization can't run on the CPU; it needs the (CUDA) quantization kernels. Take a look at how the latest code loads the model; both the HF and GitHub copies need updating.

Can this be fixed just by updating the relevant Python packages? I used exactly the loading code from the official example: https://github.com/THUDM/ChatGLM3?tab=readme-ov-file#%E6%A8%A1%E5%9E%8B%E9%87%8F%E5%8C%96

zRzRzRzRzRzRzR commented 5 months ago

Quantization uses CUDA kernels.
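For reference, a minimal repro sketch of the underlying limitation, assuming an older PyTorch build (as in this issue) where the CPU round kernel has no Half implementation; on recent PyTorch the CPU call succeeds instead of raising:

```python
# Sketch: torch.round on an fp16 CPU tensor fails on older PyTorch builds,
# which is exactly the error raised inside quantization.py above.
import torch

w = torch.randn(4, 4).half()      # fp16 tensor on the CPU
try:
    torch.round(w)                # RuntimeError: "round_vml_cpu" not implemented for 'Half'
except RuntimeError as e:
    print(e)

print(torch.round(w.float()))     # rounding in fp32 works on the CPU

if torch.cuda.is_available():
    print(torch.round(w.cuda()))  # the CUDA kernel does support Half
```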

mingyue0094 commented 4 months ago

Using the old version of quantization.py fixes it: https://huggingface.co/THUDM/chatglm3-6b/discussions/47#663cb3c8a4d7c8c9038c5312

The prerequisite is that CPU RAM is large enough. Alternatively, load an already-quantized model in the first place.
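For illustration, a runnable sketch of the per-channel quantization step that fails, with an fp32 cast that avoids the Half rounding error; it mirrors the commented-out old line (136) in the traceback and is not the project's official fix:

```python
# Sketch of the symmetric per-channel quantization done in QuantizedLinear.__init__,
# rounding in fp32 so the CPU "round_vml_cpu" Half limitation is never hit.
import torch

weight = torch.randn(8, 16).half()  # stand-in for an fp16 layer weight on the CPU
weight_bit_width = 4

# per-output-channel scale, as in quantization.py line 135
weight_scale = weight.abs().max(dim=-1).values / ((2 ** (weight_bit_width - 1)) - 1)

# broken on older CPU builds: torch.round(weight / weight_scale[:, None])
# works everywhere: cast to fp32 first, then store the result as int8
quantized = torch.round(weight.float() / weight_scale.float()[:, None]).to(torch.int8)
print(quantized.dtype, quantized.shape)  # torch.int8 torch.Size([8, 16])
```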

mingyue0094 commented 4 months ago

If GPU memory is enough to hold the full model: use the latest code. Load the full model with `model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()`; 4-bit quantization is done on the GPU, and you then run the quantized model on the GPU.

If CPU RAM is enough to hold the full model: swap in the old version of quantization.py. Load the full model with the same statement; 4-bit quantization is done on the CPU, the model is then moved to the GPU, and you run the quantized model on the GPU.

If neither CPU RAM nor GPU memory is enough to hold the full model, then you can't use `model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()`. In that case, your only option is to load an already-quantized model.
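For that last scenario, a minimal sketch of producing and reusing a pre-quantized checkpoint, assuming access to one machine that can hold the full model; the `./chatglm3-int4/` path is hypothetical, and calling `save_pretrained` on the quantized model is an assumption rather than a documented ChatGLM3 workflow:

```python
# Sketch: quantize once on a machine with enough memory, save the quantized
# checkpoint, then load it directly on the memory-constrained machine.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./chatglm3/", trust_remote_code=True)
model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4)
model.save_pretrained("./chatglm3-int4/")      # hypothetical output path
tokenizer.save_pretrained("./chatglm3-int4/")

# later, on the constrained machine: no full-precision load, no re-quantization
model = AutoModel.from_pretrained("./chatglm3-int4/", trust_remote_code=True).cuda().eval()
```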