Open · lollipopyu opened this issue 1 year ago
Same here; my error is exactly the same.
I am facing exactly the same error. It is probably caused by the prebuilt quantized kernel code not being compatible with the target GPU device. In my case it's Hopper/sm_90 with CUDA 12.
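A quick way to check whether this is an architecture mismatch is to print the card's compute capability; the prebuilt kernel binaries that cpm_kernels loads through cuModuleLoadData must contain an image for that architecture. A minimal sketch using only standard PyTorch APIs:

```python
# Minimal sketch: report the GPU architecture and CUDA build, so it can be
# compared against the architectures the prebuilt int4 kernels target.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
    print(f"PyTorch CUDA build: {torch.version.cuda}")
else:
    print("No CUDA device visible to PyTorch")
```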
It launches, but chat does not work and the console shows errors:
python .\web_demo.py
'gcc' is not recognized as an internal or external command, operable program or batch file.
Compile parallel cpu kernel gcc -O3 -fPIC -pthread -fopenmp -std=c99 C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels_parallel.c -shared -o C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels_parallel.so failed.
'gcc' is not recognized as an internal or external command, operable program or batch file.
Compile cpu kernel gcc -O3 -fPIC -std=c99 C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels.c -shared -o C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels.so failed.
D:\chatGLM\ChatGLM2-6B\web_demo.py:90: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
C:\ProgramData\anaconda3\envs\chatglm2\lib\site-packages\gradio\helpers.py:818: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.update(...)`.
  warnings.warn(
C:\ProgramData\anaconda3\envs\chatglm2\lib\site-packages\gradio\components\textbox.py:163: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.Textbox.update(...)`.
  warnings.warn(
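The two compile failures above just mean gcc is missing from PATH, so the CPU fallback kernels never get built; they are separate from the CUDA error reported in this issue. A quick check, assuming MinGW-w64 or TDM-GCC is the intended compiler on Windows:

```python
# Sketch: verify that a gcc the quantization code can invoke is on PATH.
import shutil

gcc_path = shutil.which("gcc")
if gcc_path:
    print("gcc found at:", gcc_path)
else:
    print("gcc not on PATH; install MinGW-w64 (or TDM-GCC) "
          "and add its bin\\ directory to PATH")
```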
Is there an existing issue for this?
Current Behavior
test.py
from transformers import AutoTokenizer, AutoModel

model = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
model = AutoModel.from_pretrained(model, trust_remote_code=True).half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
Error:
Traceback (most recent call last):
  File "btest.py", line 4, in <module>
    response, history = model.chat(tokenizer, "你好", history=[])
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 1028, in chat
    outputs = self.generate(**inputs, **gen_kwargs)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 932, in forward
    transformer_outputs = self.transformer(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 828, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 638, in forward
    layer_ret = layer(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 542, in forward
    attention_output, kv_cache = self.self_attention(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 374, in forward
    mixed_x_layer = self.query_key_value(hidden_states)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/quantization.py", line 502, in forward
    output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/quantization.py", line 75, in forward
    weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/quantization.py", line 299, in extract_weight_to_half
    func(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/kernels/base.py", line 48, in __call__
    func = self._prepare_func()
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/kernels/base.py", line 40, in _prepare_func
    self._module.get_module(), self._func_name
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/kernels/base.py", line 24, in get_module
    self._module[curr_device] = cuda.cuModuleLoadData(self._code)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/library/cuda.py", line 233, in cuModuleLoadData
    checkCUStatus(cuda.cuModuleLoadData(ctypes.byref(module), data))
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/library/cuda.py", line 216, in checkCUStatus
    raise RuntimeError("CUDA Error: %s" % cuGetErrorString(error))
RuntimeError: CUDA Error: no kernel image is available for execution on the device
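This error means the CUDA driver found no compiled kernel image (cubin or PTX) matching the GPU's architecture among the binaries bundled with cpm_kernels. One way to isolate the problem, sketched below rather than offered as a fix: load the same int4 checkpoint on CPU with .float() instead of .half().cuda() (the CPU deployment path documented in the repo README). If CPU inference chats successfully, the failure is specific to the GPU kernel images, not to the checkpoint itself.

```python
# Isolation sketch: CPU-only load of the same checkpoint. If this chats
# successfully, the CUDA error above is a GPU kernel-image problem.
from transformers import AutoTokenizer, AutoModel

path = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# .float() keeps the model on CPU; the int4 CPU path needs gcc available.
model = AutoModel.from_pretrained(path, trust_remote_code=True).float()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```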
Expected Behavior
In test.py, THUDM/chatglm2-6b-int4 should run normally, just like THUDM/chatglm2-6b does.
Steps To Reproduce
This runs and chats normally:
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()
model = model.eval()
This loads normally, but chat fails: every chat attempt shows Error and raises the exception above.
path = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True).half().cuda()
model = model.eval()
Environment
Anything else?
Is this a compatibility problem with chatglm2-6b-int4?
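From the traceback, this looks less like an incompatibility of the int4 weights themselves than of the precompiled CUDA kernels with the local GPU or driver, so upgrading cpm_kernels is worth trying first. Another commonly suggested workaround, sketched here without any guarantee it applies to every GPU architecture: quantize locally from the fp16 checkpoint (the quantize(4) call is the usage shown in the ChatGLM2-6B README), assuming enough memory and disk for the fp16 download.

```python
# Hedged workaround sketch (assumes the fp16 checkpoint fits locally):
# quantize to 4-bit on this machine instead of loading the prebuilt
# int4 checkpoint.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).cuda()
model = model.eval()
```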