Open · lollipopyu opened this issue 1 year ago
Same here; my error is exactly the same.
I am facing exactly the same error. It is probably caused by the prebuilt quantized kernel code not being compatible with the target GPU device. In my case it's Hopper/sm_90 with CUDA 12.
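A quick way to check whether this is an architecture mismatch is to print the card's compute capability; the prebuilt kernel binaries that cpm_kernels loads through cuModuleLoadData must contain an image for that architecture. A minimal sketch using only standard PyTorch APIs:

```python
# Minimal sketch: report the GPU architecture and CUDA build, so it can be
# compared against the architectures the prebuilt int4 kernels target.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
    print(f"PyTorch CUDA build: {torch.version.cuda}")
else:
    print("No CUDA device visible to PyTorch")
```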
It launches, but chat does not work and the console shows errors:
python .\web_demo.py
'gcc' is not recognized as an internal or external command, operable program or batch file.
Compile parallel cpu kernel gcc -O3 -fPIC -pthread -fopenmp -std=c99 C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels_parallel.c -shared -o C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels_parallel.so failed.
'gcc' is not recognized as an internal or external command, operable program or batch file.
Compile cpu kernel gcc -O3 -fPIC -std=c99 C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels.c -shared -o C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\THUDM\chatglm2-6b-int4\382cc704867dc2b78368576166799ace0f89d9ef\quantization_kernels.so failed.
D:\chatGLM\ChatGLM2-6B\web_demo.py:90: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
C:\ProgramData\anaconda3\envs\chatglm2\lib\site-packages\gradio\helpers.py:818: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.update(...)`.
  warnings.warn(
C:\ProgramData\anaconda3\envs\chatglm2\lib\site-packages\gradio\components\textbox.py:163: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.Textbox.update(...)`.
  warnings.warn(
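The two compile failures above just mean gcc is missing from PATH, so the CPU fallback kernels never get built; they are separate from the CUDA error reported in this issue. A quick check, assuming MinGW-w64 or TDM-GCC is the intended compiler on Windows:

```python
# Sketch: verify that a gcc the quantization code can invoke is on PATH.
import shutil

gcc_path = shutil.which("gcc")
if gcc_path:
    print("gcc found at:", gcc_path)
else:
    print("gcc not on PATH; install MinGW-w64 (or TDM-GCC) "
          "and add its bin\\ directory to PATH")
```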
Is there an existing issue for this?
Current Behavior
test.py
from transformers import AutoTokenizer, AutoModel

model = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
model = AutoModel.from_pretrained(model, trust_remote_code=True).half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
Error:
Traceback (most recent call last):
  File "btest.py", line 4, in <module>
    response, history = model.chat(tokenizer, "你好", history=[])
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 1028, in chat
    outputs = self.generate(**inputs, **gen_kwargs)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 932, in forward
    transformer_outputs = self.transformer(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 828, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 638, in forward
    layer_ret = layer(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 542, in forward
    attention_output, kv_cache = self.self_attention(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/modeling_chatglm.py", line 374, in forward
    mixed_x_layer = self.query_key_value(hidden_states)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/quantization.py", line 502, in forward
    output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/quantization.py", line 75, in forward
    weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
  File "/home/jzj/.cache/huggingface/modules/transformers_modules/model4/quantization.py", line 299, in extract_weight_to_half
    func(
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/kernels/base.py", line 48, in __call__
    func = self._prepare_func()
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/kernels/base.py", line 40, in _prepare_func
    self._module.get_module(), self._func_name
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/kernels/base.py", line 24, in get_module
    self._module[curr_device] = cuda.cuModuleLoadData(self._code)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 94, in wrapper
    return f(*args, **kwargs)
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/library/cuda.py", line 233, in cuModuleLoadData
    checkCUStatus(cuda.cuModuleLoadData(ctypes.byref(module), data))
  File "/home/jzj/miniconda3/envs/glm/lib/python3.8/site-packages/cpm_kernels/library/cuda.py", line 216, in checkCUStatus
    raise RuntimeError("CUDA Error: %s" % cuGetErrorString(error))
RuntimeError: CUDA Error: no kernel image is available for execution on the device
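This error means the CUDA driver found no compiled kernel image (cubin or PTX) matching the GPU's architecture among the binaries bundled with cpm_kernels. One way to isolate the problem, sketched below rather than offered as a fix: load the same int4 checkpoint on CPU with .float() instead of .half().cuda() (the CPU deployment path documented in the repo README). If CPU inference chats successfully, the failure is specific to the GPU kernel images, not to the checkpoint itself.

```python
# Isolation sketch: CPU-only load of the same checkpoint. If this chats
# successfully, the CUDA error above is a GPU kernel-image problem.
from transformers import AutoTokenizer, AutoModel

path = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# .float() keeps the model on CPU; the int4 CPU path needs gcc available.
model = AutoModel.from_pretrained(path, trust_remote_code=True).float()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```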
Expected Behavior
In test.py, THUDM/chatglm2-6b-int4 should run normally, just like THUDM/chatglm2-6b does.
Steps To Reproduce
This runs and chats normally:
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()
model = model.eval()
This loads normally, but chat fails: every chat attempt shows Error and raises the exception above.
path = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True).half().cuda()
model = model.eval()
Environment
Anything else?
Is this a compatibility problem with chatglm2-6b-int4?
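From the traceback, this looks less like an incompatibility of the int4 weights themselves than of the precompiled CUDA kernels with the local GPU or driver, so upgrading cpm_kernels is worth trying first. Another commonly suggested workaround, sketched here without any guarantee it applies to every GPU architecture: quantize locally from the fp16 checkpoint (the quantize(4) call is the usage shown in the ChatGLM2-6B README), assuming enough memory and disk for the fp16 download.

```python
# Hedged workaround sketch (assumes the fp16 checkpoint fits locally):
# quantize to 4-bit on this machine instead of loading the prebuilt
# int4 checkpoint.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).cuda()
model = model.eval()
```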