THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model
Apache License 2.0

[BUG/Help] AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf' #1354

Closed: wst-ki closed this issue 1 year ago

wst-ki commented 1 year ago

Is there an existing issue for this?

Current Behavior

Running web_demo.py fails with the error below. Parameters:

```python
from transformers import AutoTokenizer, AutoModel

# raw strings so the Windows backslashes are not read as escape sequences
tokenizer = AutoTokenizer.from_pretrained(r"E:\pycharm\ChatGLM2-6B\model\chatglm2-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained(r"E:\pycharm\ChatGLM2-6B\model\chatglm2-6b-int4", trust_remote_code=True).half().cuda()
model = model.quantize(bits=4, kernel_file=r"E:\pycharm\ChatGLM2-6B\model\chatglm2-6b-int4\quantization_kernels.so")
```

quantization_kernels.so was compiled by hand, following https://github.com/THUDM/ChatGLM-6B/issues/166.
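To rule out a broken build of that library, here is a minimal loadability check (a sketch; the path is just my local layout, and as far as I can tell quantization.py loads a custom kernel_file through ctypes in essentially this way):

```python
# Minimal loadability check for the hand-compiled kernels (a sketch; the
# path below is specific to my machine). ctypes raises OSError here if the
# shared library or its runtime dependencies cannot be loaded.
import ctypes

lib = ctypes.cdll.LoadLibrary(
    r"E:\pycharm\ChatGLM2-6B\model\chatglm2-6b-int4\quantization_kernels.so"
)
print(lib)
```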

The previous-generation chatglm-6b-int4 seemed to hit the same error during quantization, so I also consulted:

https://github.com/THUDM/ChatGLM-6B/issues/214

https://github.com/THUDM/ChatGLM-6B/issues/162

Console output:

```
Failed to load cpm_kernels:name 'CPUKernel' is not defined
Welcome to the ChatGLM2-6B model. Enter text to start a conversation; clear clears the dialogue history, stop terminates the program.

User: a

ChatGLM: Traceback (most recent call last):
  File "E:\pycharm\ChatGLM2-6B\cli_demo.py", line 62, in <module>
    main()
  File "E:\pycharm\ChatGLM2-6B\cli_demo.py", line 49, in main
    for response, history, past_key_values in model.stream_chat(tokenizer, query, history=history,
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 1058, in stream_chat
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 1143, in stream_generate
    outputs = self(
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 932, in forward
    transformer_outputs = self.transformer(
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 828, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 638, in forward
    layer_ret = layer(
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 542, in forward
    attention_output, kv_cache = self.self_attention(
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\modeling_chatglm.py", line 374, in forward
    mixed_x_layer = self.query_key_value(hidden_states)
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\quantization.py", line 502, in forward
    output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "E:\pycharm\ChatGLM2-6B\jieshiqi\lib\site-packages\torch\autograd\function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\quantization.py", line 75, in forward
    weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
  File "C:\Users\14363/.cache\huggingface\modules\transformers_modules\chatglm2-6b-int4\quantization.py", line 287, in extract_weight_to_half
    func = kernels.int4WeightExtractionHalf
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'

Process finished with exit code 1
```
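The warning at the top of the log looks like the real culprit: as far as I can tell, the bundled quantization.py builds its CUDA kernel object inside a try/except around the cpm_kernels import, and on failure sets the module-level `kernels` to None, which is exactly what the final frame (`func = kernels.int4WeightExtractionHalf`) trips over. A quick check along those lines (a sketch, assuming cpm_kernels is installed):

```python
# Roughly what the bundled quantization.py does at import time: if this
# import raises, its module-level `kernels` stays None and the int4 CUDA
# path is unavailable, producing the AttributeError above.
try:
    from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
    print("cpm_kernels imported OK")
except Exception as e:
    print("Failed to load cpm_kernels:", e)  # here: name 'CPUKernel' is not defined
```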

Expected Behavior

The quantized model runs normally. (The unquantized model is just about usable, but its generation speed is painfully slow.)

Steps To Reproduce

The full traceback is above. I suspect the error comes from the changes made to quantization; the modification was similar to this one: https://github.com/THUDM/ChatGLM-6B/issues/166#issuecomment-1484705952
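For reference, my reading of that thread is that the manual fix is to compile the kernels with an OpenMP-capable gcc, along the lines of `gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels.c -shared -o quantization_kernels.so`, and pass the result via kernel_file; the `name 'CPUKernel' is not defined` failure is commonly reported when cpm_kernels cannot find a working compiler on Windows.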

Environment

- OS: Windows 10
- Python: 3.10
- Transformers: 4.30.2
- PyTorch: 2.0.1+
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
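A small snippet (a sketch) that reproduces the version report above in one go:

```python
# Prints the environment details listed above.
import sys

import torch
import transformers

print("Python:", sys.version.split()[0])
print("Transformers:", transformers.__version__)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```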

Anything else?

No response