Current Behavior

When loading the model with quantization, the following message appears:
Failed to load cpm_kernels:[WinError 267] 目录名称无效。: 'C:\\Users\\Hengj\\AppData\\Local\\Programs\\Python\\Python310\\python.exe'
(WinError 267 translates to "The directory name is invalid.")

Full output when launching with Gradio:
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 7/7 [00:08<00:00, 1.17s/it]
Failed to load cpm_kernels:[WinError 267] 目录名称无效。: 'C:\\Users\\Hengj\\AppData\\Local\\Programs\\Python\\Python310\\python.exe'
Traceback (most recent call last):
File "E:\ChatGLM3\basic_demo\web_demo_gradio.py", line 29, in <module>
model = AutoModel.from_pretrained("E:\ChatGLM3", trust_remote_code=True, device_map="auto").quantize(4).cuda()
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\modeling_chatglm.py", line 1208, in quantize
self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\quantization.py", line 155, in quantize
layer.self_attention.query_key_value = QuantizedLinear(
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\quantization.py", line 139, in __init__
self.weight = compress_int4_weight(self.weight)
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\quantization.py", line 76, in compress_int4_weight
blockDim = (min(round_up(m, 32), 1024), 1, 1)
NameError: name 'round_up' is not defined
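For context, WinError 267 is Windows' way of reporting NotADirectoryError: a directory operation was handed a file path (here, python.exe itself). A minimal sketch of the same failure class, independent of cpm_kernels:

```python
import os
import tempfile

# Passing a file path where a directory path is expected raises
# NotADirectoryError -- surfaced on Windows as
# "[WinError 267] The directory name is invalid."
with tempfile.NamedTemporaryFile(delete=False) as f:
    file_path = f.name

try:
    os.listdir(file_path)  # file_path is a file, not a directory
except NotADirectoryError as e:
    print("caught:", type(e).__name__)
finally:
    os.unlink(file_path)
```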
Full output when launching with Streamlit:
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.55it/s]
Failed to load cpm_kernels:[WinError 267] 目录名称无效。: 'C:\\Users\\Hengj\\AppData\\Local\\Programs\\Python\\Python310\\python.exe'
2024-02-02 00:41:57.327 Uncaught app exception
Traceback (most recent call last):
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 264, in _get_or_create_cached_value
cached_result = cache.read_result(value_key)
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_resource_api.py", line 498, in read_result
raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 312, in _handle_cache_miss
cached_result = cache.read_result(value_key)
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_resource_api.py", line 498, in read_result
raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 535, in _run_script
exec(code, module.__dict__)
File "E:\ChatGLM3\basic_demo\web_demo_streamlit.py", line 37, in <module>
tokenizer, model = get_model()
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 212, in wrapper
return cached_func(*args, **kwargs)
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 241, in __call__
return self._get_or_create_cached_value(args, kwargs)
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 267, in _get_or_create_cached_value
return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
File "C:\Users\Hengj\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\caching\cache_utils.py", line 321, in _handle_cache_miss
computed_value = self._info.func(*func_args, **func_kwargs)
File "E:\ChatGLM3\basic_demo\web_demo_streamlit.py", line 32, in get_model
model = AutoModel.from_pretrained("E:\ChatGLM3", trust_remote_code=True, device_map="cuda").quantize(4).cuda()
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\modeling_chatglm.py", line 1208, in quantize
self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\quantization.py", line 155, in quantize
layer.self_attention.query_key_value = QuantizedLinear(
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\quantization.py", line 139, in __init__
self.weight = compress_int4_weight(self.weight)
File "C:\Users\Hengj\.cache\huggingface\modules\transformers_modules\ChatGLM3\quantization.py", line 76, in compress_int4_weight
blockDim = (min(round_up(m, 32), 1024), 1, 1)
NameError: name 'round_up' is not defined
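The NameError follows directly from the first warning: quantization.py imports round_up (along with the CUDA kernel helpers) from cpm_kernels, so when cpm_kernels fails to load on Windows the name is never defined. As a sketch only, assuming round_up has the "round x up to the next multiple of d" semantics implied by the blockDim computation, a pure-Python equivalent would be:

```python
def round_up(x: int, d: int) -> int:
    """Round x up to the nearest multiple of d (assumed semantics)."""
    return (x + d - 1) // d * d

# Matches the usage in compress_int4_weight:
#   blockDim = (min(round_up(m, 32), 1024), 1, 1)
print(round_up(1, 32), round_up(32, 32), round_up(33, 32))  # → 32 32 64
```

Note that defining this fallback alone would only silence the NameError; the CUDA quantization kernels provided by cpm_kernels would still be unavailable.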
Expected Behavior

The path should refer to a directory, not to a specific file.
Steps To Reproduce

1. Install all dependencies and download the model. Without quantization, everything runs normally.
2. Change line 29 of web_demo_gradio.py to
model = AutoModel.from_pretrained("E:\ChatGLM3", trust_remote_code=True, device_map="auto").quantize(4).cuda()
or change line 32 of web_demo_streamlit.py to
model = AutoModel.from_pretrained("E:\ChatGLM3", trust_remote_code=True, device_map="cuda").quantize(4).cuda()
3. In the basic_demo directory, run python web_demo_gradio.py or streamlit run web_demo_streamlit.py
4. After "Loading checkpoint shards" completes, the error described above appears in the console.
Environment
Anything else?
No response