THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs
Apache License 2.0

Quantization (int4) configured the new way reports an error #1095

Closed. qinzhenyi1314 closed this issue 5 months ago.

qinzhenyi1314 commented 5 months ago

System Info / 系統信息

Thanks for this open-source contribution. I set everything up in a conda environment and got it all running, but with int4 quantization configured the new way, it either raises an error or GPU memory usage does not decrease.

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

After updating to the new method it raises an error; if `device_map="auto"` is added instead, GPU memory usage does not decrease.

```python
self.model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).quantize(bits=4, device="cuda").cuda().eval()
```

Error message:

```
2024-04-07 10:20:04.026 Uncaught app exception
Traceback (most recent call last):
  File "/home/work/miniconda3/envs/ChatGLM3cuda121/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 542, in _run_script
    exec(code, module.__dict__)
  File "/data/workspace/condaProject/ChatGLM3-main/composite_demo/main.py", line 10, in <module>
    import demo_chat, demo_ci, demo_tool
  File "/data/workspace/condaProject/ChatGLM3-main/composite_demo/demo_chat.py", line 7, in <module>
    client = get_client()
  File "/home/work/miniconda3/envs/ChatGLM3cuda121/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 210, in wrapper
    return cached_func(*args, **kwargs)
  File "/home/work/miniconda3/envs/ChatGLM3cuda121/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 239, in __call__
    return self._get_or_create_cached_value(args, kwargs)
  File "/home/work/miniconda3/envs/ChatGLM3cuda121/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 266, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
  File "/home/work/miniconda3/envs/ChatGLM3cuda121/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 322, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
  File "/data/workspace/condaProject/ChatGLM3-main/composite_demo/client.py", line 28, in get_client
    client = HFClient(MODEL_PATH, TOKENIZER_PATH, PT_PATH)
  File "/data/workspace/condaProject/ChatGLM3-main/composite_demo/client.py", line 157, in __init__
    self.model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).quantize(bits=4, device="cuda").cuda().eval()
  File "/home/work/.cache/huggingface/modules/transformers_modules/a5ba5501eb873d40d48bd0983bd2a8dd006bb838/modeling_chatglm.py", line 1212, in quantize
    self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
  File "/home/work/.cache/huggingface/modules/transformers_modules/a5ba5501eb873d40d48bd0983bd2a8dd006bb838/quantization.py", line 156, in quantize
    layer.self_attention.query_key_value = QuantizedLinear(
  File "/home/work/.cache/huggingface/modules/transformers_modules/a5ba5501eb873d40d48bd0983bd2a8dd006bb838/quantization.py", line 128, in __init__
    assert str(weight.device).startswith('cuda'), 'The weights that need to be quantified should be on the CUDA device'
AssertionError: The weights that need to be quantified should be on the CUDA device
```
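For context: `from_pretrained` materializes the weights on the CPU, and the device check at quantization.py line 128 then fails when `quantize(..., device="cuda")` builds each `QuantizedLinear`. A minimal sketch of the failing check, as an illustration rather than the actual ChatGLM3 code:

```python
# Illustration of the assertion in QuantizedLinear.__init__, assuming the
# weight tensor arrives straight from from_pretrained and is still on CPU.
import torch

weight = torch.empty(4, 4)  # freshly loaded weight: lives on "cpu"
device = "cuda"

# The updated quantization.py performs this move before the check;
# comment it out to reproduce the AssertionError from the traceback.
weight = weight.to(device)

assert str(weight.device).startswith('cuda'), \
    'The weights that need to be quantified should be on the CUDA device'
```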

Expected behavior / 期待表现

Hoping you can help pin down the exact cause.

qinzhenyi1314 commented 5 months ago

I looked at another issue: the quantization.py in the Hugging Face repo needs to be updated. Verified OK after updating. Thanks for the fix.
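Judging from the traceback and the updated file quoted in the next comment, the edit to quantization.py amounts to one line that moves the weight onto the target device before the CUDA check. Sketched as a diff (not the verbatim upstream patch):

```diff
 class QuantizedLinear(torch.nn.Module):
     def __init__(self, weight_bit_width: int, weight, bias=None, device="cuda", dtype=None, empty_init=False):
         super().__init__()
+        weight = weight.to(device)  # ensure the weight is on the cuda device
         assert str(weight.device).startswith('cuda'), 'The weights that need to be quantified should be on the CUDA device'
```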

qinzhenyi1314 commented 5 months ago

Closing the issue.

aa644728538 commented 5 months ago

Could you explain what to change? My quantization.py file looks like this:

```python
class QuantizedLinear(torch.nn.Module):
    def __init__(self, weight_bit_width: int, weight, bias=None, device="cuda", dtype=None, empty_init=False):
        super().__init__()
        weight = weight.to(device)  # ensure the weight is on the cuda device
        assert str(weight.device).startswith('cuda'), 'The weights that need to be quantified should be on the CUDA device'
        self.weight_bit_width = weight_bit_width
```
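The file quoted above appears to already contain that fix (the `weight = weight.to(device)` line), so no further change should be needed, and the original int4 call should then load without the assertion. A quick verification sketch, with MODEL_PATH as a placeholder for your local checkpoint:

```python
from transformers import AutoModel

MODEL_PATH = "THUDM/chatglm3-6b"  # placeholder: point this at your local checkpoint

# Same int4 call as in the original report; with the updated
# quantization.py in place it should no longer raise the AssertionError.
model = (
    AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True)
    .quantize(bits=4, device="cuda")
    .cuda()
    .eval()
)
print(next(model.parameters()).device)  # expect a cuda device
```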