imilli opened this issue 1 week ago
Thank you very much for finding this issue!
To fix it, change line 438 of `tensorrt_llm/models/chatglm/convert.py` from:

```python
weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = torch.from_numpy(
    np.array([qkv_vals_int8['scale_y_quant_orig']],
             dtype=np.float32)).contiguous()
```

to:

```python
weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = qkv_vals_int8[
    'scale_y_quant_orig'].contiguous()
```
We will include the fix in the next release branch and in next week's main branch update.
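For anyone patching this locally in the meantime, here is a minimal sketch of why the corrected line works: the scaling factor is already a torch tensor, so it should stay in torch instead of round-tripping through `np.array` (which raises the `TypeError` below when the tensor lives on `cuda:0`). The tensor value `0.042` is a made-up stand-in for `qkv_vals_int8['scale_y_quant_orig']`, used here on CPU purely for illustration:

```python
import torch

# Hypothetical stand-in for qkv_vals_int8['scale_y_quant_orig'];
# in the real conversion this tensor is on cuda:0.
scale_y_quant_orig = torch.tensor([0.042], dtype=torch.float32)

# Old code (fails for a CUDA tensor, since np.array() tries to call
# Tensor.numpy() on device memory):
#   np.array([scale_y_quant_orig], dtype=np.float32)
#
# Fixed code from the comment above: keep the torch tensor as-is.
kv_cache_scaling_factor = scale_y_quant_orig.contiguous()

print(kv_cache_scaling_factor)
```

No host/device copy is needed at all: the weight stays a torch tensor, so it works regardless of which device it is on.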
System Info
GPU: NVIDIA RTX 4090
TensorRT-LLM: 0.13
```shell
root@docker-desktop:/llm/tensorrt-llm-0.13.0/examples/chatglm# python3 convert_checkpoint.py --chatglm_version glm4 --model_dir "/llm/other/models/glm-4-9b-chat" --output_dir "/llm/other/trt-model" --dtype float16 --use_weight_only --int8_kv_cache --weight_only_precision int8
```

```
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
0.13.0
Inferring chatglm version from path... Chatglm version: glm4
Loading checkpoint shards: 100%|███████████████████████| 10/10 [04:35<00:00, 27.53s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Calibration: 100%|███████████████████████| 64/64 [00:05<00:00, 10.68it/s]
Traceback (most recent call last):
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 263, in <module>
    main()
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 255, in main
    convert_and_save_hf(args)
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 213, in convert_and_save_hf
    ChatGLMForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/model.py", line 351, in quantize
    convert.quantize(hf_model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 723, in quantize
    weights = load_weights_from_hf_model(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 438, in load_weights_from_hf_model
    np.array([qkv_vals_int8['scale_y_quant_orig']],
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 1084, in __array__
    return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```