NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

convert_checkpoint reports an error #2356

Open · imilli opened this issue 1 week ago

imilli commented 1 week ago

System Info

GPU: NVIDIA RTX 4090
TensorRT-LLM: 0.13.0

root@docker-desktop:/llm/tensorrt-llm-0.13.0/examples/chatglm# python3 convert_checkpoint.py \
    --chatglm_version glm4 \
    --model_dir "/llm/other/models/glm-4-9b-chat" \
    --output_dir "/llm/other/trt-model" \
    --dtype float16 \
    --use_weight_only \
    --int8_kv_cache \
    --weight_only_precision int8

[TensorRT-LLM] TensorRT-LLM version: 0.13.0
Inferring chatglm version from path... Chatglm version: glm4
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 10/10 [04:35<00:00, 27.53s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Calibration: 100%|█████████████████████████████████████████████████████████████████████████| 64/64 [00:05<00:00, 10.68it/s]
Traceback (most recent call last):
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 263, in <module>
    main()
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 255, in main
    convert_and_save_hf(args)
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 213, in convert_and_save_hf
    ChatGLMForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/model.py", line 351, in quantize
    convert.quantize(hf_model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 723, in quantize
    weights = load_weights_from_hf_model(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 438, in load_weights_from_hf_model
    np.array([qkv_vals_int8['scale_y_quant_orig']],
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 1084, in __array__
    return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
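For reference, the failure reduces to generic PyTorch behavior: `np.array()` on a CUDA tensor invokes `Tensor.__array__`, which calls `Tensor.numpy()` and refuses to read device memory. A minimal sketch of the same error outside TensorRT-LLM (the `scale` value here is a hypothetical stand-in for `qkv_vals_int8['scale_y_quant_orig']`, which the calibration step leaves on the GPU):

```python
import numpy as np
import torch

# Hypothetical stand-in for the per-layer KV-cache scale computed during
# INT8 calibration; in convert.py it lives on the GPU, hence device="cuda".
scale = torch.tensor(0.02, device="cuda")

try:
    np.array([scale], dtype=np.float32)  # np.array() -> Tensor.__array__ -> Tensor.numpy()
except TypeError as err:
    print(err)  # can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() ...

host = np.array([scale.cpu()], dtype=np.float32)  # copying to host memory first succeeds
print(host)  # [0.02]
```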

wili-65535 commented 1 week ago

Thank you very much for finding this issue!

To fix this, we need to change "tensorrt_llm/models/chatglm/convert.py", line 438:

weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = torch.from_numpy(np.array([qkv_vals_int8['scale_y_quant_orig']], dtype=np.float32)).contiguous()

into

weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = qkv_vals_int8['scale_y_quant_orig'].contiguous()

We will fix this in the next release branch and in next week's main branch.
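Until that lands, a local patch with the same effect (assuming, as the traceback indicates, that `scale_y_quant_orig` is a CUDA tensor) is to copy the tensor to host memory before the NumPy round-trip:

```python
# Sketch of an equivalent local workaround for convert.py, line 438: move the
# CUDA tensor to the CPU before np.array() touches it. This keeps the original
# float32 cast, whereas the fix above stores the device tensor directly.
weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = torch.from_numpy(
    np.array([qkv_vals_int8['scale_y_quant_orig'].cpu()],
             dtype=np.float32)).contiguous()
```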