NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

convert_checkpoint reports an error #2356

Open · imilli opened this issue 1 week ago

imilli commented 1 week ago

System Info

GPU: NVIDIA RTX 4090
TensorRT-LLM: 0.13.0

root@docker-desktop:/llm/tensorrt-llm-0.13.0/examples/chatglm# python3 convert_checkpoint.py \
    --chatglm_version glm4 \
    --model_dir "/llm/other/models/glm-4-9b-chat" \
    --output_dir "/llm/other/trt-model" \
    --dtype float16 \
    --use_weight_only \
    --int8_kv_cache \
    --weight_only_precision int8

[TensorRT-LLM] TensorRT-LLM version: 0.13.0
Inferring chatglm version from path... Chatglm version: glm4
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████| 10/10 [04:35<00:00, 27.53s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Calibration: 100%|█████████████████████████████████████████████████████████████████████████| 64/64 [00:05<00:00, 10.68it/s]
Traceback (most recent call last):
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 263, in <module>
    main()
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 255, in main
    convert_and_save_hf(args)
  File "/llm/tensorrt-llm-0.13.0/examples/chatglm/convert_checkpoint.py", line 213, in convert_and_save_hf
    ChatGLMForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/model.py", line 351, in quantize
    convert.quantize(hf_model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 723, in quantize
    weights = load_weights_from_hf_model(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/chatglm/convert.py", line 438, in load_weights_from_hf_model
    np.array([qkv_vals_int8['scale_y_quant_orig']],
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 1084, in __array__
    return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
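For reference, the failure reduces to generic PyTorch behavior: `np.array()` on a CUDA tensor invokes `Tensor.__array__`, which calls `Tensor.numpy()` and refuses to read device memory. A minimal sketch of the same error outside TensorRT-LLM (the `scale` value here is a hypothetical stand-in for `qkv_vals_int8['scale_y_quant_orig']`, which the calibration step leaves on the GPU):

```python
import numpy as np
import torch

# Hypothetical stand-in for the per-layer KV-cache scale computed during
# INT8 calibration; in convert.py it lives on the GPU, hence device="cuda".
scale = torch.tensor(0.02, device="cuda")

try:
    np.array([scale], dtype=np.float32)  # np.array() -> Tensor.__array__ -> Tensor.numpy()
except TypeError as err:
    print(err)  # can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() ...

host = np.array([scale.cpu()], dtype=np.float32)  # copying to host memory first succeeds
print(host)  # [0.02]
```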

wili-65535 commented 1 week ago

Thank you very much for finding this issue!

To fix this, we need to change "tensorrt_llm/models/chatglm/convert.py", line 438:

weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = torch.from_numpy(np.array([qkv_vals_int8['scale_y_quant_orig']], dtype=np.float32)).contiguous()

into

weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = qkv_vals_int8['scale_y_quant_orig'].contiguous()

We will fix this in the next release branch and in next week's main branch.
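Until that lands, a local patch with the same effect (assuming, as the traceback indicates, that `scale_y_quant_orig` is a CUDA tensor) is to copy the tensor to host memory before the NumPy round-trip:

```python
# Sketch of an equivalent local workaround for convert.py, line 438: move the
# CUDA tensor to the CPU before np.array() touches it. This keeps the original
# float32 cast, whereas the fix above stores the device tensor directly.
weights[f'{tllm_prex}.attention.kv_cache_scaling_factor'] = torch.from_numpy(
    np.array([qkv_vals_int8['scale_y_quant_orig'].cpu()],
             dtype=np.float32)).contiguous()
```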