NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
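
For orientation, here is a minimal sketch of that Python API, based on the high-level `LLM` entry point documented for recent releases; the model name and sampling parameters are illustrative and may differ across versions:

```python
from tensorrt_llm import LLM, SamplingParams

# Constructing the LLM implicitly builds a TensorRT engine for the model.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```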

Convert Qwen2-0.5B Failed When Using INT4 GPTQ #2454

Open ReginaZh opened 4 days ago

ReginaZh commented 4 days ago

System Info

A100

Who can help?

No response


Reproduction

cd /examples/qwen/
git clone https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4
python convert_checkpoint.py --model_dir ./Qwen2-0.5B-Instruct-GPTQ-Int4 --output_dir ./tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq

Expected behavior

Successfully convert and save model checkpoints

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024111200
0.15.0.dev2024111200
438it [00:02, 155.56it/s]
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 308, in <module>
    main()
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 300, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 256, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 263, in execute
    f(args, rank)
  File "/app/tensorrt_llm/examples/qwen/convert_checkpoint.py", line 253, in convert_and_save_rank
    qwen.save_checkpoint(args.output_dir, save_config=(rank == 0))
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 688, in save_checkpoint
    safetensors.torch.save_file(
  File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 488, in _flatten
    raise RuntimeError(
RuntimeError:
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'transformer.vocab_embedding.weight', 'lm_head.weight'}].
            A potential way to correctly save your model is to use `save_model`.
            More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
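
For context, `safetensors` refuses to serialize any state dict in which two entries alias the same storage, which is exactly what happens here: Qwen2-0.5B ties `lm_head.weight` to the vocab embedding, as the error message's `{'transformer.vocab_embedding.weight', 'lm_head.weight'}` pair shows. A minimal standalone repro of the same error (tensor names and filename are illustrative):

```python
import torch
from safetensors.torch import save_file

emb = torch.zeros(8, 4)
tied = emb  # second entry aliases the same storage, like a tied lm_head

# Raises RuntimeError: "Some tensors share memory, ..."
save_file({"vocab_embedding.weight": emb, "lm_head.weight": tied},
          "ckpt.safetensors")
```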

Additional notes

transformers version: 4.42.3; TensorRT-LLM version: 0.15.0.dev2024111200

hello-11 commented 2 days ago

@ReginaZh Could you try the latest main branch? We have fixed it there.

jershi425 commented 2 days ago

@ReginaZh This issue is fixed, but the fix has not been merged into main yet. Until then, you can apply this hot fix: add `from collections import Counter` to https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py and insert the following after this line:


```python
# `weights` (a Dict[str, torch.Tensor]) is already in scope at the insertion
# point; if `weights_ptrs` is not defined there yet, build it first:
weights_ptrs = [tensor.data_ptr() for tensor in weights.values()]
# Data pointers seen more than once mark tensors that share storage.
repeated_ptrs = [
    ptr for ptr, count in Counter(weights_ptrs).items() if count > 1
]
# Clone each aliased tensor so every checkpoint entry owns its own memory.
for key in weights:
    if weights[key].data_ptr() in repeated_ptrs:
        weights[key] = weights[key].clone().detach()
```
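
The `clone().detach()` calls give each duplicated tensor its own storage, so `safetensors.torch.save_file` can serialize every entry independently; the cost is one extra on-disk copy of the tied embedding. The `save_model` helper suggested in the error message takes the opposite approach and deduplicates shared tensors before saving.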