Cloning into 'Qwen2-0.5B-Instruct'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 33 (delta 12), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (33/33), 3.60 MiB | 6.54 MiB/s, done.
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024073000
0.12.0.dev2024073000
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1429: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
calibrating model: 0%| | 0/512 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:20<00:00, 24.44it/s]
Weights loaded. Total time: 00:00:18
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/qwen/./convert_checkpoint.py", line 309, in <module>
main()
File "/app/tensorrt_llm/examples/qwen/./convert_checkpoint.py", line 301, in main
convert_and_save_hf(args)
File "/app/tensorrt_llm/examples/qwen/./convert_checkpoint.py", line 228, in convert_and_save_hf
QWenForCausalLM.quantize(args.model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 380, in quantize
convert.quantize(hf_model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1207, in quantize
safetensors.torch.save_file(
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 284, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 480, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'transformer.vocab_embedding.weight', 'lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
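For what it's worth, the failure can be reproduced outside of convert_checkpoint.py: safetensors refuses to serialize two dictionary entries that point at the same storage, which is exactly what happens when lm_head.weight is tied to the vocab embedding. The sketch below (dummy shapes, made-up file name, and the clone() workaround are my own illustration, not TensorRT-LLM code) hits the same RuntimeError and shows one way around it:

import torch
import safetensors.torch

# Two names backed by one storage, mimicking the tied vocab_embedding / lm_head weights.
embedding = torch.randn(8, 4)
tied_lm_head = embedding

try:
    # Raises the same "Some tensors share memory" RuntimeError as in the traceback above.
    safetensors.torch.save_file(
        {"transformer.vocab_embedding.weight": embedding, "lm_head.weight": tied_lm_head},
        "shared_tensors_demo.safetensors",
    )
except RuntimeError as err:
    print(err)

# Possible workaround (an assumption on my side, not the upstream fix): clone one of the
# tensors so each entry owns its own storage before saving.
safetensors.torch.save_file(
    {"transformer.vocab_embedding.weight": embedding, "lm_head.weight": tied_lm_head.clone()},
    "shared_tensors_demo.safetensors",
)

The error message also suggests safetensors.torch.save_model, which deduplicates shared tensors for an nn.Module, but convert.py is saving a plain weight dict here, so the tie apparently has to be broken before save_file is called.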
System Info
GPU Type: A6000
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
git clone https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
python3 ./convert_checkpoint.py --model_dir ./Qwen2-0.5B-Instruct --output_dir ./tllm_checkpoint_1gpu_sq --dtype float16 --smoothquant 0.5
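As a side note on the reproduction, the root cause seems to be that this checkpoint ties the output head to the input embedding. A quick illustrative check of the Hugging Face config (not part of the repro steps; whether the flag is actually True for Qwen2-0.5B-Instruct is my assumption based on the error) would be:

from transformers import AutoConfig

# Illustrative only: when tie_word_embeddings is set, lm_head.weight and the vocab
# embedding end up as one shared tensor, which is what safetensors later rejects.
cfg = AutoConfig.from_pretrained("./Qwen2-0.5B-Instruct")
print(cfg.tie_word_embeddings)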
Expected behavior
The model checkpoint is converted and saved successfully.
actual behavior
The conversion fails with a RuntimeError raised by safetensors (full log and traceback above): transformer.vocab_embedding.weight and lm_head.weight share memory, so convert_checkpoint.py cannot write the quantized checkpoint.
additional notes
transformers version: 4.42.4
TensorRT-LLM version: 0.12.0.dev2024073000