NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

llama3-70b int8+kv8 convert checkpoint failed on v0.10.0 branch #1814

Open NaNAGISaSA opened 1 week ago

NaNAGISaSA commented 1 week ago

System Info

Who can help?

@Tracin @nv-guomingz

Information

Tasks

Reproduction

model_name=llama3_70b
hf_model_dir=/some-path/Meta-Llama-3-70B-Instruct
convert_model_dir=/some-path
trt_engine_dir=/some-path
dtype=bfloat16  # referenced in --output_dir below
tp_size=2  # tp_size=4 and tp_size=8 produce the same error

python3 examples/llama/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --workers ${tp_size} \
    --use_weight_only \
    --weight_only_precision int8 \
    --int8_kv_cache \
    --dtype bfloat16 \
    --output_dir ${convert_model_dir}/${dtype}/${tp_size}-gpu/

Expected behavior

The checkpoint conversion succeeds.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.10.0
0.10.0
Loading checkpoint shards: 100%|██████████| 30/30 [00:23<00:00, 1.27it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 464, in <module>
    main()
  File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 456, in main
    convert_and_save_hf(args)
  File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 360, in convert_and_save_hf
    LLaMAForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 414, in quantize
    convert.quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1387, in quantize
    act_range, llama_qkv_para, llama_smoother = smooth_quant(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1160, in smooth_quant
    tokenizer = AutoTokenizer.from_pretrained(model_dir,
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 883, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
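
The failure seems reproducible outside convert_checkpoint.py by loading the tokenizer the same way the quantization path does. A minimal sketch, assuming the root cause is that the Meta-Llama-3 checkpoint ships only a fast tokenizer.json and no SentencePiece tokenizer.model, so the slow LlamaTokenizer shown in the traceback ends up with vocab_file=None:

from transformers import AutoTokenizer

hf_model_dir = "/some-path/Meta-Llama-3-70B-Instruct"  # same directory as above

try:
    # Force the slow (SentencePiece-based) LlamaTokenizer, the class shown in
    # the traceback; without a tokenizer.model file its vocab_file is None and
    # sentencepiece raises "TypeError: not a string".
    AutoTokenizer.from_pretrained(hf_model_dir, use_fast=False)
except TypeError as err:
    print(f"slow tokenizer failed: {err}")

# The fast tokenizer (tokenizer.json) loads without issues.
tokenizer = AutoTokenizer.from_pretrained(hf_model_dir)
print(type(tokenizer).__name__)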

Additional notes

I also tested Llama-3-8B (with hf_model_dir changed to Meta-Llama-3-8B-Instruct), and the conversion succeeds:

[TensorRT-LLM] TensorRT-LLM version: 0.10.0
0.10.0
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.36it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
calibrating model: 100%|██████████| 512/512 [01:25<00:00, 6.00it/s]
Weights loaded. Total time: 00:00:41
Weights loaded. Total time: 00:00:36
Total time of converting checkpoints: 00:03:31

hijkzzz commented 6 days ago

Could you try the latest version, TRT-LLM 0.11+? See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html
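
After upgrading, you can confirm which version is active (a minimal check; tensorrt_llm exposes its version string as tensorrt_llm.__version__, the same value the convert script prints at startup):

import tensorrt_llm

# Should report 0.11.0 or newer after the upgrade.
print(tensorrt_llm.__version__)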

Yoh-Z commented 23 hours ago

Could you try the latest version, TRT-LLM 0.11+? See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html

Which commit corresponds to version 0.11.0?