After running the following two commands, the engine was built successfully:

python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ --output_dir ./tllm_checkpoint_1gpu_fp16 --dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu --gemm_plugin float16
Executing the following command then failed with the error shown below:

python3 ../summarize.py --test_trt_llm --hf_model_dir ./tmp/Qwen/7B/ --data_type fp16 --engine_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/ --max_input_length 1024 --output_len 1024
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[04/19/2024-07:15:51] [TRT-LLM] [I] Load tokenizer takes: 0.29200124740600586 sec
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draftlen will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 14730 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20502, GPU 18136 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +205, GPU +58, now: CPU 20707, GPU 18194 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +14727, now: CPU 0, GPU 14727 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20733, GPU 18322 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 20733, GPU 18330 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 14727 (MiB)
[TensorRT-LLM][WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 10752. Allocating 5637144576 bytes.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[04/19/2024-07:18:30] [TRT-LLM] [I] Load engine takes: 145.07931542396545 sec
Traceback (most recent call last):
  File "/examples/qwen/../summarize.py", line 686, in <module>
    main(args)
  File "/examples/qwen/../summarize.py", line 372, in main
    output, _ = eval_trt_llm(datapoint,
  File "/examples/qwen/../summarize.py", line 177, in eval_trt_llm
    batch_input_ids = _prepare_inputs(datapoint[dataset_input_key],
  File "/examples/qwen/../summarize.py", line 153, in _prepare_inputs
    _, input_id_list = make_context(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/utils.py", line 31, in make_context
    im_start_tokens = [tokenizer.im_start_id]
AttributeError: 'Qwen2TokenizerFast' object has no attribute 'im_start_id'
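The traceback suggests that make_context in tensorrt_llm/models/qwen/utils.py assumes the legacy Qwen tokenizer, which exposes the chat-start marker id as an im_start_id attribute, while Qwen2TokenizerFast does not have that attribute. One possible workaround (an untested sketch, assuming the standard Hugging Face convert_tokens_to_ids API and that the model's chat-start special token is the string "<|im_start|>") is a small compatibility helper that handles both tokenizer generations:

```python
def get_im_start_id(tokenizer):
    """Return the id of the chat-start special token for old and new Qwen tokenizers."""
    # The legacy Qwen tokenizer exposes the id directly as an attribute.
    if hasattr(tokenizer, "im_start_id"):
        return tokenizer.im_start_id
    # Qwen2TokenizerFast does not, so look the special token up by its string form.
    return tokenizer.convert_tokens_to_ids("<|im_start|>")
```

In make_context, `im_start_tokens = [tokenizer.im_start_id]` could then become `im_start_tokens = [get_im_start_id(tokenizer)]` (and similarly for the im_end marker, if it is accessed the same way).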