Franc-Z / QWen1.5_TensorRT-LLM

Optimize QWen1.5 models with TensorRT-LLM
Apache License 2.0

Error while executing the "Check the accuracy of the optimized engine" step when running summarize.py #2

Open wfd2022 opened 4 months ago

wfd2022 commented 4 months ago

After running the following two commands, the engine was built successfully:

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu \
    --gemm_plugin float16
```

Then, after executing the following command, the error below occurred:

```bash
python3 ../summarize.py --test_trt_llm \
    --hf_model_dir ./tmp/Qwen/7B/ \
    --data_type fp16 \
    --engine_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/ \
    --max_input_length 1024 \
    --output_len 1024
```

```
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[04/19/2024-07:15:51] [TRT-LLM] [I] Load tokenizer takes: 0.29200124740600586 sec
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 14730 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20502, GPU 18136 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +205, GPU +58, now: CPU 20707, GPU 18194 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +14727, now: CPU 0, GPU 14727 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20733, GPU 18322 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 20733, GPU 18330 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 14727 (MiB)
[TensorRT-LLM][WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 10752. Allocating 5637144576 bytes.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
[04/19/2024-07:18:30] [TRT-LLM] [I] Load engine takes: 145.07931542396545 sec
Traceback (most recent call last):
  File "/examples/qwen/../summarize.py", line 686, in <module>
    main(args)
  File "/examples/qwen/../summarize.py", line 372, in main
    output, *_ = eval_trt_llm(datapoint,
  File "/examples/qwen/../summarize.py", line 177, in eval_trt_llm
    batch_input_ids = _prepare_inputs(datapoint[dataset_input_key],
  File "/examples/qwen/../summarize.py", line 153, in _prepare_inputs
    _, input_id_list = make_context(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/utils.py", line 31, in make_context
    im_start_tokens = [tokenizer.im_start_id]
AttributeError: 'Qwen2TokenizerFast' object has no attribute 'im_start_id'
```
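The traceback points at `make_context` in `tensorrt_llm/models/qwen/utils.py`, which assumes the original Qwen(1) tokenizer that exposes the ChatML control-token ids as attributes (`tokenizer.im_start_id`). Qwen1.5 checkpoints load `Qwen2TokenizerFast` instead, which registers `<|im_start|>` / `<|im_end|>` as special tokens but does not provide those attributes, hence the `AttributeError`. A minimal sketch of a workaround, assuming the ChatML tokens are present in the Qwen1.5 vocabulary, is to look the ids up through the generic Hugging Face tokenizer API rather than the Qwen1-only attributes:

```python
# Minimal sketch, assuming a Qwen1.5 checkpoint whose Qwen2TokenizerFast
# registers "<|im_start|>" / "<|im_end|>" as special tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tmp/Qwen/7B/", trust_remote_code=True)

# Qwen(1) tokenizers expose these ids as attributes (tokenizer.im_start_id);
# Qwen2TokenizerFast does not, so resolve them via the standard API instead:
im_start_id = tokenizer.convert_tokens_to_ids("<|im_start|>")
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
print(im_start_id, im_end_id)  # e.g. 151644 151645 for Qwen1.5 vocabularies
```

Applying the same substitution inside `make_context` (replacing each `tokenizer.im_start_id` / `tokenizer.im_end_id` access with a `convert_tokens_to_ids` lookup) should let `summarize.py` proceed past this point, though this is an assumption about the fix rather than a change confirmed by the repository.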
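Separately, and unrelated to the crash, the CUDA lazy loading warning in the log above can be addressed by exporting the environment variable documented in the linked CUDA programming guide before launching the script:

```bash
# Enables CUDA lazy loading (CUDA 11.7+), as the warning in the log suggests.
export CUDA_MODULE_LOADING=LAZY
```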