NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Error with runner.generate in TensorRT-LLM 0.14.0 for Qwen Example #2452

Open tedqu opened 2 days ago

tedqu commented 2 days ago

Environment

•   Docker Image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
•   TensorRT-LLM Version: 0.14.0
•   Run Command:

python3 ../run.py \
    --input_text "你好,请问你叫什么?" \
    --max_output_len=50 \
    --tokenizer_dir /data/models/Qwen1.5-7B-Chat/ \
    --engine_dir ./tmp/qwen/7B/trt_engines/fp16/1-gpu/

•   Example Code: examples/qwen/run.py (from the README)

Description

While running the run.py script as described in the README of the examples/qwen/ directory, the following error occurs when invoking runner.generate (full traceback below).
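
For reference, the failing call path looks roughly like this. It is a hedged sketch reconstructed from the traceback, not the verbatim examples/qwen/run.py code; argument names such as batch_input_ids follow the ModelRunner API and may differ slightly in 0.14.0:

# Hedged sketch of the failing call path, reconstructed from the traceback
# below; not the verbatim examples/qwen/run.py code.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

# Load the prebuilt engine (path taken from the run command above).
runner = ModelRunnerCpp.from_dir(
    engine_dir="./tmp/qwen/7B/trt_engines/fp16/1-gpu/")

# generate() converts these Python-level kwargs into
# tensorrt_llm.bindings.executor.Request objects, which is where the
# TypeError below is raised.
outputs = runner.generate(
    batch_input_ids=[torch.tensor([151644, 8948, 198])],  # tokenized prompt (truncated)
    max_new_tokens=50,
    end_id=151645,
    pad_id=151643,
)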

Error Traceback

Traceback (most recent call last):
  File "/triton/TensorRT-LLM-release-0.14/examples/qwen/../run.py", line 887, in <module>
    main(args)
  File "/triton/TensorRT-LLM-release-0.14/examples/qwen/../run.py", line 711, in main
    outputs = runner.generate(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 624, in generate
    requests = [
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 625, in <listcomp>
    trtllm.Request(
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:

  1. tensorrt_llm.bindings.executor.Request(input_token_ids: list[int], *, max_tokens: Optional[int] = None, max_new_tokens: Optional[int] = None, streaming: bool = False, sampling_config: tensorrt_llm.bindings.executor.SamplingConfig = SamplingConfig(), output_config: tensorrt_llm.bindings.executor.OutputConfig = OutputConfig(), end_id: Optional[int] = None, pad_id: Optional[int] = None, position_ids: Optional[list[int]] = None, bad_words: Optional[list[list[int]]] = None, stop_words: Optional[list[list[int]]] = None, embedding_bias: Optional[torch.Tensor] = None, external_draft_tokens_config: Optional[tensorrt_llm.bindings.executor.ExternalDraftTokensConfig] = None, prompt_tuning_config: Optional[tensorrt_llm.bindings.executor.PromptTuningConfig] = None, lora_config: Optional[tensorrt_llm.bindings.executor.LoraConfig] = None, lookahead_config: Optional[tensorrt_llm.bindings.executor.LookaheadDecodingConfig] = None, logits_post_processor_name: Optional[str] = None, encoder_input_token_ids: Optional[list[int]] = None, client_id: Optional[int] = None, return_all_generated_tokens: bool = False, priority: float = 0.5, type: tensorrt_llm.bindings.executor.RequestType = RequestType.REQUEST_TYPE_CONTEXT_AND_GENERATION, context_phase_params: Optional[tensorrt_llm.bindings.executor.ContextPhaseParams] = None, encoder_input_features: Optional[torch.Tensor] = None, encoder_output_length: Optional[int] = None, num_return_sequences: int = 1)

Invoked with: kwargs: input_token_ids=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 108386, 37945, 56007, 56568, 99882, 99245, 11319, 151645, 198, 151644, 77091, 198], encoder_input_token_ids=None, encoder_output_length=None, encoder_input_features=None, position_ids=None, max_tokens=50, num_return_sequences=None, pad_id=151643, end_id=151645, stop_words=None, bad_words=None, sampling_config=<tensorrt_llm.bindings.executor.SamplingConfig object at 0x7f000502f830>, lookahead_config=None, streaming=False, output_config=<tensorrt_llm.bindings.executor.OutputConfig object at 0x7f0001cca270>, prompt_tuning_config=None, lora_config=None, return_all_generated_tokens=False, logits_post_processor_name=None, external_draft_tokens_config=None

Additional Context

The engine and tokenizer paths are configured as follows:

•   --tokenizer_dir: /data/models/Qwen1.5-7B-Chat/
•   --engine_dir: ./tmp/qwen/7B/trt_engines/fp16/1-gpu/

The engine appears to load successfully, as indicated by the log output:

[TensorRT-LLM][INFO] Engine version 0.14.0 found in the config file, assuming engine(s) built by new builder API.
...
[11/18/2024-02:33:18] [TRT-LLM] [I] Load engine takes: 12.188158512115479 sec

However, the error indicates that the keyword arguments passed to the tensorrt_llm.bindings.executor.Request constructor do not match its signature. Comparing the "Invoked with" kwargs against the supported signature above, the clearest mismatch appears to be num_return_sequences=None: the binding declares num_return_sequences: int = 1 (not Optional), so passing None would be rejected as an incompatible constructor argument.
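
If that reading is correct, one possible local workaround is to drop None-valued kwargs before they reach the Request constructor, so the binding's own defaults apply. This is an untested sketch, not the library's actual code; make_request is a hypothetical helper:

from tensorrt_llm.bindings import executor as trtllm

def make_request(input_token_ids, **kwargs):
    # Hypothetical helper (not part of TensorRT-LLM): strip None values so
    # pybind11 never sees None for a non-Optional parameter such as
    # num_return_sequences (declared as `int = 1`).
    kwargs = {k: v for k, v in kwargs.items() if v is not None}
    return trtllm.Request(input_token_ids, **kwargs)

# Mirrors the failing invocation from the traceback (token ids truncated):
request = make_request(
    [151644, 8948, 198],
    max_tokens=50,
    end_id=151645,
    pad_id=151643,
    num_return_sequences=None,  # raises TypeError if passed through unfiltered
)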

If more logs or information are needed, please let me know! Thank you!

yspch2022 commented 1 day ago

Hello. I saw the same error when I converted Llama 3.0 using trt-llm 0.14. So I switched to trt-llm 0.13, and then I was able to convert Llama 3.0 to a TRT model.
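
A quick way to confirm which version is actually active after switching (assuming the package exposes __version__, as recent releases do):

import tensorrt_llm

# Print the installed TensorRT-LLM version; expect 0.13.x after the downgrade.
print(tensorrt_llm.__version__)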

tedqu commented 1 day ago

Thanks!! Cool. I used a similar method to solve this problem: I ran the run.py file from a previous version of the code (I'm not sure of the exact version number), and the problem was resolved. It looks like a small bug in the latest version.
