NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Beam search not working properly with Executor API #1574

Closed: AlessioNetti closed this issue 5 months ago

AlessioNetti commented 5 months ago

Who can help?

@byshiue

Reproduction

In the latest dev version of TensorRT-LLM, beam search does not work properly when using the Python bindings for the Executor API. In particular, the maximum beam width the engine was built with is simply ignored, and any request with a beam width > 1 leads to an assertion error.

To reproduce the issue, one can build a Falcon 7B engine with a maximum beam width of 4 via the following commands:

python convert_checkpoint.py --model_dir ./falcon_7b_tp1_instruct/ --dtype bfloat16 --output_dir ./falcon_7b_tp1_instruct_trt_chkpt

trtllm-build --checkpoint_dir ./falcon_7b_tp1_instruct_trt_chkpt/ --gemm_plugin bfloat16 --remove_input_padding enable --gpt_attention_plugin bfloat16 --output_dir ./falcon_7b_tp1_instruct_p200_g200_b4 --gather_all_token_logits --max_input_len 200 --max_output_len 200 --max_batch_size 64 --max_beam_width 4

One can then submit a request with beam width 4 to the examples/bindings/executor/example_basic.py script, modified as follows:

diff --git a/examples/bindings/executor/example_basic.py b/examples/bindings/executor/example_basic.py
index 2c7a3fc..97364ee 100644
--- a/examples/bindings/executor/example_basic.py
+++ b/examples/bindings/executor/example_basic.py
@@ -22,7 +22,8 @@ if __name__ == "__main__":
     if executor.can_enqueue_requests():
         # Create the request.
         request = trtllm.Request(input_token_ids=[1, 2, 3, 4],
-                                 max_new_tokens=10)
+                                 max_new_tokens=10,
+                                 sampling_config=trtllm.SamplingConfig(beam_width=4))

         # Enqueue the request.
         request_id = executor.enqueue_request(request)
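
For completeness, a rough sketch of how the modified script would then collect the beams, based on the remainder of example_basic.py; the attribute names used below (await_responses, has_error, result.output_token_ids) are assumed from that example and may differ between TensorRT-LLM versions:

# Wait for the responses and print one token list per beam (sketch only).
responses = executor.await_responses(request_id)
for response in responses:
    if not response.has_error():
        # With beam search enabled, output_token_ids should hold one list per beam.
        for beam, tokens in enumerate(response.result.output_token_ids):
            print(f"beam {beam}: {tokens}")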

Expected behavior

The expected outcome is for the request above to be executed by the engine and for 4 generation beams to be produced. This is what happens with the examples/run.py script, which is based on the Python ModelRunner and can process requests with any number of beams.
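
For comparison, a minimal sketch of the ModelRunner-based path (engine path taken from the build command above; the keyword arguments follow run.py and may differ across versions, and tokenizer-specific end/pad IDs are omitted for brevity):

import torch
from tensorrt_llm.runtime import ModelRunner

# Load the engine built above; with ModelRunner the beam width is a per-call
# argument rather than a value fixed when the runtime is created.
runner = ModelRunner.from_dir(engine_dir="./falcon_7b_tp1_instruct_p200_g200_b4")
batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]
outputs = runner.generate(batch_input_ids,
                          max_new_tokens=10,
                          num_beams=4)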

Actual behavior

The request fails with the following error, which makes it seem like the Executor backend does not detect the maximum beam width that the engine was built for:

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: Request with beam width 4 differs from the max beam width 1 (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:577)
1       0x7fa20e426304 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7fa20e448e7f /virtualenv/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b2e7f) [0x7fa20e448e7f]
3       0x7fa210394741 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2177
4       0x7fa2103b93a7 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 71
5       0x7fa2103bc26c tensorrt_llm::executor::Executor::Impl::executionLoop() + 396
6       0x7fa299eb0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa299eb0253]
7       0x7fa376d91ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa376d91ac3]
8       0x7fa376e23850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa376e23850]

Additional notes

This bug was also present in earlier dev versions of TensorRT-LLM 0.10, prior to 0.10.0.dev2024050700. For those, the behavior was slightly different: the Executor API would only accept requests whose beam width was equal to the maximum the engine was built for, and otherwise failed with the same assertion as above.

Our general expectation would be to build a single engine with a given maximum beam width, and then be able to process requests with any number of beams from 1 (no beam search) up to that value. This also seems to be how the Python ModelRunner behaves.
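
To illustrate, a sketch of the usage pattern we have in mind, reusing the Request/SamplingConfig calls from the diff above (whether the second request is accepted is exactly what this issue is about):

# Two requests against the same executor: one greedy, one with beam search.
req_greedy = trtllm.Request(input_token_ids=[1, 2, 3, 4],
                            max_new_tokens=10,
                            sampling_config=trtllm.SamplingConfig(beam_width=1))
req_beam = trtllm.Request(input_token_ids=[1, 2, 3, 4],
                          max_new_tokens=10,
                          sampling_config=trtllm.SamplingConfig(beam_width=4))
request_ids = [executor.enqueue_request(r) for r in (req_greedy, req_beam)]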

AlessioNetti commented 5 months ago

I did some more digging into the issue and wanted to quickly follow up: I had missed that in example_basic.py, the first argument to the ExecutorConfig constructor is actually max_beam_width, which defaults to 1.

Setting max_beam_width to the engine's limit makes it possible to process all requests with a matching beam width. However, requests with a different beam width (i.e., < max_beam_width) still fail. Ideally, we would want a single engine to be able to handle requests with different beam widths.
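
For reference, the adjustment described above looks roughly like this (the import and constructor arguments follow example_basic.py and the current Python bindings, and may change between versions):

import tensorrt_llm.bindings.executor as trtllm

# Match the executor's max_beam_width to the value the engine was built with
# (trtllm-build --max_beam_width 4); the default of 1 rejects beam-search requests.
executor = trtllm.Executor("./falcon_7b_tp1_instruct_p200_g200_b4",
                           trtllm.ModelType.DECODER_ONLY,
                           trtllm.ExecutorConfig(max_beam_width=4))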

MartinMarciniszyn commented 5 months ago

Thanks for the update. We are aware of the limitation that all requests need to have the same beam width. We are tracking this internally and will announce it once it has been resolved.