Closed — AlessioNetti closed this issue 5 months ago
I did some more digging on the issue and wanted to quickly follow up: I missed the fact that in `example_basic.py`, the first argument to the `ExecutorConfig` constructor is actually `max_beam_width`, which defaults to 1. Setting `max_beam_width` to the engine's limit allows all requests with a matching beam width to be processed. However, all requests with a different beam width (i.e., < `max_beam_width`) still fail. Ideally, we'd want a single engine to be able to handle requests with different beam widths.
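The gap described above (requests with a beam width below `max_beam_width` failing) can be summarized in a small self-contained sketch. This is plain Python with no TensorRT-LLM dependency; both functions are hypothetical models of the observed validation behavior, not the library's actual code:

```python
# Hypothetical model of the Executor's current request check: a request is
# accepted only when its beam width exactly matches the configured
# max_beam_width.
def current_check(request_beam_width: int, max_beam_width: int) -> bool:
    return request_beam_width == max_beam_width

# Desired behavior: accept any beam width from 1 (no beam search) up to the
# engine's build-time limit.
def desired_check(request_beam_width: int, max_beam_width: int) -> bool:
    return 1 <= request_beam_width <= max_beam_width

# With max_beam_width=4, a beam-width-2 request currently fails but
# ideally should be accepted.
assert not current_check(2, 4)
assert desired_check(2, 4)
```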
Thanks for the update. We are aware of the limitation that all requests need to have the same beam width. We track this internally and will announce once it has been resolved.
System Info
Who can help?
@byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
In the latest `dev` version of TensorRT-LLM, beam search does not work properly when using the Python bindings for the Executor API. In particular, the engine's beam width is simply ignored, and any request with beam width > 1 leads to an assertion error.

To reproduce the issue, one can build a Falcon 7B engine with beam width 4 via the following:
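The original build command was not captured here; a plausible sketch, assuming a checkpoint already converted to TensorRT-LLM format and `trtllm-build`'s `--max_beam_width` option (paths are placeholders):

```shell
# Build a Falcon 7B engine that supports beam widths up to 4.
# ./falcon-7b-ckpt is a placeholder for a converted TensorRT-LLM checkpoint.
trtllm-build \
    --checkpoint_dir ./falcon-7b-ckpt \
    --output_dir ./falcon-7b-engine \
    --max_beam_width 4
```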
One can then submit a request with beam width 4 to the `examples/bindings/executor/example_basic.py` script, modified as follows:

Expected behavior
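The modification referenced in the Reproduction section was presumably along these lines: a sketch assuming the executor Python bindings API used by `example_basic.py` in this dev version, where `SamplingConfig`'s `beam_width` field is the relevant knob (treat exact names and signatures as assumptions, and substitute a real engine directory):

```python
# Sketch of a modified example_basic.py that requests beam width 4.
import tensorrt_llm.bindings.executor as trtllm

# "<engine_dir>" is a placeholder for the Falcon 7B engine built above.
executor = trtllm.Executor(
    "<engine_dir>", trtllm.ModelType.DECODER_ONLY, trtllm.ExecutorConfig(1))

# Request with beam width 4, matching the engine's build-time limit.
request = trtllm.Request(
    input_token_ids=[1, 2, 3, 4],
    max_new_tokens=10,
    sampling_config=trtllm.SamplingConfig(beam_width=4))
request_id = executor.enqueue_request(request)
```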
The expected outcome is for the request above to be executed by the engine, and 4 generation beams to be produced. This is what happens when using the `examples/run.py` script, based on the Python `ModelRunner`, which can process requests with any number of beams.

Actual behavior

The request fails with the following error, making it seem like the Executor backend is not able to detect the maximum beam width that the engine was built for:
Additional notes
This bug was also present in other dev versions of TensorRT-LLM 0.10 prior to `0.10.0.dev2024050700`. For those, the behavior was slightly different: the Executor API would allow only requests using a beam width equal to the maximum that the engine was built for, failing with the same assertion as above otherwise.

Our general expectation would be to build a single engine with a given maximum beam width, and then be able to process requests with any number of beams, from 1 (no beam search) up to this value. This also seems to be the Python `ModelRunner`'s behavior.