Closed: thefacetakt closed this issue 1 week ago
@symphonylyh Could you please take a look? Thanks
Seems like getMaxInputLen() returns the value set with the --max_input_len argument of trtllm-build, decreased by 1, when --context_fmha enable is used. A workaround is to manually decrease max_input_len in config.json after the engine is built, but the actual fix would be nice :)
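For anyone hitting this before the fix lands: a minimal sketch of that config.json workaround. The engine path and the build_config.max_input_len key layout are assumptions based on recent TensorRT-LLM engine configs, not copied from this setup, so check your own file first.

# Hypothetical engine path; adjust to your own directory layout.
CONFIG=tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder/config.json

python3 - "$CONFIG" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)

# Assumed key layout; verify against your config.json before editing.
cfg["build_config"]["max_input_len"] -= 1

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF

This keeps the limit the runner reads one below the build-time value, matching what getMaxInputLen() actually enforces when fmha is enabled.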
Hi @thefacetakt, this should have been fixed in the latest main branch. We have removed the condition check for enc-dec and updated the README for BART.
Closing for now. Feel free to reopen if it doesn't work. Your other issue will be investigated separately.
Thanks, seems to be working now!
System Info
TensorRT-LLM commit: 2a115dae84f13daaa54727534daa837c534eceb4
TensorRT-LLM version: 0.11.0.dev2024061800
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Following the official example (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) for bart-large-cnn.

If --context_fmha disable is specified while building the encoder, everything works as expected. But if --context_fmha disable is omitted or --context_fmha enable is specified (which is effectively the same thing, since enable is the default), then running the model with

python3 ../run.py --engine_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION} --tokenizer_dir tmp/hf_models/${MODEL_NAME} --max_output_len 64 --input_text "translate English to German: The house is wonderful."

fails with a cryptic assert.
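For reference, --context_fmha is an engine-build flag, not a run-time one; a sketch of the two encoder builds follows, with paths and other flags assumed from the enc_dec README rather than copied verbatim:

# Works: fused context attention explicitly disabled.
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --output_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --context_fmha disable

# Triggers the assert: enabling the flag (or omitting it, since enable
# is the default) and then running run.py as above.
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --output_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --context_fmha enable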
Expected behavior
context_fmha=enable for bart-large-cnn works
Actual behavior
context_fmha=enable for bart-large-cnn results in an assertion failure.
Additional notes
I understand that 0.11.0.dev is not a stable version of TensorRT-LLM, but hopefully this will be fixed in a stable release (or sooner).