Closed: thefacetakt closed this issue 1 week ago
@symphonylyh Could you please take a look? Thanks
Seems like getMaxInputLen() returns the value set with the --max_input_len argument of trtllm-build, decreased by 1, when --context_fmha enable is used. A workaround is to manually decrease max_input_len in config.json after the engine is built, but the actual fix would be nice :)
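For anyone hitting this before the fix lands: a minimal sketch of that config.json workaround. The engine path and the build_config.max_input_len key layout are assumptions based on recent TensorRT-LLM engine configs, not copied from this setup, so check your own file first.

# Hypothetical engine path; adjust to your own directory layout.
CONFIG=tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder/config.json

python3 - "$CONFIG" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)

# Assumed key layout; verify against your config.json before editing.
cfg["build_config"]["max_input_len"] -= 1

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF

This keeps the limit the runner reads one below the build-time value, matching what getMaxInputLen() actually enforces when fmha is enabled.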
Hi @thefacetakt, this should have been fixed in the latest main branch. We have removed the condition check for enc-dec and updated the README for BART.
Closing for now. Feel free to reopen if it doesn't work. Your other issue will be investigated separately.
Thanks, seems to be working now!
System Info
TensorRT-LLM commit: 2a115dae84f13daaa54727534daa837c534eceb4
TensorRT-LLM version: 0.11.0.dev2024061800
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Following the official example (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) for bart-large-cnn.

If --context_fmha disable is specified while building the encoder, everything works as expected. But if --context_fmha disable is omitted or --context_fmha enable is specified (which is effectively the same thing, since enable is the default), then running the model with

python3 ../run.py --engine_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION} --tokenizer_dir tmp/hf_models/${MODEL_NAME} --max_output_len 64 --input_text "translate English to German: The house is wonderful."

fails with a cryptic assert.
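For reference, --context_fmha is an engine-build flag, not a run-time one; a sketch of the two encoder builds follows, with paths and other flags assumed from the enc_dec README rather than copied verbatim:

# Works: fused context attention explicitly disabled.
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --output_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --context_fmha disable

# Triggers the assert: enabling the flag (or omitting it, since enable
# is the default) and then running run.py as above.
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --output_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
             --context_fmha enable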
Expected behavior
context_fmha=enable for bart-large-cnn works
Actual behavior
context_fmha=enable for bart-large-cnn results in an assertion failure.
Additional notes
I understand that 0.11.0.dev is not a stable version of TensorRT-LLM, but hopefully this will be fixed in a stable release (or sooner).