Closed. DreamGenX closed this issue 3 months ago.
@kaiyux @QiJune could you comment on this issue?
@DreamGenX Supported in https://github.com/NVIDIA/TensorRT-LLM/blob/2a115dae84f13daaa54727534daa837c534eceb4/tensorrt_llm/commands/build.py#L81; please use the latest main branch.
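For example, assuming the option referenced above is the --max_seq_len flag that bounds the combined input and output length (an assumption, not confirmed by the linked line), a build on a recent main branch might look roughly like this, with placeholder paths:

```
# Hypothetical trtllm-build invocation; checkpoint/output paths and batch size
# are placeholders. --max_seq_len caps input + output tokens together, so any
# split that sums to 8192 should be accepted at request time.
trtllm-build \
  --checkpoint_dir ./llama3_70b_tllm_ckpt \
  --output_dir ./llama3_70b_engine \
  --max_batch_size 8 \
  --max_seq_len 8192
```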
System Info
Hello, I am building a Llama 3 70B engine. If I do not specify --max_input_len and --max_output_len, then requests are capped at 1024 tokens for some reason. Ideally I want the input and output lengths to be flexible, allowing any combination that adds up to 8192 (the native context length of the model). Is there a way to do this?
Right now I set the following when using trtllm-build in order to allow both long inputs and long outputs, but I am worried it might have side effects (e.g. does it somehow affect RoPE?):
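A hypothetical invocation of that kind (paths, batch size, and the exact flag values here are assumptions) might look like:

```
# Illustrative only: both limits raised to the model's native context length
# so that long inputs and long outputs are both possible.
trtllm-build \
  --checkpoint_dir ./llama3_70b_tllm_ckpt \
  --output_dir ./llama3_70b_engine \
  --max_batch_size 8 \
  --max_input_len 8192 \
  --max_output_len 8192
```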
This generated the following engine config:
But when I load the model, I see this in the logs:
I am not sure what this is based on. It's possible that it's harmless...
Who can help?
@juney-nvidia
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
It's a documentation question.
Expected behavior
N/A
Actual behavior
N/A
Additional notes
N/A