NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

v0.8.0 tag trtllm-build does not accept max_draft_len arg #1290

ydm-amazon commented 3 months ago

System Info

TensorRT-LLM v0.8.0 branch https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/tensorrt_llm/commands/build.py versus main branch https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/commands/build.py

Who can help?

@ncomly

Reproduction

The max_draft_len parameter is required when building a model for speculative decoding. On the main branch, the argument is registered at lines 122 to 128 of build.py:

parser.add_argument(
    '--max_draft_len',
    type=int,
    default=0,
    help=
    'Maximum lengths of draft tokens for speculative decoding target model.'
)

However, support for max_draft_len is absent from the version tagged v0.8.0; I am not sure whether it was accidentally left out of the v0.8.0 release. Could it be added to the v0.8.0 tag?
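
For anyone hitting this, the failure mode is simply argparse rejecting an unregistered flag. The following is a minimal standalone sketch (a toy stand-in parser for illustration, not the real build.py) that reproduces what passing --max_draft_len to the v0.8.0 trtllm-build looks like:

import argparse

# Toy stand-in for the v0.8.0 trtllm-build parser, which never
# registers --max_draft_len.
parser = argparse.ArgumentParser(prog='trtllm-build')
parser.add_argument('--max_batch_size', type=int, default=1)

try:
    # argparse exits with "error: unrecognized arguments: --max_draft_len 5"
    parser.parse_args(['--max_draft_len', '5'])
except SystemExit:
    print('unrecognized argument, as on the v0.8.0 tag')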

Expected behavior

See above

Actual behavior

See above

Additional notes

N/A

dongxuy04 commented 3 months ago

The max_draft_len parameter was added to the main branch during the code freeze period for the v0.8.0 release, so it is not part of v0.8.0. If you need it, you can use the main branch or wait for the upcoming v0.9.0 release.
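
If a script has to work across both versions, one option is to gate on the installed package version before passing the flag. A rough sketch, assuming tensorrt_llm.__version__ reports the installed release and the packaging module is available:

import tensorrt_llm
from packaging.version import Version

# max_draft_len landed on main after the v0.8.0 tag, so refuse the
# flag on v0.8.0 and older installs (dev builds of main carry a
# higher version string and pass this check).
if Version(tensorrt_llm.__version__) <= Version('0.8.0'):
    raise RuntimeError(
        'this trtllm-build lacks --max_draft_len; '
        'build from the main branch or wait for v0.9.0')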

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or it will be closed in 15 days.