NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
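
A minimal sketch of the Python API described above, assuming the LLM/SamplingParams entry points of recent releases (not necessarily v0.11.0, which this issue uses; exact module paths and argument names vary by version, and the Qwen checkpoint name is only an example):

    # Hedged sketch of the high-level TensorRT-LLM Python API; the imports
    # and argument names below follow recent releases and may differ in
    # older versions such as 0.11.0.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen-14B-Chat")   # builds a TensorRT engine
    params = SamplingParams(max_tokens=32)
    outputs = llm.generate(["Hello, my name is"], params)
    print(outputs[0].outputs[0].text)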

Accuracy Problem: Qwen speculative decoding, different output for num_draft_tokens=2 and num_draft_tokens=5 #2208

Closed · jasica528 closed 1 month ago

jasica528 commented 1 month ago

System Info

[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Driver Version: 470.199.02
CUDA Version: 12.4
GPU: A800 (1 GPU for the qwen-14b-chat target model, 1 GPU for the qwen-0.5b-chat draft model)

Who can help?

@kaiyux @bloodeagle40234 @Pzzzzz5142 @pathorn

Reproduction

  1. Follow the TensorRT-LLM speculative decoding tutorial: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.11.0/docs/source/speculative_decoding.md
  2. Send requests with --num-draft-tokens=2 and --num-draft-tokens=5, respectively (one invocation is shown below; repeat with the other value):
    python3 tools/inflight_batcher_llm/speculative_decoding_test.py \
    --max-input-len 2048 \
    --dataset=input_data.json \
    --url-target=localhost:8001 \
    --url-draft=localhost:8001 \
    --draft-tensorrt-llm-model-name="${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
    --target-tensorrt-llm-model-name="${TENSORRT_LLM_MODEL_NAME}" \
    --bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \
    --execute-bls-speculative-decoding \
    --disable-output-comparison \
    --num-draft-tokens=2 \
    --verbose

Expected behavior

When num-draft-tokens is 2 or 5, the target model should return the same response. For example, in ${TRITON_REPO}/tensorrt_llm_bls/1/lib/decode.py, in the _spec_generate() function, the variable cur_preproc should match what a single target model (without speculation) would produce. It does when num-draft-tokens = 2, but I got a different cur_preproc when num-draft-tokens = 5.
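
For reference, a simplified, self-contained sketch of the per-iteration logic, loosely modeled on _spec_generate(); the callables draft_model and target_model are illustrative stand-ins for the Triton model clients, and the real decode.py also handles batching, logits, and stop criteria:

    # Toy sketch of one speculative-decoding iteration. draft_model and
    # target_model are greedy next-token callables supplied by the caller,
    # not the actual Triton clients used in decode.py.
    def spec_step(cur_preproc, num_draft_tokens, draft_model, target_model):
        # 1. The draft model proposes num_draft_tokens tokens autoregressively.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_model(cur_preproc + draft))

        # 2. The target model verifies the draft: it keeps the longest prefix
        #    matching its own greedy choices, then emits one token of its own,
        #    so the sequence grows by accepted + 1 tokens each iteration.
        accepted = []
        for tok in draft:
            if tok != target_model(cur_preproc + accepted):
                break
            accepted.append(tok)
        bonus = target_model(cur_preproc + accepted)

        # 3. The grown sequence becomes cur_preproc for the next iteration.
        return cur_preproc + accepted + [bonus]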

Actual behavior

The intermediate output with --num-draft-tokens=2:

cur_preproc = [[ ......, 99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815,    788,    330]]
draft_request =  DraftRequest(draft_input_ids=array([[100659]], dtype=int32), draft_logits=None)

The target model's accepted response, which is also the value of cur_preproc in the next iteration:

[[.......,  99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815,    788,    330,  18493]]

The intermediate output with --num-draft-tokens=5:

cur_preproc = [[......, 99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815]]
draft_request =  DraftRequest(draft_input_ids=array([[   788,    330, 100659,  18493,  99321]], dtype=int32), draft_logits=None)

The target model's accepted response:

[[......, 99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815,    788,    330, 100659,  18493,  99321,
         99459]]

As we can see, when --num-draft-tokens=5 the target model accepts an extra token, 100659. Can anyone explain this difference?

Additional notes

Not all requests show different results between num-draft-tokens = 2 and num-draft-tokens = 5.

Funatiq commented 1 month ago

What I can see from your logs: this is the expected behavior. The target model will always produce one more token than the number of accepted draft tokens.
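
A quick check of this "+1" rule against the token IDs in the logs above, using only values that appear there:

    def count_accepted(draft, new_tokens):
        # Longest prefix of the draft that the target reproduced verbatim.
        accepted = 0
        for d, t in zip(draft, new_tokens):
            if d != t:
                break
            accepted += 1
        return accepted

    # num-draft-tokens=2 run: draft token 100659 was rejected and the target
    # emitted 18493 itself -> 0 accepted + 1 = 1 new token.
    assert count_accepted([100659], [18493]) + 1 == 1

    # num-draft-tokens=5 run: all five draft tokens were accepted, plus the
    # target's own 99459 -> 5 accepted + 1 = 6 new tokens.
    draft = [788, 330, 100659, 18493, 99321]
    new = [788, 330, 100659, 18493, 99321, 99459]
    assert count_accepted(draft, new) + 1 == len(new)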