NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
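
A minimal sketch of the Python API described above, assuming the LLM/SamplingParams entry points of recent releases (not necessarily v0.11.0, which this issue uses; exact module paths and argument names vary by version, and the Qwen checkpoint name is only an example):

    # Hedged sketch of the high-level TensorRT-LLM Python API; the imports
    # and argument names below follow recent releases and may differ in
    # older versions such as 0.11.0.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen-14B-Chat")   # builds a TensorRT engine
    params = SamplingParams(max_tokens=32)
    outputs = llm.generate(["Hello, my name is"], params)
    print(outputs[0].outputs[0].text)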

Accuracy Problem: Qwen speculative decoding, different output for num_draft_tokens=2 and num_draft_tokens=5 #2208

Closed · jasica528 closed 1 month ago

jasica528 commented 1 month ago

System Info

[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Driver Version: 470.199.02
CUDA Version: 12.4
GPU: A800 (1 GPU for the qwen-14b-chat target model, 1 GPU for the qwen-0.5b-chat draft model)

Who can help?

@kaiyux @bloodeagle40234 @Pzzzzz5142 @pathorn

Reproduction

  1. Follow the TensorRT-LLM speculative decoding tutorial: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.11.0/docs/source/speculative_decoding.md
  2. Send requests with --num-draft-tokens=2 and --num-draft-tokens=5, respectively (one invocation is shown below; repeat with the other value):
    python3 tools/inflight_batcher_llm/speculative_decoding_test.py \
    --max-input-len 2048 \
    --dataset=input_data.json \
    --url-target=localhost:8001 \
    --url-draft=localhost:8001 \
    --draft-tensorrt-llm-model-name="${TENSORRT_LLM_DRAFT_MODEL_NAME}" \
    --target-tensorrt-llm-model-name="${TENSORRT_LLM_MODEL_NAME}" \
    --bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \
    --execute-bls-speculative-decoding \
    --disable-output-comparison \
    --num-draft-tokens=2 \
    --verbose

Expected behavior

When num-draft-tokens is 2 or 5, the target model should return the same response. For example, in ${TRITON_REPO}/tensorrt_llm_bls/1/lib/decode.py, in the _spec_generate() function, the variable cur_preproc should match what a single target model (without speculation) would produce. It does when num-draft-tokens = 2, but I got a different cur_preproc when num-draft-tokens = 5.
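
For reference, a simplified, self-contained sketch of the per-iteration logic, loosely modeled on _spec_generate(); the callables draft_model and target_model are illustrative stand-ins for the Triton model clients, and the real decode.py also handles batching, logits, and stop criteria:

    # Toy sketch of one speculative-decoding iteration. draft_model and
    # target_model are greedy next-token callables supplied by the caller,
    # not the actual Triton clients used in decode.py.
    def spec_step(cur_preproc, num_draft_tokens, draft_model, target_model):
        # 1. The draft model proposes num_draft_tokens tokens autoregressively.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_model(cur_preproc + draft))

        # 2. The target model verifies the draft: it keeps the longest prefix
        #    matching its own greedy choices, then emits one token of its own,
        #    so the sequence grows by accepted + 1 tokens each iteration.
        accepted = []
        for tok in draft:
            if tok != target_model(cur_preproc + accepted):
                break
            accepted.append(tok)
        bonus = target_model(cur_preproc + accepted)

        # 3. The grown sequence becomes cur_preproc for the next iteration.
        return cur_preproc + accepted + [bonus]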

Actual behavior

The intermediate output with --num-draft-tokens=2:

cur_preproc = [[ ......, 99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815,    788,    330]]
draft_request =  DraftRequest(draft_input_ids=array([[100659]], dtype=int32), draft_logits=None)

The target model's accepted response, which is also the value of cur_preproc in the next iteration:

[[.......,  99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815,    788,    330,  18493]]

The intermediate output with --num-draft-tokens=5:

cur_preproc = [[......, 99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815]]
draft_request =  DraftRequest(draft_input_ids=array([[   788,    330, 100659,  18493,  99321]], dtype=int32), draft_logits=None)

The target model's accepted response:

[[......, 99314,    698,  66017,     25,  14687, 151645,    198, 151644,
         77091,    198,   4913,  27369,    788,    330, 103821, 101151,
           497,    330,  43815,    788,    330, 100659,  18493,  99321,
         99459]]

As we can see, when --num-draft-tokens=5 the target model accepts an extra token, 100659. Can anyone explain this difference?

Additional notes

Not all requests show different results between num-draft-tokens = 2 and num-draft-tokens = 5.

Funatiq commented 1 month ago

What I can see from your logs: this is the expected behavior. The target model will always produce one more token than the number of accepted draft tokens.
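
A quick check of this "+1" rule against the token IDs in the logs above, using only values that appear there:

    def count_accepted(draft, new_tokens):
        # Longest prefix of the draft that the target reproduced verbatim.
        accepted = 0
        for d, t in zip(draft, new_tokens):
            if d != t:
                break
            accepted += 1
        return accepted

    # num-draft-tokens=2 run: draft token 100659 was rejected and the target
    # emitted 18493 itself -> 0 accepted + 1 = 1 new token.
    assert count_accepted([100659], [18493]) + 1 == 1

    # num-draft-tokens=5 run: all five draft tokens were accepted, plus the
    # target's own 99459 -> 5 accepted + 1 = 6 new tokens.
    draft = [788, 330, 100659, 18493, 99321]
    new = [788, 330, 100659, 18493, 99321, 99459]
    assert count_accepted(draft, new) + 1 == len(new)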