Closed: jasica528 closed this issue 1 month ago.
What I can see from your logs:

draft_input_ids=array([[100659]], dtype=int32)

Here the draft is rejected; the target model produces a different token: 18493.

draft_input_ids=array([[ 788, 330, 100659, 18493, 99321]])

Here all draft tokens are accepted, and the target model produces one additional token: 99459.

This is the expected behavior. The target model will always produce one more token than the number of accepted draft tokens.
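The acceptance rule described above (keep the matching draft prefix, then append one extra target token) can be sketched as below. This is an illustrative sketch of greedy-match verification, not TensorRT-LLM's actual code, which may also use probabilistic rejection sampling for non-greedy decoding. The token ids come from the logs quoted above; the trailing 42 in the first target array is a made-up placeholder for the unused second target prediction.

```python
import numpy as np

def accept_draft_tokens(draft_tokens: np.ndarray,
                        target_tokens: np.ndarray) -> np.ndarray:
    """Greedy acceptance: keep the longest prefix of draft tokens that
    matches the target model's predictions, then append one extra token
    from the target model (the "bonus" token).

    target_tokens has length len(draft_tokens) + 1, because the target
    model scores every draft position plus one step past the last one.
    """
    n_accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n_accepted += 1
    # The target model always contributes one token past the accepted prefix,
    # which is why the output is one longer than the accepted count.
    return target_tokens[:n_accepted + 1]

# Case 1 from the logs: the single draft token 100659 is rejected,
# so the output is just the target's own token, 18493.
print(accept_draft_tokens(np.array([100659]),
                          np.array([18493, 42])))

# Case 2 from the logs: all five draft tokens match, so the output is
# the five accepted tokens plus the bonus token 99459.
draft = np.array([788, 330, 100659, 18493, 99321])
target = np.array([788, 330, 100659, 18493, 99321, 99459])
print(accept_draft_tokens(draft, target))
```

In both cases the number of returned tokens is the accepted count plus one, matching the behavior described above.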
System Info
TensorRT-LLM version: 0.11.0
Driver Version: 470.199.02
CUDA Version: 12.4
GPU: A800 (1 GPU for the qwen-14b-chat target model, 1 GPU for the qwen-0.5b-chat draft model)
Who can help?
@kaiyux @bloodeagle40234 @Pzzzzz5142 @pathorn
Information
Tasks
Reproduction
Expected behavior
With num-draft-tokens = 2 or 5, the target model should return the same response. For example, in the _spec_generate() function in ${TRITON_REPO}/tensorrt_llm_bls/1/lib/decode.py, the variable cur_preproc should match what a single target model (without a draft model) would produce. But I got a different cur_preproc when num-draft-tokens = 5, while with num-draft-tokens = 2, cur_preproc matches the single target model.
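To make the expectation concrete, here is a hypothetical sketch of the control flow around cur_preproc (the names and toy models below are illustrative, not the real TensorRT-LLM BLS code): each round, the accepted tokens, bonus token included, are appended to the running sequence and become the next round's input, so with greedy verification the final sequence should not depend on num-draft-tokens.

```python
def spec_generate_sketch(prompt_ids, propose, verify,
                         num_draft_tokens, max_new_tokens):
    """propose(ids, n) returns up to n draft tokens; verify(ids, draft)
    returns the accepted draft prefix plus one target-model token.
    Both callables are placeholders for the draft and target models."""
    cur_preproc = list(prompt_ids)
    while len(cur_preproc) - len(prompt_ids) < max_new_tokens:
        draft = propose(cur_preproc, num_draft_tokens)
        accepted = verify(cur_preproc, draft)
        cur_preproc.extend(accepted)  # fed back as the next round's input
    return cur_preproc[:len(prompt_ids) + max_new_tokens]

# Toy deterministic "models": the target always continues TRUE_SEQ; the
# draft guesses wrong at every third position, forcing partial rejections.
TRUE_SEQ = [788, 330, 100659, 18493, 99321, 99459, 11, 220]

def toy_propose(ids, n):
    pos = len(ids)
    return [0 if p % 3 == 0 else TRUE_SEQ[p]
            for p in range(pos, min(pos + n, len(TRUE_SEQ)))]

def toy_verify(ids, draft):
    pos = len(ids)
    n_ok = 0
    for i, d in enumerate(draft):
        if d != TRUE_SEQ[pos + i]:
            break
        n_ok += 1
    # accepted prefix plus one bonus token from the target model
    return TRUE_SEQ[pos:min(pos + n_ok + 1, len(TRUE_SEQ))]

out2 = spec_generate_sketch([788], toy_propose, toy_verify, 2, 5)
out5 = spec_generate_sketch([788], toy_propose, toy_verify, 5, 5)
print(out2 == out5)  # True: the result is independent of num_draft_tokens
```

Under these assumptions the two runs take different numbers of rounds but converge on the same token sequence, which is the behavior the report expects from the real pipeline.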
Actual behavior
The intermediate output with --num-draft-tokens=2:
Tokens accepted by the target model, which are also the value of cur_preproc in the next iteration:
The intermediate output with --num-draft-tokens=5:
Tokens accepted by the target model:
As shown above, with --num-draft-tokens=5 the target model accepts an extra token, 100659. Can anyone explain this difference?
Additional notes
Not all requests show different results between num-draft-tokens = 2 and num-draft-tokens = 5.