Open akhoroshev opened 1 week ago
It would be great if you called the logits post processor for a request only when isLastContextChunk() || isGenerationInProgressState() holds.
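A minimal sketch of the guard I have in mind, with everything except the two predicates named above being illustrative placeholders rather than actual TensorRT-LLM code:

```cpp
#include <functional>

// Illustrative stand-in for the runtime's request object; only the two
// predicate names are taken from the suggestion above.
struct Request
{
    bool lastContextChunk = false;
    bool generationInProgress = false;

    bool isLastContextChunk() const { return lastContextChunk; }
    bool isGenerationInProgressState() const { return generationInProgress; }
};

// Hypothetical dispatch point: run the user callback only when the request
// has reached its last context chunk or is already in the generation phase,
// so intermediate context chunks never trigger it.
void maybeInvokeLogitsPostProcessor(Request const& req, std::function<void(Request const&)> const& callback)
{
    if (req.isLastContextChunk() || req.isGenerationInProgressState())
    {
        callback(req);
    }
}
```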
@akhoroshev thanks for pointing this out.
We will make the change to invoke the logits post processor only for the last context chunk.
version
When I build the model with paged_context_fmha = true and max_num_tokens = 4096, chunked context is enabled. I see that the Executor calls batch_logit_processor more than once for the first token.
To demonstrate this, I print the number of tokens in the callback (FusedLogitsProcessor::process is my implementation of the callback).
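For reference, a stripped-down sketch of the kind of counting and printing I do there; the types and the helper below are illustrative placeholders, not the real executor callback signature or my actual FusedLogitsProcessor::process:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

using RequestId = std::uint64_t;

// Illustrative per-request counter: bump it on every logits callback
// invocation so repeated first-token calls caused by context chunking
// show up in the log.
std::unordered_map<RequestId, int> callbackCalls;

void onLogits(RequestId reqId, std::size_t numInputTokens)
{
    int const calls = ++callbackCalls[reqId];
    std::printf("request %llu: call #%d, input_context_size: %zu\n",
        static_cast<unsigned long long>(reqId), calls, numInputTokens);
}
```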
I send requests with different input sizes and set maxTokens to 3.
input_context_size: 18810
input_context_size: 15014
input_context_size: 12585
input_context_size: 8176
You can see that the first-token logits callback is invoked ceil(input_context_size / max_num_tokens) times. In fact, the logits from the first ceil(input_context_size / max_num_tokens) - 1 of those calls are ignored (the sampling layers are not called), and the Executor returns exactly 3 tokens, as expected. But it is very strange to run a logits processor on such "garbage" logits.
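To make the count concrete, here is a small standalone check of the formula against the input sizes above, assuming max_num_tokens = 4096 as in my build:

```cpp
#include <cstdio>

int main()
{
    int const maxNumTokens = 4096;
    int const inputSizes[] = {18810, 15014, 12585, 8176};

    for (int const size : inputSizes)
    {
        // Number of context chunks = number of first-token callback calls.
        int const chunks = (size + maxNumTokens - 1) / maxNumTokens;
        std::printf("input_context_size %5d -> %d callback calls for the first token\n", size, chunks);
    }
    return 0;
}
```

This prints 5, 4, 4, and 2 calls for the four requests, i.e. one callback invocation per context chunk even though only the last one precedes actual sampling.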