[Open] ttim opened this issue 3 weeks ago
And a partially related question: why is there a `spec_decoding_generation_lengths` argument if a `sequence_lengths` argument already exists? Aren't they supposed to be the same?
Also, if I try to put generation requests of different lengths into the input tensor, it fails with `Assertion failed: seq_len should be same for all generation requests`. How do I utilize this feature?
@ttim Please find my responses below:
> What are the performance implications of using different lengths? Would it affect the performance of the attention kernel? Would the attention kernel in this case match the performance of the case where all generation lengths are equal to the maximum generation length?
This variable was added for the Recurrent Drafter functionality, which inherently can have different generation lengths for different sequences in the batch. In terms of performance, it is currently only used with `remove_input_padding`, so all the generation tokens are packed together; the performance of attention therefore depends on how many total tokens are being processed. The other side of this is the attention mask, which is padded to the max generation length but defaults to `False` in the padded positions to avoid useless computation.
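A minimal sketch of that padding idea, assuming hypothetical per-sequence lengths (not the actual TensorRT-LLM mask layout):

```python
import torch

# Hypothetical generation lengths for a batch of 3 sequences
# (1 true token + a varying number of draft tokens each).
spec_decoding_generation_lengths = torch.tensor([3, 5, 2])
max_gen_len = int(spec_decoding_generation_lengths.max())  # 5

# Mask padded to the max generation length; positions past each
# sequence's real length stay False, so they add no useless work.
positions = torch.arange(max_gen_len)
mask = positions.unsqueeze(0) < spec_decoding_generation_lengths.unsqueeze(1)
# tensor([[ True,  True,  True, False, False],
#         [ True,  True,  True,  True,  True],
#         [ True,  True, False, False, False]])
```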
> Why is there a `spec_decoding_generation_lengths` argument if a `sequence_lengths` argument already exists? Aren't they supposed to be the same?
`spec_decoding_generation_lengths` is the number of tokens (1 true token + 1 or more draft tokens) in each generation step, while `sequence_lengths` should be `past_kv_lengths + spec_decoding_generation_lengths`.
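A quick sketch of that relationship with made-up numbers (the tensor values here are hypothetical, chosen only to illustrate the sum):

```python
import torch

# Hypothetical state for a batch of 3 generation requests.
past_kv_lengths = torch.tensor([100, 42, 7])                # tokens already in the KV cache
spec_decoding_generation_lengths = torch.tensor([4, 4, 4])  # 1 true token + 3 draft tokens

# sequence_lengths spans everything attention sees for each request:
sequence_lengths = past_kv_lengths + spec_decoding_generation_lengths
# tensor([104,  46,  11])
```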
System Info
H100
Who can help?
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
No repro, since it's a question
Expected behavior
No expected behavior
actual behavior
No actual behavior
additional notes
I found the `spec_decoding_is_generation_length_variable` argument in the attention API at https://github.com/NVIDIA/TensorRT-LLM/blob/32ed92e4491baf2d54682a21d247e1948cca996e/tensorrt_llm/functional.py#L4750. What are the performance implications of using different lengths? Would it affect the performance of the attention kernel? Would the attention kernel in this case match the performance of the case where all generation lengths are equal to the maximum generation length? Thank you!