NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

What are the performance implications of using `spec_decoding_is_generation_length_variable`? #2137

Open ttim opened 3 weeks ago

ttim commented 3 weeks ago

System Info

H100

Who can help?

@byshiue

Reproduction

No repro, since it's a question

Expected behavior

No expected behavior

Actual behavior

No actual behavior

Additional notes

I found the `spec_decoding_is_generation_length_variable` argument in the attention API at https://github.com/NVIDIA/TensorRT-LLM/blob/32ed92e4491baf2d54682a21d247e1948cca996e/tensorrt_llm/functional.py#L4750. What are the performance implications of using different generation lengths? Would it affect the performance of the attention kernel? Would the attention kernel in this case be equal in performance to the case where all generation lengths are set to the maximum generation length? Thank you!

ttim commented 3 weeks ago

And a partially related question: why is there a `spec_decoding_generation_lengths` argument when a `sequence_lengths` argument already exists? Aren't they supposed to be the same?

ttim commented 3 weeks ago

Also, if I try to put generation requests of different lengths into the input tensor, it fails with `Assertion failed: seq_len should be same for all generation requests`. How do I use this feature?

rakib-hasan commented 1 week ago

@ttim Please find my responses below:

What are the performance implications of using different generation lengths? Would it affect the performance of the attention kernel? Would the attention kernel in this case be equal in performance to the case where all generation lengths are set to the maximum generation length?

This variable was added to support Recurrent Drafter, which inherently can have different generation lengths for different sequences in the batch. In terms of performance, it is currently only used with `remove_input_padding`, so all the generation tokens are packed together, and the cost of attention depends on the total number of tokens being processed. The other piece is the spec-decoding attention mask, which is padded to the max generation length but defaults to False in the padded positions to avoid useless computation.
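
To make that concrete, here is a minimal sketch in plain NumPy (not TensorRT-LLM code; the array names mirror the API arguments, but the shapes and mask layout are illustrative assumptions):

```python
import numpy as np

# Hypothetical batch: 3 sequences with different numbers of tokens
# (1 true token + some draft tokens) in the current generation step.
spec_decoding_generation_lengths = np.array([4, 2, 3])

# With remove_input_padding, the generation tokens of all sequences are packed
# back-to-back, so attention work scales with the total token count rather
# than with batch_size * max_generation_length.
total_tokens = int(spec_decoding_generation_lengths.sum())          # 9
padded_tokens = (len(spec_decoding_generation_lengths)
                 * int(spec_decoding_generation_lengths.max()))     # 12

# The spec-decoding mask is still padded to the max length per sequence,
# but the padded positions stay False, so they add no useful computation.
max_len = int(spec_decoding_generation_lengths.max())
mask = np.zeros((len(spec_decoding_generation_lengths), max_len, max_len),
                dtype=bool)
for b, n in enumerate(spec_decoding_generation_lengths):
    # A simple causal mask over the real tokens; a real draft tree would
    # use a tree-structured mask instead.
    mask[b, :n, :n] = np.tril(np.ones((n, n), dtype=bool))

print(total_tokens, padded_tokens)  # 9 vs. 12 tokens' worth of work
```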

Why is there a `spec_decoding_generation_lengths` argument when a `sequence_lengths` argument already exists? Aren't they supposed to be the same?

`spec_decoding_generation_lengths` is the number of tokens (1 true token + 1 or more draft tokens) in each generation step. `sequence_lengths` should be `past_kv_lengths + spec_decoding_generation_lengths`.
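
In other words (a toy example with made-up numbers, just to show the relationship):

```python
import numpy as np

# Hypothetical per-sequence values for one generation step.
past_kv_lengths = np.array([128, 64, 200])               # tokens already in the KV cache
spec_decoding_generation_lengths = np.array([4, 2, 3])   # 1 true token + draft tokens

# Per the explanation above, sequence_lengths is just the sum of the two.
sequence_lengths = past_kv_lengths + spec_decoding_generation_lengths
print(sequence_lengths)  # [132  66 203]
```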