NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

FA V2 Nonusage during Decode/Generation Phase #2438

Open usajid14 opened 1 week ago

usajid14 commented 1 week ago

Hi,

Is there a specific reason why FlashAttention v2 is used during the prefill phase but not during the generation phase? Is it because FlashAttention does not yield a significant performance gain during the decode phase? Thank you!

hello-11 commented 1 week ago

@usajid14 That's correct.
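The shape asymmetry behind this answer can be illustrated with a plain NumPy sketch (editorial illustration, not TensorRT-LLM code; the `attention` helper and all sizes below are made up for the example). During prefill, every prompt token attends to the whole prompt, so the score matrix is `seq_len x seq_len` and the step is compute-bound; FlashAttention v2's tiling over the query dimension avoids materializing that quadratic intermediate. During decode there is only one new query token per step, so the "matrix" collapses to a single row over the KV cache, the step is memory-bound, and there is little along the query dimension for FA2 to tile or parallelize.

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention (no FlashAttention tiling)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (q_len, kv_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v                               # (q_len, head_dim)

rng = np.random.default_rng(0)
seq_len, head_dim = 1024, 64

# Prefill: q_len == kv_len == seq_len, so the intermediate score
# matrix is seq_len x seq_len -- the quadratic buffer FA2 tiles away.
q_prefill = rng.standard_normal((seq_len, head_dim))
k = rng.standard_normal((seq_len, head_dim))
v = rng.standard_normal((seq_len, head_dim))
out_prefill = attention(q_prefill, k, v)

# Decode: one new token attends to the cached K/V. q_len == 1, so the
# step is essentially a memory-bound GEMV over the KV cache, leaving
# little work along the query dimension for FA2 to parallelize.
q_decode = rng.standard_normal((1, head_dim))
out_decode = attention(q_decode, k, v)

print(out_prefill.shape)  # (1024, 64)
print(out_decode.shape)   # (1, 64)
```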