Hi,
Is there a specific reason why FlashAttention-2 (FA V2) is used during the prefill phase but not during the generation phase? Is it because FlashAttention does not yield any significant performance gain during the decode phase? Thank you!
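
To make the distinction I'm asking about concrete, here is a minimal PyTorch sketch of the two phases (tensor names and sizes are illustrative assumptions, not TensorRT-LLM internals). In prefill, the attention score matrix is `seq_len x seq_len`, which is the quadratic softmax workload that FlashAttention-style tiling targets; in decode, the query length is 1, so the scores collapse to a single row against the KV cache.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only; these are assumptions for the sketch,
# not how TensorRT-LLM lays out its attention kernels internally.
batch, heads, head_dim = 1, 8, 64
prompt_len = 1024

# Prefill (context) phase: every prompt token attends to the prompt.
# Q, K, and V all have sequence length == prompt_len, so the score
# matrix is [prompt_len x prompt_len] -- the workload FlashAttention-2
# tiles to avoid materializing the full matrix in GPU memory.
q = torch.randn(batch, heads, prompt_len, head_dim)
k = torch.randn(batch, heads, prompt_len, head_dim)
v = torch.randn(batch, heads, prompt_len, head_dim)
ctx_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Generation (decode) phase: one new query token attends to the KV cache.
# The score matrix collapses to [1 x (prompt_len + 1)], so the kernel is
# dominated by streaming the KV cache from memory rather than by the
# tiled softmax recomputation that FlashAttention-2 optimizes.
q_step = torch.randn(batch, heads, 1, head_dim)
k_cache = torch.randn(batch, heads, prompt_len + 1, head_dim)
v_cache = torch.randn(batch, heads, prompt_len + 1, head_dim)
gen_out = F.scaled_dot_product_attention(q_step, k_cache, v_cache)

print(ctx_out.shape)  # torch.Size([1, 8, 1024, 64])
print(gen_out.shape)  # torch.Size([1, 8, 1, 64])
```

My guess, which I'd like confirmed, is that with a query length of 1 the decode step is memory-bandwidth bound on the KV cache, so FA V2's tiling buys little there, but I don't know if that's the actual rationale in TensorRT-LLM.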