NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
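As a quick illustration of the Python API mentioned above, here is a minimal sketch using the high-level `LLM` entry point documented for recent TensorRT-LLM releases; the model name is only an example, and any supported checkpoint could be substituted.

```python
# Minimal sketch of TensorRT-LLM's high-level Python API (recent releases).
# The model name is an example; any supported HF checkpoint should work.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds/loads the engine
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["What does TensorRT-LLM do?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```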

Implement MInference for 10x faster prefill on proprietary FMHA kernel #1896

Closed. avianion closed this issue 2 weeks ago.

avianion commented 2 months ago

https://arxiv.org/pdf/2407.02490

According to this paper (MInference 1.0), up to 10x faster prefill can be achieved for long-context LLMs by using dynamic sparse attention mechanisms.
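For context, the paper attributes the speedup to dynamic sparse attention patterns such as "vertical-slash", where each query attends only to a few fixed key columns plus a few recent diagonals. The sketch below illustrates the masking idea in plain PyTorch; it is not MInference's kernel and not a TensorRT-LLM API, and all function names are hypothetical. A production kernel would skip the masked-out blocks entirely rather than computing dense scores, which is where the prefill speedup comes from.

```python
# Illustrative sketch of the "vertical-slash" sparse attention pattern from
# the MInference paper (arXiv:2407.02490). NOT the paper's kernel and NOT a
# TensorRT-LLM API; all names here are hypothetical.
import torch
import torch.nn.functional as F

def vertical_slash_mask(seq_len, vertical_idx, slash_offsets, device="cpu"):
    """Boolean [seq_len, seq_len] mask; True = keep the attention score.

    vertical_idx: key columns every query attends to (e.g. attention sinks).
    slash_offsets: diagonal offsets (0 = self, 1 = previous token, ...).
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool, device=device)
    mask[:, vertical_idx] = True                       # vertical lines
    rows = torch.arange(seq_len, device=device)
    for off in slash_offsets:                          # slash (diagonal) lines
        cols = rows - off
        valid = cols >= 0
        mask[rows[valid], cols[valid]] = True
    causal = torch.tril(torch.ones_like(mask))         # preserve causality
    return mask & causal

def sparse_prefill_attention(q, k, v, mask):
    """Dense reference of masked attention. A real sparse kernel would only
    compute the kept blocks instead of masking a dense score matrix."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

seq_len, d = 16, 8
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
mask = vertical_slash_mask(seq_len, vertical_idx=[0, 1], slash_offsets=range(4))
out = sparse_prefill_attention(q, k, v, mask)
print(out.shape)  # torch.Size([16, 8])
```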

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been stalled for 15 days with no activity.