Closed cnjsdfcy closed 11 months ago
Hi @cnjsdfcy, unfortunately paged-attention support is not on the tool's current to-do list. Since paged-attention is essentially a caching behavior, it is hard to model, or to make assumptions about, when estimating system latency or throughput. The tool aims to provide lower-bound performance from the model's point of view, so no assumptions are made about the serving system's workload or its caching behavior. If you have ideas about how we could model paged-attention, please share. Happy to work on it together.
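(For context, one narrow piece of paged-attention that *can* be modeled without workload assumptions is the KV-cache memory side: contiguous pre-allocation must reserve space for the maximum sequence length, while vLLM-style paging only allocates whole blocks actually used. The function and all numbers below are an illustrative sketch, not part of this tool.)

```python
import math

def kv_cache_bytes(seq_len, num_layers, num_heads, head_dim,
                   bytes_per_elem=2, block_size=None, max_seq_len=None):
    """Estimate KV-cache size in bytes for one sequence.

    block_size=None models contiguous allocation sized to max_seq_len
    (the pre-paged-attention worst case); a numeric block_size models
    vLLM-style paged allocation rounded up to whole blocks.
    """
    # 2 tensors (K and V) per layer, per head, per token
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_elem
    if block_size is None:
        # contiguous: must reserve for the maximum sequence length
        return (max_seq_len or seq_len) * per_token
    # paged: only whole blocks actually touched are allocated
    return math.ceil(seq_len / block_size) * block_size * per_token

# Illustrative numbers (roughly 7B-like shapes, fp16, 200 generated tokens)
contiguous = kv_cache_bytes(200, 32, 32, 128, max_seq_len=2048)
paged = kv_cache_bytes(200, 32, 32, 128, block_size=16)
print(contiguous / paged)  # contiguous reserves roughly 10x more here
```

The latency/throughput impact is harder, since it depends on batch composition and eviction policy, which is exactly the serving-system behavior the tool deliberately avoids assuming.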
Hi @cli99, thanks for your reply. It looks like this kind of optimization needs more detailed system-level modeling.
Closing the ticket, thanks.
Hi,
Will this project support paged-attention (https://vllm.ai/)?
Thanks, Jason