cli99 / llm-analysis

Latency and Memory Analysis of Transformer Models for Training and Inference
Apache License 2.0

[REQUEST] Support for paged attention? #16

Closed cnjsdfcy closed 11 months ago

cnjsdfcy commented 11 months ago

Hi,

Will this project support paged attention (https://vllm.ai/)?

Thanks, Jason

cli99 commented 11 months ago

Hi @cnjsdfcy, unfortunately paged-attention support is not on the tool's current todo list. Since paged attention is essentially a caching behavior of the serving system, it is hard to model or to make assumptions about when estimating latency or throughput. The tool aims to provide lower-bound performance from the model's point of view, so it makes no assumptions about the serving system's workload or its caching behavior. If you have ideas on how we could model paged attention, please share. Happy to work on it together.
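
For context, here is a minimal, hypothetical sketch (not part of llm-analysis; all function names, the block size, and the 7B-like configuration below are assumptions for illustration) of why the benefit of paged attention is workload-dependent: a serving system that pre-allocates contiguous KV cache for the maximum sequence length reserves far more memory than one that allocates fixed-size blocks for the tokens actually produced, and the size of that gap depends entirely on the request mix, which a model-centric lower-bound analysis does not assume.

```python
# Hypothetical sketch: contiguous max-length KV-cache pre-allocation vs.
# vLLM-style paged (block) allocation. Names and the workload are illustrative.

def kv_cache_bytes_contiguous(batch_size, max_seq_len, num_layers, num_heads,
                              head_dim, bytes_per_elem=2):
    """KV cache reserved up front for max_seq_len tokens per sequence (K and V)."""
    return 2 * batch_size * max_seq_len * num_layers * num_heads * head_dim * bytes_per_elem


def kv_cache_bytes_paged(actual_seq_lens, block_size, num_layers, num_heads,
                         head_dim, bytes_per_elem=2):
    """KV cache allocated in fixed-size blocks, proportional to tokens actually held."""
    total = 0
    for seq_len in actual_seq_lens:
        num_blocks = -(-seq_len // block_size)  # ceiling division
        total += 2 * num_blocks * block_size * num_layers * num_heads * head_dim * bytes_per_elem
    return total


if __name__ == "__main__":
    # Assumed 7B-like config: 32 layers, 32 heads, head_dim 128, fp16 KV cache.
    layers, heads, head_dim = 32, 32, 128
    max_seq_len, batch = 4096, 8
    # The savings depend entirely on the request mix; this list is made up.
    actual_lens = [512, 900, 300, 2048, 128, 700, 1500, 256]

    contiguous = kv_cache_bytes_contiguous(batch, max_seq_len, layers, heads, head_dim)
    paged = kv_cache_bytes_paged(actual_lens, 16, layers, heads, head_dim)
    print(f"contiguous: {contiguous / 2**30:.1f} GiB, paged: {paged / 2**30:.1f} GiB")
```

With these assumed numbers the contiguous reservation is roughly 16 GiB while the paged allocation is about 3 GiB, but a different request mix would give a different ratio, which is why the effect is hard to capture without system-level workload assumptions.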

cnjsdfcy commented 11 months ago

Hi @cli99, thanks for your reply. It looks like this kind of optimization needs more detailed system-level modeling.

Closing the ticket, thanks.