aws-neuron / transformers-neuronx

Apache License 2.0

Serving Throughput Optimizations (e.g. PagedAttention) #52

Closed vigneshv59 closed 6 days ago

vigneshv59 commented 9 months ago

Projects like vLLM help optimize model serving throughput. Is implementing PagedAttention, or integrating with vLLM, on your roadmap to improve using Inf2 processors in production?

jyang-aws commented 7 months ago

@vigneshv59 Sorry for the late reply. We're actively working on this and plan to support it in a future release. We will keep you posted once it's ready.

aws-maens commented 7 months ago

@vigneshv59 - To clarify the answer above: we are actively working on continuous batching; PagedAttention is a backlog item for now.
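For context, the core idea behind PagedAttention is to manage the KV cache like virtual memory: the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand rather than reserved contiguously per sequence. A minimal toy sketch of that bookkeeping (hypothetical names, not the actual vLLM or Neuron implementation):

```python
class PagedKVCache:
    """Toy block allocator illustrating the paged KV-cache idea.

    This is a conceptual sketch only: real implementations store the
    actual key/value tensors in the blocks and do the attention math
    over the block table on the accelerator.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Ensure a physical block backs token position `pos`; return its id."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // self.block_size >= len(table):
            # Logical position falls past the mapped range: grab a free block.
            table.append(self.free_blocks.pop())
        return table[pos // self.block_size]

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because blocks are only claimed as a sequence grows and are returned as soon as it finishes, fragmentation and over-reservation are much lower than with per-sequence contiguous buffers, which is where the throughput gain comes from.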

aws-rhsoln commented 6 days ago

We have added an example of continuous batching with vLLM here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html

Closing this ticket. Please re-open if there are any further issues.
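To illustrate what continuous batching buys over static batching: finished sequences leave the batch after every decode step and queued requests take their slots immediately, instead of the whole batch waiting for its slowest member. A hypothetical toy scheduler (illustrative only, not the Neuron/vLLM scheduler) makes the mechanism concrete:

```python
from collections import deque


class ContinuousBatcher:
    """Toy continuous-batching scheduler: slots are refilled every step."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.queue = deque()  # waiting requests: (req_id, tokens_to_generate)
        self.active = {}      # req_id -> tokens remaining to generate

    def submit(self, req_id, num_tokens):
        self.queue.append((req_id, num_tokens))

    def step(self):
        """Run one decode step; return the ids that finished this step."""
        # Refill free batch slots from the queue before decoding.
        while self.queue and len(self.active) < self.max_batch_size:
            req_id, n = self.queue.popleft()
            self.active[req_id] = n
        finished = []
        for req_id in list(self.active):
            self.active[req_id] -= 1  # stand-in for generating one token
            if self.active[req_id] == 0:
                finished.append(req_id)
                del self.active[req_id]  # slot freed for the next step
        return finished
```

With a batch size of 2 and requests needing 1, 3, and 2 tokens, the third request joins as soon as the first finishes, so no decode step runs with an idle slot while work is queued.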