HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: The TP improvement is not as expected #274

Open JunxiChhen opened 1 month ago

JunxiChhen commented 1 month ago

Your current environment

Offline inference of Llama-3-8B with benchmark_latency.py, sweeping over 1, 2, and 4 cards, gives the following results:

[image: vLLM benchmark_latency.py results on 1, 2, and 4 cards]

And the optimum-habana results:

[image: optimum-habana results on 1, 2, and 4 cards]
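For reference, below is a minimal sketch of the kind of sweep described above, using vLLM's offline `LLM` API directly rather than the exact `benchmark_latency.py` flags. The model ID, prompt shape, output length, and iteration count are assumptions for illustration, not necessarily the settings used to produce the numbers in the screenshots.

```python
# Minimal latency-sweep sketch (assumed settings, not the exact benchmark_latency.py arguments).
# Run once per TP size, e.g.:  python tp_latency.py --tp 2
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--tp", type=int, default=1, help="tensor-parallel size (1, 2, or 4 cards)")
parser.add_argument("--model", default="meta-llama/Meta-Llama-3-8B")  # assumed model ID
args = parser.parse_args()

# Build the engine with the requested tensor-parallel degree.
llm = LLM(model=args.model, tensor_parallel_size=args.tp)
sampling = SamplingParams(max_tokens=128, ignore_eos=True)  # fixed output length
prompts = ["Hello, my name is"] * 8                          # small fixed batch

# Warm up once, then time a few iterations and report the average batch latency.
llm.generate(prompts, sampling)
iters = 5
start = time.perf_counter()
for _ in range(iters):
    llm.generate(prompts, sampling)
avg = (time.perf_counter() - start) / iters
print(f"TP={args.tp}: average batch latency {avg:.3f} s")
```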

The results show that on a single card, vLLM is faster than optimum-habana. However, with multi-card inference the tensor-parallel (TP) scaling gain in vLLM is not large enough, so its performance ends up worse than optimum-habana's.
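One way to make "the gain is not large enough" concrete is to compute the TP speedup and parallel efficiency from the measured latencies. The values below are placeholders only (the real numbers are in the screenshots above); this is just a sketch of the calculation:

```python
# Hypothetical latencies in seconds per batch; substitute the measured values
# from the results above (these placeholders are NOT the real numbers).
latency = {1: 10.0, 2: 7.5, 4: 6.0}

t1 = latency[1]
for cards, t in sorted(latency.items()):
    speedup = t1 / t               # ideal value would equal the card count
    efficiency = speedup / cards   # ideal value would be 1.0
    print(f"{cards} card(s): speedup {speedup:.2f}x, TP efficiency {efficiency:.0%}")
```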

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

wpyszka commented 2 weeks ago

@JunxiChhen, vLLM performance improvements are planned for the SW 1.19 release. Please stay tuned.