Your current environment
Offline inference results for Llama-3-8B with benchmark_latency.py, sweeping over 1, 2, and 4 cards:
And the optimum-habana results:
The results show that vLLM outperforms optimum-habana on a single card. However, for multi-card inference, the tensor-parallel (TP) scaling in vLLM is not good enough, so its performance falls behind optimum-habana.
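For reference, here is a minimal sketch of the offline-inference setup being compared, using vLLM's Python API; the model ID and tensor_parallel_size value are illustrative (benchmark_latency.py exposes the same setting via its --tensor-parallel-size argument, which is what the 1/2/4-card sweep varies):

```python
# Minimal sketch of offline inference with vLLM and tensor parallelism.
# The model ID and TP size below are illustrative assumptions.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# tensor_parallel_size shards the model across cards (1, 2, or 4 in the sweep above).
llm = LLM(model="meta-llama/Meta-Llama-3-8B", tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```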
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.