Arnold1 opened this issue 4 months ago (status: open)
@kaiyux Could you please have a look? Thanks
Hi @Arnold1, how did you get the benchmark results for Triton inference and vLLM? Can you share your detailed steps so I can reproduce your results quickly and find the root cause of the gap?
Hi @Arnold1, @sunnyqgg, were you able to figure out the root cause here? I am observing a similar trend for a Llama-2-7B model, using the latest versions of both TRT-LLM and vLLM with their respective latest Triton servers.
Hi @ashwin-js, this is not expected; can you share your steps and commands for both?
@Arnold1 @ashwin-js If you have no further questions, we will close this issue in a week.
System Info
Hi,
I generated a TensorRT-LLM engine for a Llama-based model and see that its performance is much worse than vLLM.
I did the following:
questions: why is the TensorRT-LLM/Triton setup so much slower than vLLM under concurrent requests, and what can I do to improve it?
setup:
used GPU:
built the TensorRT-LLM engine and created the Triton model repo: create_trt_engine.txt
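the exact commands I used are in the attachment above; roughly, the flow was along these lines (paths, model name, and flags are placeholders and vary by TensorRT-LLM version):

```bash
# illustrative sketch only -- the real commands are in create_trt_engine.txt
# convert the HF checkpoint, then build the engine with trtllm-build
cd TensorRT-LLM/examples/llama
python3 convert_checkpoint.py \
    --model_dir /models/llama-hf \
    --output_dir /tmp/trt_ckpt \
    --dtype float16
trtllm-build \
    --checkpoint_dir /tmp/trt_ckpt \
    --output_dir /engines/llama \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 1024 \
    --max_output_len 512
```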
started the Triton Inference Server (launch command and model configs): start_triton_inference.txt
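the exact launch command and configs are in the attachment above; it was roughly this (model repo path and world size are placeholders):

```bash
# illustrative sketch only -- the real launch command and configs are in start_triton_inference.txt
# launch Triton via the tensorrtllm_backend helper script
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm
```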
benchmarked Triton inference:
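roughly this kind of concurrent-request loop against Triton's generate endpoint (prompt, token budget, request count, and concurrency are placeholders, not my exact benchmark):

```bash
# illustrative sketch: 64 requests, 8 in flight at a time, against the ensemble model
time seq 64 | xargs -P 8 -I{} curl -s -o /dev/null -X POST \
    http://localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": ""}'
```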
deployed and started the vLLM container:
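roughly (image tag and model name are placeholders):

```bash
# illustrative sketch: pull and run the OpenAI-compatible vLLM server
docker pull vllm/vllm-openai:latest
docker run --gpus all --rm -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-2-7b-hf \
    --dtype float16
```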
benchmarked vLLM:
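same kind of concurrent curl loop, this time against the OpenAI-compatible completions endpoint (again placeholders, not my exact benchmark):

```bash
# illustrative sketch: 64 requests, 8 in flight at a time, against vLLM
time seq 64 | xargs -P 8 -I{} curl -s -o /dev/null \
    http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "What is machine learning?", "max_tokens": 128}'
```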
Who can help?
@hijkzzz @Tracin @yuxianq @Njuapp @uppalutkarsh @nv-guomingz
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The reproduction steps are all in the commands above.
Expected behavior
Better performance for concurrent requests, roughly on par with vLLM.
Actual behavior
Performance degrades significantly under concurrent requests compared to vLLM.
Additional notes
-