NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Working with vLLM is much easier than working with TensorRT-LLM #2237

Closed Alireza3242 closed 1 month ago

Alireza3242 commented 1 month ago

If you want to serve a model and get an OpenAI-compatible API plus Swagger docs, vLLM only needs two commands:

pip install vllm
vllm serve facebook/opt-125m
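
Once the server is up, any OpenAI client can talk to it. A minimal sketch, assuming the default port 8000 and the model name served above:

from openai import OpenAI

# vLLM's server speaks the OpenAI protocol; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)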

But with TensorRT-LLM it is much harder:

1. Download the model.
2. Pull the TensorRT-LLM Triton backend Docker image.
3. Run the container with the model path mounted.
4. Convert the model checkpoint.
5. Build the engine.
6. Copy the config files from tensorrt-llm-backend for Triton.
7. Start Triton.

Even then, we still have no OpenAI API or Swagger. We can only use the Triton API, which does not support many things, such as chat templates.

aikitoria commented 1 month ago

Would be great if Triton had official support for an OpenAI-compatible API.

lfr-0531 commented 1 month ago

TensorRT-LLM has an OpenAI-compatible API; please refer to this example: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/apps
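
The server in that example exposes the standard OpenAI endpoints, so the stock clients work against it unchanged. A minimal sketch, assuming it is listening on localhost:8000; the host, port, and served model name depend on how the example is launched:

from openai import OpenAI

# Point the regular OpenAI client at the locally running TensorRT-LLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-model",  # placeholder: use the model name the server reports under /v1/models
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)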

aikitoria commented 1 month ago

Interesting! I hadn't seen that yet. But this one doesn't go through Triton to get in-flight batching and everything else, right?

lfr-0531 commented 1 month ago

Yes, it doesn't go through Triton, but it can still use in-flight batching through the Python bindings of the executor.
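
A rough sketch of that path, loosely following examples/bindings/executor/example_basic.py in the repo; class and argument names may differ between versions, so treat this as an outline rather than the exact API:

import tensorrt_llm.bindings.executor as trtllm

# The executor schedules every enqueued request with in-flight (continuous) batching.
executor = trtllm.Executor(
    "/path/to/engine_dir",            # directory containing the built TensorRT engine
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(1),         # positional arg: max beam width
)

# Requests enqueued concurrently are batched together by the runtime.
request = trtllm.Request(input_token_ids=[1, 2, 3, 4],  # token IDs from your tokenizer
                         max_new_tokens=16)
request_id = executor.enqueue_request(request)

for response in executor.await_responses(request_id):
    print(response.result.output_token_ids)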

aikitoria commented 1 month ago

Interesting! Are there other advantages to using Triton, then?

lfr-0531 commented 1 month ago

Performance, and a C++-compatible API.