NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mixtral optimization from vllm #672

Open 0xymoro opened 11 months ago

0xymoro commented 11 months ago

Putting this here; the latency change seems very substantial:

https://github.com/vllm-project/vllm/pull/2090

jdemouth-nvidia commented 11 months ago

Thanks for the pointer. We will take a look at it.

hello-11 commented 1 week ago

@0xymoro Do you still have the problem? If not, we will close it soon.