NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Why does the whisper model need 17GB of video memory? #805

Open paulxin001 opened 10 months ago

paulxin001 commented 10 months ago

Why does the Whisper model need 17 GB of video memory, when faster-whisper only needs about 4 GB? Also, I haven't found a way to quantize Whisper to int8. Is that not supported yet? The memory footprint is too high; is there any way to optimize it?
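For context, a rough back-of-envelope calculation shows why 17 GB cannot be explained by the weights alone. Assuming the published ~1.55 B parameters of whisper-large (an assumption; the exact engine size may differ), fp16 weights take only about 3 GiB, so the remainder of the observed usage would come from activations, the KV cache, and TensorRT workspace allocations:

```python
# Back-of-envelope weight-memory estimate for whisper-large.
# The 1.55e9 parameter count is the published size of whisper-large;
# everything beyond the weights (activations, KV cache, TensorRT
# workspace) accounts for the rest of the observed 17 GB.
PARAMS = 1.55e9

def weight_gib(bytes_per_param: float, params: float = PARAMS) -> float:
    """Weight storage in GiB at a given precision."""
    return params * bytes_per_param / 2**30

fp16 = weight_gib(2)  # fp16/bf16: 2 bytes per parameter (~2.9 GiB)
int8 = weight_gib(1)  # int8 weight-only: 1 byte per parameter

print(f"fp16 weights: {fp16:.2f} GiB")
print(f"int8 weights: {int8:.2f} GiB")
```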

kristiankielhofner commented 10 months ago

It's getting worked on.

yuekaizhang commented 9 months ago

> It's getting worked on.

Yeah, you could try the int8 weight-only quantization branch, which greatly reduces memory usage. That said, memory usage shouldn't be a big issue here: GPU utilization is already high, so any memory you freed up couldn't be put to use by other tasks anyway. @paulxin001
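To make the suggestion concrete, here is a minimal sketch of the core idea behind int8 weight-only quantization: per-row absmax scaling of the weights to int8, dequantized on the fly at matmul time. This is a pure-Python illustration of the technique, not TensorRT-LLM's actual implementation:

```python
# Sketch of int8 weight-only quantization with per-row absmax scaling.
# Storage drops from 2 bytes/weight (fp16) to 1 byte/weight plus one
# float scale per row; activations stay in higher precision.

def quantize_row(row):
    """Map a row of float weights to int8 values plus one float scale."""
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # avoid div-by-zero
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_row(row)
restored = dequantize_row(q, s)  # close to the original, small rounding error
```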

yuekaizhang commented 9 months ago

@paulxin001 Would you mind removing the layernorm plugin and trying again? Thank you.

See https://github.com/NVIDIA/TensorRT-LLM/pull/992