NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Why does the whisper model need 17GB of video memory? #805

Open paulxin001 opened 10 months ago

paulxin001 commented 10 months ago

Why does the Whisper model need 17 GB of video memory, when faster-whisper only needs about 4 GB? Also, I haven't found a way to quantize Whisper to int8. Is that not supported yet? The memory footprint is too high; is there any way to optimize it?
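For context, a rough back-of-envelope calculation shows why 17 GB cannot be explained by the weights alone. Assuming the published ~1.55 B parameters of whisper-large (an assumption; the exact engine size may differ), fp16 weights take only about 3 GiB, so the remainder of the observed usage would come from activations, the KV cache, and TensorRT workspace allocations:

```python
# Back-of-envelope weight-memory estimate for whisper-large.
# The 1.55e9 parameter count is the published size of whisper-large;
# everything beyond the weights (activations, KV cache, TensorRT
# workspace) accounts for the rest of the observed 17 GB.
PARAMS = 1.55e9

def weight_gib(bytes_per_param: float, params: float = PARAMS) -> float:
    """Weight storage in GiB at a given precision."""
    return params * bytes_per_param / 2**30

fp16 = weight_gib(2)  # fp16/bf16: 2 bytes per parameter (~2.9 GiB)
int8 = weight_gib(1)  # int8 weight-only: 1 byte per parameter

print(f"fp16 weights: {fp16:.2f} GiB")
print(f"int8 weights: {int8:.2f} GiB")
```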

kristiankielhofner commented 10 months ago

It's getting worked on.

yuekaizhang commented 9 months ago

> It's getting worked on.

Yeah, you could try the int8 weight-only quantization branch, which greatly reduces memory usage. That said, memory usage shouldn't be a big issue here: GPU utilization is already high, so any memory you freed up couldn't be put to use by other tasks anyway. @paulxin001
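To make the suggestion concrete, here is a minimal sketch of the core idea behind int8 weight-only quantization: per-row absmax scaling of the weights to int8, dequantized on the fly at matmul time. This is a pure-Python illustration of the technique, not TensorRT-LLM's actual implementation:

```python
# Sketch of int8 weight-only quantization with per-row absmax scaling.
# Storage drops from 2 bytes/weight (fp16) to 1 byte/weight plus one
# float scale per row; activations stay in higher precision.

def quantize_row(row):
    """Map a row of float weights to int8 values plus one float scale."""
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # avoid div-by-zero
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_row(row)
restored = dequantize_row(q, s)  # close to the original, small rounding error
```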

yuekaizhang commented 9 months ago

@paulxin001 Would you mind removing the layernorm plugin and trying again? Thank you.

See https://github.com/NVIDIA/TensorRT-LLM/pull/992