NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

WOQ is not giving any performance speedup in whisper #1651

Open robosina opened 3 months ago

robosina commented 3 months ago

System Info

Who can help?

@kaiyux

Reproduction

To reproduce this issue, build the Whisper medium model in two versions: one as normal and the other using weight-only quantization (WOQ).
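For concreteness, here is a minimal sketch of the two builds, assuming the convert_checkpoint.py / trtllm-build flow from examples/whisper; the script names, flags, and directory names below are assumptions and may differ between TensorRT-LLM versions.

```python
# Hedged reproduction sketch: build Whisper medium twice, once in plain FP16 and once
# with int8 weight-only quantization (WOQ). Script names, flags, and directory layout
# follow the examples/whisper conventions and may differ between TensorRT-LLM versions;
# adapt them to your checkout.
import subprocess

CONFIGS = [
    # (checkpoint dir, engine dir, extra convert_checkpoint.py flags) -- names are hypothetical
    ("ckpt_medium_fp16", "whisper_medium_bs10_bw5_FP16", []),
    ("ckpt_medium_woq",  "whisper_medium_woq_bs10_bw5_FP16",
     ["--use_weight_only", "--weight_only_precision", "int8"]),
]

for ckpt_dir, engine_dir, extra in CONFIGS:
    # 1) Convert the Whisper checkpoint (optionally applying WOQ to the weights).
    subprocess.run(["python3", "convert_checkpoint.py", "--output_dir", ckpt_dir] + extra,
                   check=True)
    # 2) Build the TensorRT engine(s). The whisper example produces separate encoder and
    #    decoder engines, so your version may require one trtllm-build call per component.
    subprocess.run(["trtllm-build", "--checkpoint_dir", ckpt_dir, "--output_dir", engine_dir,
                    "--max_batch_size", "10", "--max_beam_width", "5"], check=True)

# 3) Run the same benchmark script against both engine dirs and compare RTF / processing time.
```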

Expected Behavior

The version built with WOQ is expected to have better performance.

Actual Behavior

********
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_bs10_bw5_FP16
RTF: 0.0174
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.357 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.62%
********
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_woq_bs10_bw5_FP16
RTF: 0.0174
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.365 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.70%
********
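As a sanity check on the numbers above: assuming RTF is reported as processing time divided by total audio duration, the two engines come out essentially identical.

```python
# RTF (real-time factor) = processing time / total audio duration, assuming that is
# how the benchmark script reports it. Numbers are taken from the logs above.
runs = {
    "FP16":     {"processing_s": 8.357, "audio_s": 481.035},
    "WOQ int8": {"processing_s": 8.365, "audio_s": 481.035},
}

for name, r in runs.items():
    rtf = r["processing_s"] / r["audio_s"]
    print(f"{name}: RTF = {rtf:.4f}")   # both come out to ~0.0174 -> no speedup from WOQ
```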

Additional Notes

I think this issue may be related to GPU architecture. Is there another method for using quantization with the Whisper model?

yuekaizhang commented 3 months ago

I think this issue may be related to GPU architecture. Is there another method for using quantization with the Whisper model?

Yes, WOQ int8 does not significantly improve inference speed; it mainly reduces memory usage while keeping the same speed. For example, the memory usage of Whisper Large can be reduced by about 1.5 GB. In the future, we will support the FP8 quantization scheme, which can improve inference speed. On the Ampere architecture, SmoothQuant int8 can also speed up inference, but its support is prioritized after FP8.

Additionally, you can try running inference with Whisper Large v3. On an A100, it has better accuracy and is not much slower than the Medium model. @robosina
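As a rough sanity check of that ~1.5 GB figure (counting weight storage only, and assuming Whisper Large's roughly 1.55B parameters):

```python
# Back-of-envelope weight-memory estimate; activations, KV cache, and runtime workspace
# are not counted, so real VRAM numbers will differ.
params_large = 1.55e9          # Whisper Large has ~1.55B parameters

fp16_bytes = params_large * 2  # 2 bytes per weight in FP16
int8_bytes = params_large * 1  # 1 byte per weight with WOQ int8

saving_gb = (fp16_bytes - int8_bytes) / 1e9
print(f"Estimated weight-memory saving: {saving_gb:.2f} GB")  # ~1.55 GB, matching the ~1.5 GB quoted
```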

robosina commented 3 months ago

@yuekaizhang Thanks for the answer, but please check the following information:

**********************************************************
running benchmark for whisper_medium_bs10_bw5_FP16
**********************************************************
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_bs10_bw5_FP16
RTF: 0.0172
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.285 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.62%
total 1.6G
-rw-r--r-- 1 root root 1.7K May 22 20:06 decoder_config.json
-rw-r--r-- 1 root root 1.4K May 22 20:06 encoder_config.json
-rw-r--r-- 1 root root 978M May 22 20:06 whisper_decoder_float16_tp1_rank0.engine
-rw-r--r-- 1 root root 589M May 22 20:06 whisper_encoder_float16_tp1_rank0.engine
**********************************************************
running benchmark for whisper_medium_woq_bs10_bw5_FP16
**********************************************************
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_woq_bs10_bw5_FP16
RTF: 0.0178
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.556 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.70%
total 897M
-rw-r--r-- 1 root root 1.7K May 22 20:16 decoder_config.json
-rw-r--r-- 1 root root 1.4K May 22 20:16 encoder_config.json
-rw-r--r-- 1 root root 595M May 22 20:16 whisper_decoder_float16_tp1_rank0.engine
-rw-r--r-- 1 root root 302M May 22 20:16 whisper_encoder_float16_tp1_rank0.engine
**********************************************************
running benchmark for whisper_medium_woq4_bs10_bw5_FP16
**********************************************************
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_woq4_bs10_bw5_FP16
RTF: 0.0181
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.697 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.70%
total 561M
-rw-r--r-- 1 root root 1.7K May 23 11:06 decoder_config.json
-rw-r--r-- 1 root root 1.4K May 23 11:05 encoder_config.json
-rw-r--r-- 1 root root 403M May 23 11:06 whisper_decoder_float16_tp1_rank0.engine
-rw-r--r-- 1 root root 158M May 23 11:05 whisper_encoder_float16_tp1_rank0.engine

Based on the above information, I'm sure the second and third models are using WOQ, since their engines take up less disk space (see the ls output above). However, please check the memory consumption.

[Image: gpu_usage_plot — GPU memory usage during the three benchmark runs]

In this case, the second run uses WOQ int8 and the third uses WOQ int4. As you can see, the memory consumption is not significantly better. Is this expected, or am I hitting a bug here? Thanks in advance.
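A quick tally of the engine sizes from the ls listings above (sizes copied from the output; the interpretation of the VRAM plot is only a guess):

```python
# Engine file sizes (MB) taken from the ls listings above.
engines = {
    "FP16":     {"decoder": 978, "encoder": 589},
    "WOQ int8": {"decoder": 595, "encoder": 302},
    "WOQ int4": {"decoder": 403, "encoder": 158},
}

fp16_total = sum(engines["FP16"].values())
for name, parts in engines.items():
    total = sum(parts.values())
    print(f"{name}: {total} MB on disk ({total / fp16_total:.0%} of FP16)")

# FP16 ~1567 MB, int8 ~897 MB (~57%), int4 ~561 MB (~36%): the weights are clearly
# quantized, which suggests the nearly flat VRAM curve comes mostly from non-weight
# allocations (activations, KV cache, runtime workspace) at this batch size.
```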

yuekaizhang commented 3 months ago

processing time: 8.697 seconds (0.00 hours) batch size: 10 num_beams: 5 total error rate: 4.70%

@robosina Hi, thanks for your investigation. I have run some performance tests and indeed found that the Whisper encoder in FP16 is faster than with int8 WOQ. We will work on this issue and post updates here.

For the Whisper decoder, WOQ int8 should be faster than FP16.

Regarding VRAM usage: Whisper medium has about 0.7B parameters, so WOQ int8 should reduce weight memory by roughly 700 MB compared with FP16, and WOQ int4 by roughly another 350 MB.
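The same back-of-envelope arithmetic for Whisper medium, counting weight storage only (a sketch; runtime VRAM also holds activations, KV cache, and workspace, which quantization does not shrink):

```python
# Weight-only memory estimate for Whisper medium (~0.7B parameters, as stated above).
params_medium = 0.7e9

fp16_gb = params_medium * 2 / 1e9    # ~1.4 GB of weights in FP16
int8_gb = params_medium * 1 / 1e9    # ~0.7 GB with WOQ int8
int4_gb = params_medium * 0.5 / 1e9  # ~0.35 GB with WOQ int4

print(f"int8 saves ~{fp16_gb - int8_gb:.2f} GB, int4 saves another ~{int8_gb - int4_gb:.2f} GB")
```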

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.