NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Lower-than-Expected Performance Improvement with INT8 Quantization in TensorRT 10.0 on A100 GPU #3776

Open teith opened 7 months ago

teith commented 7 months ago

Description

I recently attempted to utilize INT8 quantization with Stable Diffusion XL to enhance inference performance based on the claims made in a recent TensorRT blog post, which suggested that this approach could achieve a performance improvement of nearly 2x. However, my experiences do not align with these expectations. After implementing INT8 quantization, the performance improvement was notably less than advertised.

Environment

TensorRT Version: 10.0.0b6

NVIDIA GPU: A100

Operating System:

Python Version: 3.10

Baremetal or Container (if so, version): Triton 24.03

Relevant Files

Logs: https://yaso.su/SDXLTestLogs

Steps To Reproduce

I closely followed the steps laid out in the README of the TensorRT repository for the Stable Diffusion XL demo, which you can find here: https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion. Here's a brief rundown of what I did:

  1. I cloned the TensorRT repository and navigated to the section for the Stable Diffusion XL demo, as instructed in the README.
  2. I followed all the setup and installation instructions in the README to prepare my environment and the models for testing. This included setup for both the standard and the INT8-quantized inference paths (a rough sketch of these steps follows this list).
  3. I first ran the model with the standard setup to get a baseline of how fast it performed.
  4. Then, I ran the model with INT8 quantization enabled to see how much the performance would improve.
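
For concreteness, here is a rough sketch of steps 1 and 2 (the branch and paths are taken from the README linked above; see that README for the exact prerequisites):

# clone the release/10.0 branch and move into the Stable Diffusion demo
git clone -b release/10.0 https://github.com/NVIDIA/TensorRT.git
cd TensorRT/demo/Diffusion

# install the demo's Python dependencies (plus any extra prerequisites the README lists)
pip3 install -r requirements.txt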

Expected Outcome: Based on NVIDIA's recommendations and claims, I was expecting that turning on INT8 quantization would almost double the performance compared to the standard run.

Actual Outcome: The performance boost from INT8 quantization was much smaller than expected. Without INT8 quantization, inference took about 2779.89 ms (0.36 images per second); with INT8 quantization it improved only slightly, to about 2564.51 ms (0.39 images per second). This is far short of the nearly 2x speedup that was claimed.
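
To put the improvement in ratio terms, here is the arithmetic on the numbers above (nothing beyond the figures already quoted):

# end-to-end speedup and images/sec from the latencies reported above
awk 'BEGIN { fp16 = 2779.89; int8 = 2564.51;
             printf "speedup: %.2fx, fp16: %.2f img/s, int8: %.2f img/s\n",
                    fp16/int8, 1000/fp16, 1000/int8 }'
# prints: speedup: 1.08x, fp16: 0.36 img/s, int8: 0.39 img/s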

Commands or scripts: https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion

python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --quantization-level 3

Have you tried the latest release?: Yes. The latest release (10.0.0b6) and 9.3.0 give the same result.

D1-3105 commented 7 months ago

+1

yaroslavMain commented 7 months ago

+1

SkobelkinYaroslav commented 7 months ago

+

zerollzeng commented 7 months ago

Let me check with the author of the blog; I'll come back later :-)

azhurkevich commented 7 months ago

@teith Could we ask you to post your TRT engine or send it to us for analysis, along with the environment you are using for the repro?

TheBge12138 commented 7 months ago

@teith Hello, may I ask if you have compared the accuracy of FP16 and INT8? I ran the diffusion demo with FP16 and INT8, and the images generated with the same seed are quite different, not as good as described in the blog. The timing on A100 is the same as yours. Attached are the FP16 and INT8 images generated with the same seed (111).

jingyu-ml commented 7 months ago

@teith

What node are you using? Are you on AWS, GCP, or your own machine? Many other factors can affect this, for example memory size, concurrent workloads, temperature, and so on.

If possible, could you share your ONNX model? The UNet part alone should be enough. @teith cc @TheBge12138

jingyu-ml commented 7 months ago

@TheBge12138 Thanks for the feedback. The current code base is from 6 months ago and differs from the blog; the team is refreshing the code in this repo and will publish it very soon. I will ping you again when the update is finished.

BTW, what quant config did you use?

TheBge12138 commented 7 months ago

> @TheBge12138 Thanks for the feedback. The current code base is from 6 months ago and differs from the blog; the team is refreshing the code in this repo and will publish it very soon. I will ping you again when the update is finished.
>
> BTW, what quant config did you use?

@jingyu-ml I didn't change any code in the demo, so it is probably the default config. Beyond the timing issue, I care more about accuracy: I know the iterative steps in the UNet can lose a lot of accuracy, so I'm very curious how you solve that. I heard in another issue that you will update AMMO and release new calibration scripts in the future; I look forward to your work. Thanks!

teith commented 7 months ago

Hi, @azhurkevich!

> Could we ask you to post your TRT engine or send it to us for analysis, along with the environment you are using for the repro?

Here are the .plan models: https://mega.nz/folder/kCMVQDiR#DFofS7bZW1cBTRg0VJt6JA

And ENVS: https://yaso.su/A100ENVS

teith commented 7 months ago

Hi, @jingyu-ml

> What node are you using? Are you on AWS, GCP, or your own machine?

I used an A100 40GB on Lambda Cloud.

jingyu-ml commented 7 months ago

@teith Is it possible for you to attach the INT8 UNet ONNX file somewhere?

teith commented 7 months ago

Hi, @jingyu-ml !

> Is it possible for you to attach the INT8 UNet ONNX file somewhere?

model.onnx (attached)

jingyu-ml commented 6 months ago

@teith Apologies for the delayed response.

I ran your models on our A100-PCIE-40G GPU.

Here are the logs: fp16.log int8.log

FP16:

[05/07/2024-18:08:35] [I] Latency: min = 91.6599 ms, max = 96.4124 ms, mean = 93.0172 ms, median = 92.9111 ms, percentile(90%) = 93.825 ms, percentile(95%) = 94.1382 ms, percentile(99%) = 96.4124 ms

INT8:

[05/07/2024-17:50:55] [I] Latency: min = 75.9916 ms, max = 77.4071 ms, mean = 76.4652 ms, median = 76.4912 ms, percentile(90%) = 76.7236 ms, percentile(95%) = 76.9514 ms, percentile(99%) = 77.4071 ms

That is about a 1.25x speedup over FP16 TRT, which is somewhat slower than our internal benchmarks but still faster than FP16. The discrepancy may be due to server instability; we plan to conduct further testing on your models. It's important to mention that the performance figures reported in our previous blog were based on the RTX 6000 Ada GPU, not the A100. Performance can vary significantly across different GPUs.

Additionally, could you execute the following commands on your server, making sure you have updated trtexec to version 9.3 or 10.0? 10.0 should be faster than 9.3.

# Download the TensorRT tar file and unzip it
# cd into the folder
export LD_LIBRARY_PATH=$(pwd)/lib:$LD_LIBRARY_PATH
export PATH=$(pwd)/bin:$PATH

cd python
pip install tensorrt-<version>-cp<version>-cp<version>m-linux_x86_64.whl

cd ../onnx_graphsurgeon
pip install onnx_graphsurgeon-<version>-py2.py3-none-any.whl

# check the versions: trtexec prints its TensorRT version in the banner of every run,
# and the Python package version can be checked directly:
python3 -c "import tensorrt; print(tensorrt.__version__)"

This gives you the newest trtexec and the matching TensorRT Python packages.

Then try this on your int8 onnx model:

trtexec --onnx=./unet.onnx --shapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 --fp16 --int8 --builderOptimizationLevel=4 --saveEngine=unetxl.int8.plan

Then remove the --int8 flag and try again on your FP16 ONNX model (a sketch of that run is below). If you are able to run these commands, please also share the full logs with me; trtexec prints the inference latency in the log. Then we can discuss the next step.
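
For reference, a minimal sketch of that FP16 baseline run, assuming the FP16 UNet export is named unet_fp16.onnx (use whatever path your export actually has); tee simply captures the full trtexec log for sharing:

# FP16 baseline: same shapes and optimization level as above, without --int8
# (unet_fp16.onnx is a placeholder name for the non-quantized UNet export)
trtexec --onnx=./unet_fp16.onnx \
        --shapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 \
        --fp16 --builderOptimizationLevel=4 \
        --saveEngine=unetxl.fp16.plan 2>&1 | tee fp16_trtexec.log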

hchings commented 6 months ago

Hi @teith, @D1-3105,

Adding to @jingyu-ml's response above, you can refer to this latest benchmark to see the expected speedups on other NVIDIA hardware. In general, we do observe a higher speedup on the RTX 6000 Ada.

Note that the quantization techniques used in TensorRT have now been moved into a new NVIDIA product called TensorRT Model Optimizer. This does not change your workflow. We encourage you to check out the related resources and look forward to your feedback.
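
If it helps, the Model Optimizer package can also be installed on its own; the PyPI name below is my understanding of the current package name, so please double-check it against the official Model Optimizer documentation:

# assumed PyPI package name for TensorRT Model Optimizer; verify against the official docs
pip install nvidia-modelopt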

@TheBge12138 Re the image quality issue - the team has pointed out that there have been fixes in the recent releases. Could you try the latest TensorRT demoDiffusion example or the Model Optimizer example and let us know if it's still an issue? Note that these two examples share the same workflow, but the Model Optimizer repo has the FP8 plugin and the latest INT8 changes, which are not in the TensorRT repo yet.