teith opened this issue 7 months ago
+1
+1
Let me check with the author of the blog and come back later :-)
@teith Could you post your TRT engine or send it to us for analysis, along with the environment you are using for the repro?
@teith Hello, may I ask whether you have compared the accuracy of fp16 and int8? I ran the diffusion demo with fp16 and int8, and the images generated under the same seed are quite different, not as good as described in the blog. The timing on A100 matches yours. Here are the fp16 and int8 images with the same seed=111.
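(For anyone wanting to put a number on that difference: a minimal sketch, not from the demo, for quantifying how far apart two images generated with the same seed are. fp16.png and int8.png are assumed filenames.)

# Rough fp16-vs-int8 image comparison sketch; fp16.png / int8.png are
# assumed filenames for two images generated with the same seed.
import numpy as np
from PIL import Image

def mse_psnr(path_a: str, path_b: str):
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.float64)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    psnr = float("inf") if mse == 0 else 20 * np.log10(255.0 / np.sqrt(mse))
    return mse, psnr

mse, psnr = mse_psnr("fp16.png", "int8.png")
print(f"MSE: {mse:.2f}, PSNR: {psnr:.2f} dB")  # lower PSNR = larger fp16/int8 gap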
@teith
What node are you using? Are you on AWS, GCP, or your own machine? There are many other factors at play here, for example memory size, concurrent workloads, temperature, and so on.
If possible, can you share your ONNX model? Only the UNet part should be enough. @teith cc @TheBge12138
@TheBge12138 Thanks for the feedback. The current code base is from 6 months ago and differs from what the blog used; the team is refreshing the code in this repo and will publish it very soon, and I will ping you again once the update is finished.
BTW, what quant config did you use?
@jingyu-ml I didn't change any code in the demo, so it should be using the default. Beyond the timing issue, I care more about accuracy: I know the iterative UNet steps can accumulate a lot of quantization error, so I'm very curious how you address that. I heard in another issue that you will update ammo and release new calibration scripts in the future; I look forward to your work. Thanks!
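(For comparing configs, a rough sketch of what a default INT8 calibration call looks like with the Model Optimizer (formerly ammo) Python API. This is an assumption about the workflow rather than the demo's actual code; unet and calib_batches are placeholders, and the demo may use a different config, e.g. SmoothQuant-style settings.)

# Hedged sketch of INT8 post-training quantization with NVIDIA Model Optimizer
# (formerly ammo). The exact config the demo uses may differ; `unet` and
# `calib_batches` are placeholders supplied by the caller.
import modelopt.torch.quantization as mtq

def quantize_unet_int8(unet, calib_batches):
    def forward_loop(model):
        # Run representative denoising inputs so activation ranges can be calibrated.
        for batch in calib_batches:
            model(*batch)
    # INT8_DEFAULT_CFG is Model Optimizer's stock INT8 config; the demo may
    # swap in a SmoothQuant-style config instead.
    return mtq.quantize(unet, mtq.INT8_DEFAULT_CFG, forward_loop=forward_loop)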
Hi, @azhurkevich!
Could you post your TRT engine or send it to us for analysis, along with the environment you are using for the repro?
Here are the .plan models: https://mega.nz/folder/kCMVQDiR#DFofS7bZW1cBTRg0VJt6JA
And the environment details: https://yaso.su/A100ENVS
Hi, @jingyu-ml
What node are you using? Are you on AWS, GCP, or your own machine?
I used an A100 40GB on Lambda Cloud.
@teith Would it be possible for you to attach the INT8 UNet ONNX file somewhere?
Hi, @jingyu-ml!
Would it be possible for you to attach the INT8 UNet ONNX file somewhere?
@teith Apologies for the delayed response.
I ran your models on our A100-PCIE-40G GPU.
Here are the logs: fp16.log int8.log
FP16:
[05/07/2024-18:08:35] [I] Latency: min = 91.6599 ms, max = 96.4124 ms, mean = 93.0172 ms, median = 92.9111 ms, percentile(90%) = 93.825 ms, percentile(95%) = 94.1382 ms, percentile(99%) = 96.4124 ms
INT8:
[05/07/2024-17:50:55] [I] Latency: min = 75.9916 ms, max = 77.4071 ms, mean = 76.4652 ms, median = 76.4912 ms, percentile(90%) = 76.7236 ms, percentile(95%) = 76.9514 ms, percentile(99%) = 77.4071 ms
That is about a 1.25x speedup over FP16 TRT, which is somewhat lower than our internal benchmarks but still clearly faster than FP16. The discrepancy may be due to server instability; we plan to conduct further testing with your models. It's important to mention that the performance figures reported in our blog were measured on the RTX 6000 Ada GPU, not the A100, and performance can vary significantly across GPUs.
Additionally, could you run the command line below on your server, after making sure trtexec is updated to version 9.3 or 10.0? 10.0 should be faster than 9.3.
# Download the TensorRT tar file and extract it
# cd into the extracted folder
export LD_LIBRARY_PATH=$(pwd)/lib:$LD_LIBRARY_PATH
export PATH=$(pwd)/bin:$PATH
cd python
pip install tensorrt-<version>-cp<version>-cp<version>m-linux_x86_64.whl
cd ../onnx_graphsurgeon
pip install onnx_graphsurgeon-<version>-py2.py3-none-any.whl
# check the trtexec version
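# For example (assuming the freshly unpacked bin/ is now first on PATH):
which trtexec                                               # should resolve to the new bin/trtexec
python3 -c "import tensorrt; print(tensorrt.__version__)"   # confirms the installed Python wheel version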
This gives you the newest trtexec and the matching TRT Python package.
Then try this on your INT8 ONNX model:
trtexec --onnx=./unet.onnx --shapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 --fp16 --int8 --builderOptimizationLevel=4 --saveEngine=unetxl.int8.plan
Then remove the --int8 flag and try again with your FP16 ONNX model. If you are able to run these commands, please also share the full logs with me; trtexec prints the inference latency in the log. Then we can discuss the next step.
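(Side note, not part of the demo: a tiny sketch for pulling just the latency summary line out of each trtexec log for a quick side-by-side comparison; fp16.log and int8.log are placeholder filenames.)

# Small helper sketch to grab the "Latency:" summary line from trtexec logs;
# fp16.log / int8.log are placeholder filenames.
def latency_line(path: str) -> str:
    with open(path) as f:
        for line in f:
            if "Latency:" in line:
                return line.strip()
    return f"{path}: no latency line found"

for log in ("fp16.log", "int8.log"):
    print(latency_line(log))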
Hi @teith, @D1-3105,
Adding to @jingyu-ml's response above, you can refer to this latest benchmark to see the expected speedup on other NVIDIA hardware. In general, we do observe a higher speedup on RTX 6000 Ada.
Note that the quantization techniques used in TensorRT have now been moved into a new NVIDIA product called TensorRT Model Optimizer. This does not change your workflow. We encourage you to check out the related resources, and we look forward to your feedback.
@TheBge12138 Re: the image quality issue - the team has pointed out that there have been fixes in recent releases. Could you try the latest TensorRT demoDiffusion example or the Model Optimizer example, and let us know if it's still an issue? Note that the two examples share the same workflow, but the Model Optimizer repo has the FP8 plugin and the latest INT8 updates, which are not in the TensorRT repo yet.
Description
I recently attempted to utilize INT8 quantization with Stable Diffusion XL to enhance inference performance based on the claims made in a recent TensorRT blog post, which suggested that this approach could achieve a performance improvement of nearly 2x. However, my experiences do not align with these expectations. After implementing INT8 quantization, the performance improvement was notably less than advertised.
Environment
TensorRT Version: 10.0.0b6
NVIDIA GPU: A100
Operating System:
Python Version: 3.10
Baremetal or Container (if so, version): Triton 24.03
Relevant Files
Logs: https://yaso.su/SDXLTestLogs
Steps To Reproduce
I closely followed the steps laid out in the README of the TensorRT repository for the Stable Diffusion XL demo, which you can find here: https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion. Here's a brief rundown of what I did:
Expected Outcome: Based on NVIDIA's recommendations and claims, I was expecting that turning on INT8 quantization would almost double the performance compared to the standard run.
Actual Outcome: The performance boost from INT8 quantization was much smaller than expected. To put it in numbers, without INT8 quantization inference took about 2779.89 ms (0.36 images per second), while with INT8 quantization it improved only slightly to about 2564.51 ms (0.39 images per second). This is far short of the nearly 2x speedup I was anticipating.
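Worked out explicitly, that is 2779.89 ms / 2564.51 ms ≈ 1.08, i.e. only about an 8% end-to-end speedup, nowhere near 2x.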
Commands or scripts: https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion
Have you tried the latest release?: Yes, the latest (10.0.0b6); 9.3.0 gives the same result.