NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Unexpected acceleration of a diffusion-demo-like pipeline when running nsys profile on an A10 GPU #3698

Open zhexinli opened 8 months ago

zhexinli commented 8 months ago

Description

I followed the diffusion demo to accelerate my own SD pipeline, building the VAE encoder, VAE decoder, and UNet into separate engines while keeping the other models in Torch. Without nsys profile, my pipeline takes 616 ms to complete, but with nsys profile enabled it takes only 600 ms, which is unexpected since in my past experience nsys always slows inference down. However, when I export CUDA_LAUNCH_BLOCKING=1 to force synchronous kernel launches, nsys profile (688 ms) is slightly slower than no profiling (665 ms), as usual. So I think the GPU synchronization behavior is somehow different when nsys profile is enabled.
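For reference, a minimal sketch of how I set the env var and time the pipeline (`pipeline` here is a placeholder for my actual SD pipeline object; the env var has to be set before the CUDA context is created):

```python
import os
# CUDA_LAUNCH_BLOCKING must be set before any CUDA context is created,
# i.e. before importing torch, so that kernel launches become synchronous.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import time
import torch

def timed_run(pipeline, *args):
    # Drain any previously queued GPU work before starting the timer.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = pipeline(*args)          # UNet / VAE engines run inside here
    # Wait for all GPU work launched by the pipeline before stopping the timer.
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return out, elapsed_ms
```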

BTW, I synchronize before timing, so the timer should be correct. I also used cudart.cudaEventCreate to measure the time, and the result is the same: the UNet and VAE decoder take less time to complete under nsys profile. (screenshot of the timing results attached)
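The event-based variant looks roughly like this (a sketch; `run_unet_and_vae` is a placeholder for the actual engine calls, and the error codes returned by cudart are ignored for brevity):

```python
from cuda import cudart

# Create start/stop events (cudart calls return (err, value) tuples).
_, start_ev = cudart.cudaEventCreate()
_, end_ev = cudart.cudaEventCreate()

cudart.cudaEventRecord(start_ev, 0)   # record on the default stream
run_unet_and_vae()                    # placeholder for the engine executions
cudart.cudaEventRecord(end_ev, 0)
cudart.cudaEventSynchronize(end_ev)   # block until the stop event completes

_, elapsed_ms = cudart.cudaEventElapsedTime(start_ev, end_ev)
print(f"GPU time: {elapsed_ms:.2f} ms")
```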

I really want to know why, because finding the reason would help me accelerate the pipeline further. Has anyone encountered the same situation?

Environment

TensorRT Version: tensorrt==9.2.0.post12.dev5

NVIDIA GPU: A10

NVIDIA Driver Version: 525.105.17

CUDA Version: V11.8

CUDNN Version:

Operating System:

Python Version (if applicable): 3.9.18

PyTorch Version (if applicable): 2.1.2

zerollzeng commented 8 months ago

I've seen users report this before, and profiling should definitely introduce overhead. My guess is that it's caused by how you measure the time, insufficient warm-up, or an unstable GPU clock/load, etc.
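Something like the following warm-up-then-measure loop (a sketch; `run_pipeline` is a placeholder for your actual pipeline call) usually removes most of that variance; locking the clocks with `nvidia-smi --lock-gpu-clocks` before measuring can also help stabilize the numbers:

```python
import statistics
import torch

N_WARMUP, N_ITERS = 10, 50

# Warm-up: let lazy allocations, autotuning, and clock boost settle.
for _ in range(N_WARMUP):
    run_pipeline()                  # placeholder for the pipeline call
torch.cuda.synchronize()

times = []
for _ in range(N_ITERS):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    run_pipeline()
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))   # milliseconds

print(f"median {statistics.median(times):.1f} ms, "
      f"stdev {statistics.stdev(times):.1f} ms")
```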

zhexinli commented 8 months ago

OK, thanks for the reply! I'll just move on then.

ApolloRay commented 8 months ago

I have tried TRT INT8 quantization for SDXL, but the UNet inference time increases. Do you have the same problem?

`python demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-turbo --onnx-dir ./dreamshaper_model/dreamshaper_onnx/ --engine-dir engine-sdxl-turbo --height 512 --width 512 --int8`

With INT8 the UNet inference takes 300 ms; without INT8, 250 ms.

ApolloRay commented 8 months ago

I have tried with height 1024 and width 1024; an A10 card can't handle it.

ApolloRay commented 8 months ago

If it's convenient, could I add you on WeChat?

zhexinli commented 8 months ago

> If it's convenient, could I add you on WeChat?

Sure, my WeChat ID is FusRoDah15.