Open zhexinli opened 8 months ago
I've seen users report this before, and profiling will definitely introduce overhead. I suspect it's caused by how you measure the time, insufficient warm-up, or unstable GPU clocks/load, etc.
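For reference, a minimal warm-up-then-measure loop along these lines usually gives stable numbers (a sketch, assuming a PyTorch-backed pipeline wrapped in a callable `run_pipeline`; the names are illustrative, not from the demo):

```python
import time
import torch

def benchmark(run_pipeline, warmup_iters=10, timed_iters=20):
    """Time a GPU pipeline after warming it up, so lazy initialization
    and clock ramp-up don't pollute the measurement."""
    # Warm-up: run the pipeline a few times and discard the results.
    for _ in range(warmup_iters):
        run_pipeline()
    torch.cuda.synchronize()

    # Timed runs: synchronize before and after so the host timer
    # brackets all queued GPU work, not just the kernel launches.
    start = time.perf_counter()
    for _ in range(timed_iters):
        run_pipeline()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / timed_iters  # ms per run
```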
OK, thanks for the reply! I'll just move on then.
I have tried TRT INT8 quantization for SDXL, but the UNet inference time increases. Do you have the same problem?
python demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-turbo --onnx-dir ./dreamshaper_model/dreamshaper_onnx/ --engine-dir engine-sdxl-turbo --height 512 --width 512 --int8
UNet inference takes 300 ms with INT8, versus 250 ms without it.
I also tried with height 1024 and width 1024, but an A10 card can't handle it.
If it's convenient, could we connect on WeChat?
Sure, my WeChat ID is FusRoDah15.
Description
I followed the diffusion demo to accelerate my own SD pipeline, building the VAE encoder, VAE decoder, and UNet into separate engines while keeping the other models in Torch. Without nsys profiling, my pipeline takes 616 ms to complete, but with nsys profiling enabled it takes only 600 ms, which is unexpected since nsys has always slowed down inference in my past experience. However, when I export CUDA_LAUNCH_BLOCKING=1 to force device synchronization, nsys profiling (688 ms) takes a little longer than no profiling (665 ms), as usual. So I think the GPU synchronization behavior is somehow different when nsys profiling is enabled.
BTW, I synchronize before timing, so the timer should be right. I also used cudart.cudaEventCreate to measure the time, and the result is the same: the UNet and VAE decoder take less time to complete under nsys profiling.
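This is roughly how I measure it with cudart events (a minimal sketch; `run_gpu_work` and `stream` stand in for the real engine call and CUDA stream, so those names are only illustrative):

```python
from cuda import cudart

def time_with_cuda_events(run_gpu_work, stream):
    """Measure GPU time for one call using CUDA events, so the result
    does not depend on when the host happens to synchronize."""
    _, start = cudart.cudaEventCreate()
    _, stop = cudart.cudaEventCreate()

    cudart.cudaEventRecord(start, stream)
    run_gpu_work()
    cudart.cudaEventRecord(stop, stream)

    # Block until the stop event has actually executed on the GPU.
    cudart.cudaEventSynchronize(stop)
    _, elapsed_ms = cudart.cudaEventElapsedTime(start, stop)

    cudart.cudaEventDestroy(start)
    cudart.cudaEventDestroy(stop)
    return elapsed_ms
```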
I really want to know why, because finding the reason would help me accelerate the pipeline further. Has anyone encountered the same situation?
Environment
TensorRT Version: tensorrt==9.2.0.post12.dev5
NVIDIA GPU: A10
NVIDIA Driver Version: 525.105.17
CUDA Version: V11.8
CUDNN Version:
Operating System:
Python Version (if applicable): 3.9.18
PyTorch Version (if applicable): 2.1.2