NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

refit_cuda_engine method is too slow #3332

Open davidli313 opened 1 year ago

davidli313 commented 1 year ago

Description

I used the Python TensorRT Refitter class to load the LoRA weights of a Stable Diffusion UNet, but the refitter.refit_cuda_engine method is very slow, usually taking 4~5 seconds. Is there any way to improve the performance of refit_cuda_engine?
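
For context, a minimal sketch of the host-weight refit pattern being described (the function name and the weight dictionary are illustrative, not taken from the reporter's actual code):

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def refit_with_lora(engine: trt.ICudaEngine, new_weights: dict) -> None:
    """Refit a REFIT-enabled engine with updated (e.g. LoRA-merged) weights.

    new_weights maps the weight names assigned at build time to numpy arrays.
    """
    refitter = trt.Refitter(engine, TRT_LOGGER)
    for name, array in new_weights.items():
        # Host (CPU) weights: this is the path that takes several seconds here.
        refitter.set_named_weights(name, trt.Weights(np.ascontiguousarray(array)))
    # refit_cuda_engine returns False if any required weights are still missing.
    assert refitter.refit_cuda_engine()
```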

Environment

TensorRT Version: 8.6.1
NVIDIA GPU: GeForce RTX 4090
NVIDIA Driver Version: 525.89.02
CUDA Version: 12.0
CUDNN Version: 8.9.2
Operating System: Ubuntu 20.04.1
Python Version (if applicable): 3.9.16
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

BowenFu commented 1 year ago

@davidli313 could you try TensorRT 9.0? The refitting perf has been improved by 1.8 - 15x in TensorRT 9.0.

Also can you give more details on your use case? What is the inference time? (Where are the weights from? Is there a training stage?) What is the percentage of refitting time in the entire process?

zhangvia commented 1 year ago

> @davidli313 could you try TensorRT 9.0? The refitting perf has been improved by 1.8 - 15x in TensorRT 9.0.
>
> Also can you give more details on your use case? What is the inference time? (Where are the weights from? Is there a training stage?) What is the percentage of refitting time in the entire process?

Actually, an engine built with the refit feature has poor inference performance, especially with dynamic shapes, and the inference time is not stable. The refittable UNet engine takes 500 ms to 1000 ms per inference, which is slower than PyTorch.

sunhs commented 1 year ago

@davidli313 @zhangvia Hi, can you point me to some samples of loading LoRA with refit?

FuyuanChen commented 5 months ago

I met the same problem. With Nsight Systems I see some cudaMalloc and cudaFree calls that take a long time. Can anyone share a C++ sample for refitting a UNet with LoRA?

BowenFu commented 5 months ago

> I met the same problem. With Nsight Systems I see some cudaMalloc and cudaFree calls that take a long time. Can anyone share a C++ sample for refitting a UNet with LoRA?

Please refer to https://github.com/NVIDIA/TensorRT/blob/release/9.2/demo/Diffusion/utilities.py and pass GPU weights to the refitter instead of CPU weights, to avoid internal memory allocation.
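
For anyone landing here, a rough sketch of the GPU-weight path suggested above, assuming a recent TensorRT release whose Refitter exposes a set_named_weights overload that takes a trt.TensorLocation (as the linked demo relies on); check your version's Python API before relying on it. PyTorch is used here only to stage the weights on the GPU, and FP32 weights are assumed for simplicity:

```python
import torch
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def refit_with_gpu_weights(engine, new_weights):
    """new_weights: dict of build-time weight name -> FP32 numpy array."""
    refitter = trt.Refitter(engine, TRT_LOGGER)
    device_tensors = []  # keep the GPU buffers alive until refit finishes
    for name, array in new_weights.items():
        t = torch.as_tensor(array).contiguous().float().cuda()
        device_tensors.append(t)
        # Wrap the raw device pointer (FP32 assumed).
        weights = trt.Weights(trt.DataType.FLOAT, t.data_ptr(), t.numel())
        # Declaring the weights as already resident on the GPU avoids the
        # per-weight cudaMalloc/cudaMemcpy/cudaFree seen in the Nsight trace.
        refitter.set_named_weights(name, weights, trt.TensorLocation.DEVICE)
    assert refitter.refit_cuda_engine()
    torch.cuda.synchronize()
```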