NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to refit engine with original weights after applying lora in engine #3628

Open newgrit1004 opened 9 months ago

newgrit1004 commented 9 months ago

Description

https://github.com/NVIDIA/TensorRT/tree/release/9.2/demo/Diffusion

I've been working with the Stable Diffusion example in TensorRT 9.2, specifically on refitting weights using LoRA. First of all, I appreciate how easy it is to refit the engine weights with LoRA.
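For context, my refit path boils down to something like the following (a minimal sketch of how I use the TensorRT Refitter on an engine built as refittable; `merged_state_dict` is a placeholder for the LoRA-merged weights, not the demo's exact variable name):

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def refit_engine(engine, merged_state_dict):
    """Refit an already-built (refittable) engine in place with new weights."""
    refitter = trt.Refitter(engine, TRT_LOGGER)
    # Keep the arrays alive until refit_cuda_engine() has returned.
    live = []
    for name in refitter.get_all_weights():  # names of all refittable weights
        if name in merged_state_dict:
            w = np.ascontiguousarray(merged_state_dict[name])
            live.append(w)
            refitter.set_named_weights(name, trt.Weights(w))
    if not refitter.refit_cuda_engine():
        raise RuntimeError("engine refit failed")
```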

However, I'm encountering an issue when attempting to refit the engine with the original weights after applying LoRA. The problem arises when injecting different LoRAs in sequence, which results in unintended images that look like a blend of LoRA A and LoRA B.

To work around this, I reload the TensorRT engine, after which I get the images I intended. However, reloading the engine for every request is a bottleneck when running this as a server in production.
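The reload workaround itself is straightforward but slow (a sketch, assuming the engine is serialized at `engine_path`):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def reload_engine(engine_path):
    # Deserializing a fresh engine from disk restores the original weights,
    # but doing this for every request is far too slow for serving.
    with open(engine_path, "rb") as f:
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(f.read())
```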

Not working (engine kept loaded): LoRA A (image A) -> LoRA B (image B) -> LoRA A (looks like 0.3 * image A + 0.7 * image B).

Working: load engine -> LoRA A (image A) -> reload engine -> LoRA B (image B) -> reload engine -> LoRA A (image A).

The attached images from the failure case are below.

trt_debug_4_trt.yaml: image A, LoRA: sayakpaul/sd-model-finetuned-lora-t4

trt_debug_5_trt.yaml: image B, LoRA: WuLing/Genshin_Bennett_LoRA

trt_debug_4_trt.yaml_2: looks like 0.3 * image A + 0.7 * image B

I tried loading two engines: one for refitting weights with LoRA, and another for keeping the original weights. The lines I tried to modify are here: https://github.com/NVIDIA/TensorRT/blob/93b6044fc106b69bce6751f27aa9fc198b02bddc/demo/Diffusion/utilities.py#L207. I also tried refitting twice: the first time with the original weights, and the second time with the LoRA-merged UNet weights: https://github.com/NVIDIA/TensorRT/blob/93b6044fc106b69bce6751f27aa9fc198b02bddc/demo/Diffusion/stable_diffusion_pipeline.py#L451
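Conceptually, what I am trying to get is something like the following (a sketch building on the `refit_engine` helper above; `snapshot_weights`, `switch_lora`, and `lora_delta` are hypothetical names, and I am assuming the blending happens because each new LoRA gets merged into the currently-refitted weights instead of the clean baseline):

```python
import numpy as np

def snapshot_weights(state_dict):
    # Take one snapshot of the clean (pre-LoRA) weights right after the
    # engine is built or loaded, so any LoRA can later be "undone".
    return {name: np.copy(w) for name, w in state_dict.items()}

def switch_lora(engine, original_weights, lora_delta):
    # Always merge into the original snapshot, never into the weights
    # currently in the engine, so successive LoRAs cannot stack.
    merged = {name: w + lora_delta.get(name, 0.0)
              for name, w in original_weights.items()}
    refit_engine(engine, merged)  # helper from the sketch above

def remove_lora(engine, original_weights):
    refit_engine(engine, original_weights)  # back to the clean baseline
```

If the demo merges each new LoRA into whatever weights are currently in the engine, the two LoRAs would stack, which would explain the blended images I am seeing.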

However, this is hard for me to debug. Could you let me know an easier way to do this? Also, I can print the weight roles from a deserialized engine, but it is hard to print the weight values themselves.
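For inspection, the closest I found is listing the refittable layer/role pairs and named weights; as far as I can tell, there is no API that reads the weight values back out of a deserialized engine (sketch):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def dump_refittable(engine):
    refitter = trt.Refitter(engine, TRT_LOGGER)
    layer_names, roles = refitter.get_all()  # layer names and WeightsRole values
    for name, role in zip(layer_names, roles):
        print(f"{name}: {role}")
    print("named weights:", refitter.get_all_weights())
```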

Environment

TensorRT Version: 9.2.0.post12.dev5

NVIDIA GPU: NVIDIA A30, GPU memory: 24.0 GB

NVIDIA Driver Version: 515.105.01

CUDA Version: V12.1.105

CUDNN Version: 8.9.3

Operating System: Linux, Distribution: Linux-5.4.0-149-generic-x86_64-with-glibc2.35

Python Version (if applicable): 3.10.6

Tensorflow Version (if applicable): X

PyTorch Version (if applicable): 2.1.0a0+b5021ba

Baremetal or Container (if so, version): Container

Relevant Files

Model link: runwayml/stable-diffusion-v1-5 https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main

Steps To Reproduce

Commands or scripts:

The old LoRA format, such as sayakpaul/sd-model-finetuned-lora-t4, currently does not work: https://github.com/NVIDIA/TensorRT/blob/93b6044fc106b69bce6751f27aa9fc198b02bddc/demo/Diffusion/models.py#L213

You can modify the LoraLoader code to inject the old-format LoRA based on this PR: https://github.com/NVIDIA/TensorRT/pull/3595
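For reference, merging a LoRA into the base weights is the usual low-rank update regardless of checkpoint format (a sketch; the shapes and the `alpha / rank` scaling convention are the standard ones, not exact code from that PR):

```python
import torch

def merge_lora_weight(base_weight, lora_up, lora_down, alpha, scale=1.0):
    # W' = W + scale * (alpha / rank) * (up @ down)
    # lora_up: (out_features, rank), lora_down: (rank, in_features)
    rank = lora_down.shape[0]
    delta = (alpha / rank) * (lora_up @ lora_down)
    return base_weight + scale * delta
```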

Have you tried the latest release?: Yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

zerollzeng commented 9 months ago

Checking internally.

newgrit1004 commented 9 months ago

@zerollzeng, Thank you for your response.

I have reproduced the error, so you can follow this process:

1. Clone the repository:

   ```bash
   git clone https://github.com/newgrit1004/TensorRT.git -b refit_engine_debug --single-branch
   cd TensorRT
   ```

2. Build the Docker image for debugging the engine refit:

   ```bash
   docker build -t lora_debug .
   ```

3. Run the container using the docker compose YAML:

   ```bash
   docker compose up -d
   ```

4. Run the FastAPI server:

   ```bash
   docker exec -it lora_test_container bash

   # inside the container
   cd $TRT_OSSPATH/demo/Diffusion
   uvicorn main:app --reload --host=0.0.0.0
   ```

   Wait for the PyTorch models to download and for the ONNX model and TensorRT engine to build.

5. After the engine finishes building, open a new terminal in the container and run the client:

   ```bash
   docker exec -it lora_test_container bash

   # inside the container
   cd $TRT_OSSPATH/demo/Diffusion
   python client.py
   ```

Then check the three outputs in the output folder. These outputs reproduce exactly what I wanted to show.

You should check the available GPU id inside the docker compose YAML and the host IP inside demo/diffusion/client.py.

There are a few changes to the engine-refitting part inside demo/diffusion/stable_diffusion_pipeline.py.

Please let me know if I am using TensorRT incorrectly.