NVIDIA / Stable-Diffusion-WebUI-TensorRT

TensorRT Extension for Stable Diffusion Web UI
MIT License

Very limited LoRA Functionality #116

Open ThisIsNetsu opened 12 months ago

ThisIsNetsu commented 12 months ago

So I noticed that we can add a single LoRA to the TRT model. Problems here are:

  1. It is only ever a single LoRA at any given time, and you have to build a new TRT model for each.
  2. The implemented LoRA always seems to be at full (1.0) strength.

This seems to be a major downside currently, as it heavily limits what I can do with any given model. As long as I am just using baseline SD checkpoints this is no problem, but as soon as I introduce a LoRA to the generation process, this becomes messy and, depending on the LoRA, unusable. As soon as I want multiple LoRAs, it is basically impossible.

contentis commented 12 months ago

I'm working on improving this. Currently, we require an ONNX model containing the applied weights to refit the engine just in time. And in the short term, I don't see a workaround for this.....

For now, there are two options:

  • Use the default LoRA embeddings and export the ONNX model JIT. This takes approx. 40s for an SD1.5 model.
  • Extend the current LoRA exporter to support multiple LoRAs and strengths.
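
To make "refit the engine just in time" a bit more concrete, here is a minimal sketch of refitting an already-compiled engine with weights taken from a LoRA-fused ONNX export, rather than rebuilding it. This is an illustration under assumptions, not the extension's actual code: the paths and helper names are hypothetical, the engine must have been built with the REFIT flag, and it assumes the refittable weight names match the ONNX initializer names.

    # Hypothetical sketch only: refit a prebuilt TensorRT engine from a LoRA-fused ONNX
    # export. Paths and function names are placeholders, not the extension's real files.
    import numpy as np
    import onnx
    import tensorrt as trt
    from onnx import numpy_helper

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    def load_initializers(path):
        """Map initializer name -> numpy array for an ONNX file."""
        model = onnx.load(path)
        return {init.name: numpy_helper.to_array(init) for init in model.graph.initializer}

    def refit_engine_from_onnx(engine_path, base_onnx, lora_fused_onnx):
        base = load_initializers(base_onnx)
        fused = load_initializers(lora_fused_onnx)

        with open(engine_path, "rb") as f:
            engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

        refitter = trt.Refitter(engine, TRT_LOGGER)
        for name, weights in fused.items():
            # Only push weights that actually changed relative to the base export.
            if name in base and not np.array_equal(base[name], weights):
                refitter.set_named_weights(name, trt.Weights(np.ascontiguousarray(weights)))

        # Fails unless the engine was built as refittable and all required weights were supplied.
        assert refitter.refit_cuda_engine(), "engine refit failed"
        return engine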

devjo commented 12 months ago

I understood the second option, but not the first. What do you mean "default LoRA embeddings"? By default a checkpoint has no embedding, so what specifically did you mean by default, and how does that relate to variable LoRA strength configuration when fusing the checkpoint and LoRA weights?

As a side-note, fantastic work on this extension Luca, much appreciated! Your side-project has led to a sale of 2 x 4090 cards and a 3080 already, so I hope your employer allows you some paid time to work on this project :)

Update: I'm using the following workaround until configurable LoRA weights are added to the extension:

  1. Manually fuse the LoRAs you want into whatever checkpoint you want to use for the TensorRT variant.
    I'm using kohya_ss, which happens to be the same tool I also use for creating my LoRAs in the first place. The weight of 0.6 in the screenshot below is equivalent to specifying <lora: .... :0.6> in the SD-webui.

    fuse_lora_with_kohya

  2. Use the fused checkpoint as the sole input to the TensorRT extension when doing the conversion.

I.e. don't specify the LoRA in the extension, since you've already merged the checkpoint and LoRAs you want to use in step 1. By splitting these tasks, you can quickly verify that the fused checkpoint has the weights and visual effects you expect before spending the time converting the checkpoint to a TensorRT-optimized one.
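
For readers who want to see what that fusing step actually does numerically, here is a hedged sketch for a single linear layer. The function is illustrative only (kohya_ss has its own merge tooling); it assumes the usual LoRA parameterization where the update is up @ down scaled by alpha / rank.

    # Illustrative sketch, not kohya_ss code: baking a LoRA into a base weight at scale 0.6,
    # which is why the merged checkpoint no longer needs <lora:...:0.6> in the prompt.
    import torch

    def fuse_lora_weight(base_w: torch.Tensor,      # (out_features, in_features)
                         lora_down: torch.Tensor,   # (rank, in_features)
                         lora_up: torch.Tensor,     # (out_features, rank)
                         scale: float = 0.6,
                         alpha: float | None = None) -> torch.Tensor:
        rank = lora_down.shape[0]
        alpha = rank if alpha is None else alpha    # many LoRA files store an alpha per module
        return base_w + scale * (alpha / rank) * (lora_up @ lora_down)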

contentis commented 12 months ago

What do you mean "default LoRA embeddings"?

I am not sure what the correct terminology would be, but I meant the <loraName:Scale> syntax in the prompt.

After some more thought, I think this is my preferred way of doing things:

I appreciate the positive feedback!

devjo commented 12 months ago

When does the TensorRT compilation happen in the above two-step procedure?

contentis commented 12 months ago

The compilation happens during the export of the base model. I'm working on a POC right now to see how feasible this is. If everything goes smoothly (which it never does), I could have something hacked together in a few hours.

devjo commented 12 months ago

LoRA checkpoints still need to be exported through the TensorRT tab. Export in this case means layout and weight transformation, so no TensorRT compile for each LoRA checkpoint.

and

The compilation happens during the export of the base model.

Does this mean that only the base model (checkpoint) needs to be TRT compiled, and that you can use the lora ranks and weights as-is without creating TRT variants of them?
The statement "no TensorRT compile for each LoRA checkpoint" led me to believe the latter.

Or I may have misunderstood, and you meant there could be two separate compiles, one for the base checkpoint and another for the LoRA ranks and weights separately.

Either way, you seem to indicate there may be a way around the L x W x C (#loras x #weights-for-lora-fusing x #base-checkpoints) combinatorial explosion in the number of exported ONNX and TRT models the current scheme requires. Am I interpreting your answer correctly, that there may be a way to fuse the LoRA weights (compiled or not) with the compiled base checkpoint "engine" dynamically, at inference time?

contentis commented 12 months ago

Does this mean that only the base model (checkpoint) needs to be TRT compiled, and that you can use the lora ranks and weights as-is without creating TRT variants of them?

Yesn't - Long explanation: The engine export consists of two steps:

  1. ONNX export
  2. TensorRT engine compilation

TensorRT requires ONNX as an intermediate representation to lower the graph's IR. Therefore, we cannot leverage torch checkpoints directly, but need to lower them through ONNX. This also applies to LoRA checkpoints.

Therefore, LoRA checkpoints still need to be exported. But this only requires step 1 (ONNX export), not the actual compilation. We'll leverage the compiled base model and apply the LoRA on top.
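
As a loose illustration of what "step 1 only" means for a LoRA checkpoint, the sketch below exports a UNet (with LoRA weights already applied in PyTorch) to ONNX and stops there; no TensorRT build follows. The input names, shapes, and opset are assumptions for an SD 1.5-style UNet whose forward takes (sample, timesteps, context), not the extension's exact exporter.

    # Hedged sketch: ONNX export only, no TensorRT compilation. Shapes/names are illustrative.
    import torch

    def export_unet_to_onnx(unet: torch.nn.Module, onnx_path: str, device: str = "cuda") -> None:
        unet = unet.to(device).eval()
        sample = torch.randn(2, 4, 64, 64, device=device)     # latent batch (cond + uncond)
        timesteps = torch.tensor([981, 981], device=device)   # one timestep per batch item
        context = torch.randn(2, 77, 768, device=device)      # CLIP text embeddings (SD 1.5)
        torch.onnx.export(
            unet,
            (sample, timesteps, context),
            onnx_path,
            input_names=["sample", "timesteps", "encoder_hidden_states"],
            output_names=["latent"],
            opset_version=17,
            dynamic_axes={
                "sample": {0: "batch", 2: "height", 3: "width"},
                "encoder_hidden_states": {0: "batch"},
            },
        )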


Here is an end-to-end example assuming you start from scratch:

I have SD 1.5 installed and two random LoRAs (BarbieCore, pk_trainer768_V1)

  1. I need to export the base model (SD1.5) to TensorRT (ONNX Export + Compile)
  2. I need to export my LoRA checkpoints (ONNX Export only)

From here, LoRAs should work natively in the UI using prompts like: <lora:BarbieCore:0.7> <lora:pk_trainer768_V1:1> A pixel image of a man with a sword Barbiecore

I hope this clarifies things, and isn't more confusing than before :D

contentis commented 12 months ago

Here is a screenshot of how it looks at inference. This also allows the SD Unet dropdown to be set to Automatic.

Screenshot 2023-10-25 at 12 46 31

Limitations

contentis commented 12 months ago

I pushed my PoC to lora_v2 in case you dare to test it.

devjo commented 12 months ago

I pushed my PoC to lora_v2 in case you dare to test it.

I did, and it worked flawlessly. My hat's off to you. Awesome job!

Applying the LoRA is pretty slow at the moment (~10s). But when using the same scales and loras this is being cached.

Yes, but unless someone rotates a bunch of different LoRAs in and out constantly in the XYZ plot, it shouldn't be a problem. The initial load of a new LoRA took about as long as you describe, but once loaded, I could tweak the weights without that initial pause; rendering started instantly. The main use case for varying weights a lot is likely the XYZ plot, and I'm happy to report that it iterates over different weights in the same LoRA(s) at full throttle.

I'm still a bit perplexed that there was no discernible performance impact (regression), since only the base checkpoint is TRT-compiled and the LoRA ranks and weights seem not to be (unless you do on-the-fly compilation during loading at inference time). I haven't had time to look at the code change yet.
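
The "~10s, but cached for the same scales and LoRAs" behaviour quoted above suggests a pattern like the one below. This is purely a speculative illustration, not the branch's actual code: build_refit_dict is a made-up stand-in for the expensive weight-merging step, and only the caching key matters here.

    # Speculative illustration of caching refits by (base model, LoRA names + scales).
    from functools import lru_cache

    def build_refit_dict(base_onnx: str, loras: dict) -> dict:
        # Stand-in for the expensive part: reading ONNX weights and scaling LoRA deltas.
        return {"base": base_onnx, "loras": loras}

    @lru_cache(maxsize=4)
    def get_refit_dict(base_onnx: str, lora_specs: tuple) -> dict:
        # lora_specs must be hashable, e.g. (("BarbieCore", 0.7), ("pk_trainer768_V1", 1.0)),
        # so an unchanged LoRA/scale combination hits the cache instead of re-merging.
        return build_refit_dict(base_onnx, dict(lora_specs))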

For anyone else wanting to use this:

  1. Go into the extension directory and check out the lora_v2 branch.
  2. Delete any fused checkpoint + LoRA TensorRT and ONNX files that you may have created earlier for those combinations. You can keep the ONNX and TRT models for the base checkpoint itself, since the code change doesn't seem to affect those.
  3. Edit the model.json file in the Unet-trt directory by removing the stanzas (blocks) related to the files deleted in step 2.
  4. Apply the LoRAs as usual with SD-webUI, and enjoy the 2x rendering speedup.

contentis commented 12 months ago

I am happy to hear it works on a machine other than mine :D I also tested LyCORIS, and for me, it seemed to have worked fine.

One general disclaimer: This is a work in progress, and there might be breaking changes before it finds its way into the main branch.

Sniper199999 commented 12 months ago

On the fly Lora... IT WORKS!!!

With my limited testing on 3 different LoRAs, here is my feedback: the LoRAs have a small effect on the generated images. That is, the LoRAs seem to work, but they are not as effective as when I use them without TensorRT. Here is the error I get:

Apllying LoRAs: ['lora:TheRockV3:1']███████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.89it/s]
*** Error running process: C:\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py
    Traceback (most recent call last):
      File "C:\stable-diffusion-webui\modules\scripts.py", line 623, in process
        script.process(p, *script_args)
      File "C:\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 191, in process
        self.get_loras(p)
      File "C:\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 229, in get_loras
        modelmanager.available_models()[lora_name][0]["filepath"],
    KeyError: 'TheRockV3'

Everything works as expected. I got this error due to my misunderstanding: you need to export the LoRA models to TRT format, and you need to keep the base model and the LoRA models in the Unet-trt folder. You can delete the models inside the onnx folders. The exported TRT LoRA models seem to be around 500 MB; can they be compressed further?

szokolai-mate commented 11 months ago

+1, I've also managed to get it working. Amazing stuff; now I only need ControlNet to completely move over to TensorRT. I started with a fresh installation with only this extension and, my own incompetence aside, there were no issues. The resulting image is exactly the same with and without the TensorRT UNet enabled.

Do you think there would be major work needed before this can be merged?

ThisIsNetsu commented 11 months ago

I am getting an error when trying to convert a lora to trt:

    [W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
    [E] ONNX-Runtime is not installed, so constant folding may be suboptimal or not work at all. Consider installing ONNX-Runtime: I:\sd.webui\system\python\python.exe -m pip install onnxruntime
    [!] Module: 'onnxruntime.tools.symbolic_shape_infer' is required but could not be imported. Note: Error was: No module named 'onnxruntime' You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
    [W] colored module is not installed, will not use colors when logging. To enable colors, please install the colored module: python3 -m pip install colored
    [W] Inference failed. You may want to try enabling partitioning to see better results. Note: Error was: No module named 'onnxruntime'
    [!] Module: 'onnxruntime.tools.symbolic_shape_infer' is required but could not be imported. Note: Error was: No module named 'onnxruntime' You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
    [W] colored module is not installed, will not use colors when logging. To enable colors, please install the colored module: python3 -m pip install colored
    [W] Inference failed. You may want to try enabling partitioning to see better results. Note: Error was: No module named 'onnxruntime'
    [!] Module: 'onnxruntime.tools.symbolic_shape_infer' is required but could not be imported. Note: Error was: No module named 'onnxruntime' You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
    Exported to ONNX.
    Traceback (most recent call last):
      File "I:\sd.webui\system\python\lib\site-packages\gradio\routes.py", line 488, in run_predict
        output = await app.get_blocks().process_api(
      File "I:\sd.webui\system\python\lib\site-packages\gradio\blocks.py", line 1431, in process_api
        result = await self.call_function(
      File "I:\sd.webui\system\python\lib\site-packages\gradio\blocks.py", line 1103, in call_function
        prediction = await anyio.to_thread.run_sync(
      File "I:\sd.webui\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
        return await get_asynclib().run_sync_in_worker_thread(
      File "I:\sd.webui\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
        return await future
      File "I:\sd.webui\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
        result = context.run(func, *args)
      File "I:\sd.webui\system\python\lib\site-packages\gradio\utils.py", line 707, in wrapper
        response = f(*args, **kwargs)
      File "I:\sd.webui\webui\extensions\Stable-Diffusion-WebUI-TensorRT\ui_trt.py", line 247, in export_lora_to_trt
        if len(available_trt_unet[base_name]) == 0:
    KeyError: 'revAnimated_v122'

mort666 commented 11 months ago

Hi, so I've been testing out your lora_v2 branch, and while inpainting I've been getting an error about a missing attribute, as highlighted below.

*** Error running before_process: /content/stable-diffusion-webui/extensions/Stable-Diffusion-WebUI-TensorR/scripts/trt.py
    Traceback (most recent call last):
      File "/content/stable-diffusion-webui/modules/scripts.py", line 615, in before_process
        script.before_process(p, *script_args)
      File "/content/stable-diffusion-webui/extensions/Stable-Diffusion-WebUI-TensorR/scripts/trt.py", line 129, in before_process
        if p.enable_hr:
    AttributeError: 'StableDiffusionProcessingImg2Img' object has no attribute 'enable_hr'

I have seen this play out with a couple of other extensions previously. You may want to consider using something like getattr() to check whether the attribute has a value and, if it is not set, return a default value. You could do something like the following where you previously accessed it:

    def before_process(self, p, *args):
        # Check divisibility
        if p.width % 64 or p.height % 64:
            gr.Error("Target resolution must be divisible by 64 in both dimensions.")

        enable_hr = getattr(p, 'enable_hr', False)
        if enable_hr:
            hr_w = int(p.width * p.hr_scale)
            hr_h = int(p.height * p.hr_scale)
            if hr_w % 64 or hr_h % 64:
                gr.Error(
                    "HIRES Fix resolution must be divisible by 64 in both dimensions. Please change the upscale factor or disable HIRES Fix."
                )

That should achieve the same results as you were intending.

DuckersMcQuack commented 11 months ago

I pushed my PoC to lora_v2 in case you dare to test it.

So to make sure, is lora_v2 basically dev branch?

And do you have any tips to avoid needing thirty 2 GB engines for every resolution/aspect ratio variable known to man? xD Currently I have 30 different ones; from the gist I got, a more dynamic-resolution engine will allow more resolutions, but won't be as fast.

worksforme commented 11 months ago

So to make sure, is lora_v2 basically dev branch?

lora_v2 has commits on top of the dev branch that allow loading multiple LoRAs and changing LoRA weights. You have to convert the LoRA to TensorRT, but you do not need to specify a resolution.

And do you have any tips to not need 30 2GB for every resolution/aspect ratio variable known to man? xD As currently i got 30 different ones as from the gist i got, a more dynamic res one will allow more different resolution, but not as fast.

I would create one dynamic engine if you use multiple resolutions and aspect ratios. Here's an example for SD 1.5 for generations from 512-768 (any aspect ratio) with hires fix up to 2x.

(screenshot: dynamic engine export settings)

I do not know what the optimal field does, so I set it to match the max (if anyone has an explanation, please share). It still gives a big speed improvement over not using TensorRT.
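
For what it's worth, the min/optimal/max fields most likely correspond to a TensorRT optimization profile: the engine's kernels are auto-tuned for the opt shape, while any shape between min and max stays valid but may run somewhat slower than opt. A hedged sketch of what that looks like in the TensorRT Python API (the input names and latent sizes are illustrative, not the extension's exact ones):

    # Hedged sketch of a dynamic-shape optimization profile, roughly matching
    # "512-768 px, any aspect ratio, hires fix up to 2x" for SD 1.5 (latents are 1/8 of pixels).
    import tensorrt as trt

    def add_unet_profile(builder: trt.Builder, config: trt.IBuilderConfig) -> trt.IBuilderConfig:
        profile = builder.create_optimization_profile()
        profile.set_shape("sample",
                          min=(2, 4, 64, 64),     # 512 px
                          opt=(2, 4, 96, 96),     # 768 px -- the shape kernels are tuned for
                          max=(2, 4, 192, 192))   # 1536 px (768 px with 2x hires fix)
        profile.set_shape("encoder_hidden_states",
                          min=(2, 77, 768), opt=(2, 77, 768), max=(2, 154, 768))
        config.add_optimization_profile(profile)
        return config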

intoempty commented 11 months ago

How do I indicate on the TensorRT tab that I want an "ONNX Export only" for a given Lora?

Moreover, what's the obvious thing to check if I've:

and the Loras are not being applied at inference?

Is it that I need to delete the onnx loras?

FerLuisxd commented 10 months ago

I am getting an error when trying to convert a lora to trt:

    [W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
    [E] ONNX-Runtime is not installed, so constant folding may be suboptimal or not work at all. Consider installing ONNX-Runtime: I:\sd.webui\system\python\python.exe -m pip install onnxruntime
    [!] Module: 'onnxruntime.tools.symbolic_shape_infer' is required but could not be imported. Note: Error was: No module named 'onnxruntime' You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
    [W] colored module is not installed, will not use colors when logging. To enable colors, please install the colored module: python3 -m pip install colored
    [W] Inference failed. You may want to try enabling partitioning to see better results. Note: Error was: No module named 'onnxruntime'
    [!] Module: 'onnxruntime.tools.symbolic_shape_infer' is required but could not be imported. Note: Error was: No module named 'onnxruntime' You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
    [W] colored module is not installed, will not use colors when logging. To enable colors, please install the colored module: python3 -m pip install colored
    [W] Inference failed. You may want to try enabling partitioning to see better results. Note: Error was: No module named 'onnxruntime'
    [!] Module: 'onnxruntime.tools.symbolic_shape_infer' is required but could not be imported. Note: Error was: No module named 'onnxruntime' You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
    Exported to ONNX.
    Traceback (most recent call last):
      File "I:\sd.webui\system\python\lib\site-packages\gradio\routes.py", line 488, in run_predict
        output = await app.get_blocks().process_api(
      File "I:\sd.webui\system\python\lib\site-packages\gradio\blocks.py", line 1431, in process_api
        result = await self.call_function(
      File "I:\sd.webui\system\python\lib\site-packages\gradio\blocks.py", line 1103, in call_function
        prediction = await anyio.to_thread.run_sync(
      File "I:\sd.webui\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
        return await get_asynclib().run_sync_in_worker_thread(
      File "I:\sd.webui\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
        return await future
      File "I:\sd.webui\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
        result = context.run(func, *args)
      File "I:\sd.webui\system\python\lib\site-packages\gradio\utils.py", line 707, in wrapper
        response = f(*args, **kwargs)
      File "I:\sd.webui\webui\extensions\Stable-Diffusion-WebUI-TensorRT\ui_trt.py", line 247, in export_lora_to_trt
        if len(available_trt_unet[base_name]) == 0:
    KeyError: 'revAnimated_v122'

Just a quick question about this log: is installing ONNX-Runtime necessary or a good thing, either to avoid that error or to get faster compile times?

qybing commented 10 months ago

When I switched to the Lora_v2 branch and converted it to TensorRT, I encountered the following error while using it.

Error running process: J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py
    Traceback (most recent call last):
      File "J:\x\stable-diffusion-webui\modules\scripts.py", line 710, in process
        script.process(p, *script_args)
      File "J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 191, in process
        self.get_loras(p)
      File "J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 238, in get_loras
        refit_dict = apply_loras(base_path, lora_pathes, lora_scales)
      File "J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\lora.py", line 32, in apply_loras
        add_to_map(refit_dict, name, n.outputs[0].values)
    AttributeError: 'Variable' object has no attribute 'values'

CoolCuda commented 10 months ago

Hello everybody

Do you know if it's normal that it creates a single TRT file per LoRA and per checkpoint? Each generated TRT file inside the models\Unet-trt folder is 1.6 GB. Do you know if I can do an optimization?

Because if I have 15 LoRAs and 3 checkpoints, it's no problem to take the time to generate them all, but it will take too much space on the hard drive...

Thank you

bigmover commented 4 months ago

I'm working on improving this. Currently, we require an ONNX model containing the applied weights to refit the engine just in time. And in the short term, I don't see a workaround for this.....

For now, there are two options:

  • Use the default LoRA embeddings and export the ONNX model JIT. This takes approx. 40s for an SD1.5 model.
  • Extend the current LoRA exporter to support multiple LoRAs and strengths.

I found an issue with LoRA when switching engines: a LoRA ONNX exported against model A can't be refitted into model B because of a channel shape mismatch. Any solution or advice about it?

bigmover commented 4 months ago

Hello everybody

Do you know if it's normal that it creates a single TRT file per LoRA and per checkpoint? Each generated TRT file inside the models\Unet-trt folder is 1.6 GB. Do you know if I can do an optimization?

Because if I have 15 LoRAs and 3 checkpoints, it's no problem to take the time to generate them all, but it will take too much space on the hard drive...

Thank you

Perhaps it's normal. If you don't care about inference time, perhaps you can apply the LoRA at runtime.

bigmover commented 4 months ago


Hi guys! Would you mind sharing the method for using one LoRA on different base models, as in PyTorch?