ThisIsNetsu opened this issue 12 months ago
I'm working on improving this. Currently, we require an ONNX model containing the applied weights to refit the engine just in time, and in the short term I don't see a workaround for this.
For now, there are two options:
- Use the default LoRA embeddings and export the ONNX model JIT. This takes approx. 40 s for an SD 1.5 model.
- Extend the current LoRA exporter to support multiple LoRAs and strengths.
I understood the second option, but not the first. What do you mean by "default LoRA embeddings"? By default, a checkpoint has no embeddings, so what specifically did you mean by "default", and how does that relate to configuring a variable LoRA strength when fusing the checkpoint and LoRA weights?
As a side note: fantastic work on this extension, Luca, much appreciated! Your side project has already led to the sale of two 4090 cards and a 3080, so I hope your employer allows you some paid time to work on this project :)
Update: I'm using the following workaround until configurable LoRA weights are added to the extension:
1. Manually fuse the LoRAs you want into whatever checkpoint you want to use for the TensorRT variant. I'm using kohya_ss, which happens to be the same tool I use for creating my LoRAs in the first place. The weight of 0.6 in the screenshot below is equivalent to specifying <lora: .... :0.6> in the SD-webui.
2. Use the fused checkpoint as the sole input to the TensorRT extension when doing the conversion. I.e. don't specify the LoRA in the extension, since you've already merged the checkpoint and the LoRAs you want to use in step 1.

By splitting these tasks, you can quickly verify that the fused checkpoint has the weights and visual effects you expect before spending the time converting the checkpoint to a TensorRT-optimized one.
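Under the hood, fusing a LoRA at a given strength just bakes a scaled low-rank update into each affected weight matrix. A minimal NumPy sketch of the idea (the alpha/rank convention is an assumption about how trainers scale the update, not kohya_ss's exact internals):

```python
import numpy as np

def fuse_lora_weight(w, lora_down, lora_up, scale=0.6, alpha=None):
    """Return w + scale * (alpha / rank) * (up @ down), the fused weight."""
    rank = lora_down.shape[0]
    if alpha is None:
        alpha = rank  # many trainers default alpha to the rank
    return w + scale * (alpha / rank) * (lora_up @ lora_down)

# Toy example: a 320x320 projection with a rank-4 LoRA.
rng = np.random.default_rng(0)
w = np.zeros((320, 320))
down = rng.normal(size=(4, 320))   # the "lora_down" (A) matrix
up = rng.normal(size=(320, 4))     # the "lora_up" (B) matrix
fused = fuse_lora_weight(w, down, up, scale=0.6)
```

The `scale=0.6` here plays the same role as the 0.6 in the prompt syntax; once fused, the checkpoint no longer needs the LoRA file at inference.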
What do you mean "default LoRA embeddings"?
I am not sure what the correct terminology would be, but I meant the <loraName:Scale> syntax in the prompt.
After some more thought, I think this is my preferred way of doing things: the <loraName:Scale> syntax can be used (given the LoRAs you want to use have been exported).

I appreciate the positive feedback!
When does the TensorRT compilation happen in the above two-step procedure?
The compilation happens during the export of the base model. I'm working on a POC right now to see how feasible this is. If everything goes smoothly (which it never does), I could have something hacked together in a few hours.
LoRA checkpoints still need to be exported through the TensorRT tab. Export in this case means layout and weight transformation, so there is no TensorRT compile for each LoRA checkpoint.
and
The compilation happens during the export of the base model.
Does this mean that only the base model (checkpoint) needs to be TRT compiled, and that you can use the lora ranks and weights as-is without creating TRT variants of them?
The statement "no TensorRT compile for each LoRA checkpoint" led me to believe the latter.
Or I may have misunderstood, and you meant there could be two separate compiles, one for the base checkpoint and another for the LoRA ranks and weights separately.
Either way, you seem to indicate there may be a way around the L x W x C (#loras x #weights-for-lora-fusing x #base-checkpoints) combinatorial explosion in the number of exported ONNX and TRT models the current scheme requires. So am I interpreting your answer correctly: there may be a way to fuse the LoRA weights (compiled or not) with the compiled base checkpoint "engine" dynamically, at inference time?
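To put rough numbers on that explosion (the 1.6 GB per-engine size and the LoRA/checkpoint counts come from later in this thread; the five strength values are a hypothetical choice):

```python
# Engines needed if every (lora, strength, checkpoint) combination must be
# exported separately, versus one engine per checkpoint with dynamic fusing.
n_loras, n_strengths, n_checkpoints = 15, 5, 3
engine_gb = 1.6  # approximate size of one exported TRT engine

static_engines = n_loras * n_strengths * n_checkpoints   # 225
dynamic_engines = n_checkpoints                          # 3

print(f"static:  {static_engines} engines, ~{static_engines * engine_gb:.0f} GB")
print(f"dynamic: {dynamic_engines} engines, ~{dynamic_engines * engine_gb:.0f} GB")
```

Even with these modest counts, static export is hundreds of gigabytes, which is why dynamic fusing at inference time matters.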
Does this mean that only the base model (checkpoint) needs to be TRT compiled, and that you can use the lora ranks and weights as-is without creating TRT variants of them?
Yesn't. Long explanation: the engine export consists of two steps:
1. ONNX export: TensorRT requires ONNX as an intermediate representation to lower the graph IR. Therefore, we cannot leverage torch checkpoints directly, but need to lower them through ONNX. This also applies to LoRA checkpoints.
2. TensorRT compilation of the resulting ONNX graph into an engine.

Therefore LoRA checkpoints still need to be exported, but this only requires step 1 (the ONNX export) rather than the actual compilation. We'll leverage the compiled base models and apply the LoRA then.
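Conceptually, the runtime step boils down to rebuilding each engine weight as the base weight plus the scaled low-rank deltas of the active LoRAs, then refitting the compiled engine with those values. A NumPy sketch of the arithmetic (names and dict layout are illustrative, not the branch's actual refit code):

```python
import numpy as np

def build_refit_dict(base_weights, loras, scales):
    """base_weights: {name: W}; loras: list of {name: (down, up)}; scales: per-LoRA floats.
    Returns {name: W + sum_i scale_i * up_i @ down_i} to hand to the engine refitter."""
    refit = {}
    for name, w in base_weights.items():
        delta = 0.0
        for lora, s in zip(loras, scales):
            if name in lora:
                down, up = lora[name]
                delta = delta + s * (up @ down)
        refit[name] = w + delta
    return refit

base = {"attn_proj": np.eye(4)}
lora_a = {"attn_proj": (np.ones((2, 4)), np.ones((4, 2)))}
refit = build_refit_dict(base, [lora_a], [0.7])
```

Because only weight values change (not shapes or graph structure), the compiled base engine can be reused and merely refitted, which is why no per-LoRA TensorRT compile is needed.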
Here is an end-to-end example assuming you start from scratch:
I have SD 1.5 installed and two random LoRAs (BarbieCore, pk_trainer768_V1)
From here, LoRA should work as it does natively in the UI, using prompts like: <lora:BarbieCore:0.7> <lora:pk_trainer768_V1:1> A pixel image of a man with a sword Barbiecore
I hope this clarifies things, and isn't more confusing than before :D
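For anyone curious how the `<lora:name:scale>` tags map to (name, scale) pairs, a generic regex sketch (this is not the extension's actual parser, just an illustration of the idea):

```python
import re

LORA_TAG = re.compile(r"<lora:([^:>]+):([0-9.]+)>")

def extract_loras(prompt):
    """Split a prompt into (cleaned_prompt, [(lora_name, scale), ...])."""
    pairs = [(m.group(1), float(m.group(2))) for m in LORA_TAG.finditer(prompt)]
    cleaned = LORA_TAG.sub("", prompt).strip()
    return cleaned, pairs

cleaned, pairs = extract_loras(
    "<lora:BarbieCore:0.7> <lora:pk_trainer768_V1:1> A pixel image of a man with a sword"
)
# pairs -> [("BarbieCore", 0.7), ("pk_trainer768_V1", 1.0)]
```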
Here is a screenshot of how it looks at inference. This also allows the SD Unet dropdown to be set to Automatic.
Limitations
I pushed my PoC to lora_v2 in case you dare to test it.
I pushed my PoC to lora_v2 in case you dare to test it.
I did, and it worked flawlessly. My hat's off to you. Awesome job!
Applying the LoRA is pretty slow at the moment (~10 s), but the result is cached when using the same scales and LoRAs.
Yes, but unless someone constantly rotates a bunch of different LoRAs in and out in the XYZ plot, it shouldn't be a problem. The initial load of a new LoRA took about as long as you describe, but once loaded, I could tweak the weights without that initial pause; rendering started instantly. The main use case for varying weights a lot is likely the XYZ plot, and I'm happy to report that it iterates over different weights in the same LoRA(s) at full throttle.
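That behavior matches a cache keyed on the exact (LoRA, scale) combination: only an unseen combination pays the ~10 s refit cost. A generic sketch of such memoization (not the branch's actual implementation):

```python
_refit_cache = {}

def get_refit(lora_names, scales, compute):
    """Memoize the expensive refit computation per LoRA/scale combination."""
    key = tuple(zip(lora_names, scales))
    if key not in _refit_cache:
        _refit_cache[key] = compute(lora_names, scales)  # the slow (~10 s) step
    return _refit_cache[key]

calls = []
def slow_compute(names, scales):
    calls.append(1)  # track how often the slow path actually runs
    return dict(zip(names, scales))

first = get_refit(["BarbieCore"], [0.7], slow_compute)
again = get_refit(["BarbieCore"], [0.7], slow_compute)   # cache hit, no recompute
other = get_refit(["BarbieCore"], [0.9], slow_compute)   # new scale -> recompute
```

This is why tweaking weights in an XYZ plot only pauses the first time each weight value is seen.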
I'm still a bit perplexed that there was no discernible performance impact (regression), since only the base checkpoint is TRT-compiled and the LoRA ranks and weights seem not to be (unless compilation happens on the fly while loading at inference time). I haven't had time to look at the code change yet.
For anyone else wanting to use this:
1. Switch to the lora_v2 branch.
2. Delete the previously exported LoRA files.
3. Clean up the model.json file in the Unet-trt directory by removing the stanzas (blocks) related to the files deleted in step 2.

I am happy to hear it works on a machine other than mine :D I also tested LyCORIS, and for me, it seemed to work fine.
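The model.json cleanup can be scripted. The sketch below assumes model.json maps a model name to a list of entries that each carry a "filepath" key (which matches the tracebacks elsewhere in this thread); treat that structure as an assumption and adjust to your file:

```python
import json
import os

def prune_model_json(path):
    """Remove model.json entries whose exported files no longer exist on disk."""
    with open(path) as f:
        data = json.load(f)
    pruned = {}
    for name, entries in data.items():
        kept = [e for e in entries if os.path.exists(e["filepath"])]
        if kept:  # drop the whole stanza if nothing survives
            pruned[name] = kept
    with open(path, "w") as f:
        json.dump(pruned, f, indent=2)
    return pruned
```

Back up model.json before running anything like this; the branch is a moving target.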
One general disclaimer: This is a work in progress, and there might be breaking changes before it finds its way into the main branch.
On the fly Lora... IT WORKS!!!
With my limited testing on 3 different LoRAs, here is my feedback: the LoRAs have only a small effect on the generated images. That is, the LoRAs seem to work, but they are not as effective as when I use them without TensorRT.
Here is the error I get:
Apllying LoRAs: ['lora:TheRockV3:1']███████████████████████████████████████████████████| 30/30 [00:04<00:00, 7.89it/s]
*** Error running process: C:\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py
Traceback (most recent call last):
File "C:\stable-diffusion-webui\modules\scripts.py", line 623, in process
script.process(p, *script_args)
File "C:\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 191, in process
self.get_loras(p)
File "C:\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 229, in get_loras
modelmanager.available_models()[lora_name][0]["filepath"],
KeyError: 'TheRockV3'
Everything works as expected; I got this error due to my own misunderstanding. You need to export LoRA models to TRT format, and you need to keep the base model and the LoRA models in the Unet-trt folder. You can delete the models inside the onnx folders. The exported TRT LoRA models seem to be around 500 MB; can they be compressed further?
+1, I've also managed to get it working. Amazing stuff; now I only need ControlNet to completely move over to TensorRT. I started with a fresh installation with only this extension, and besides my own incompetence, there were no issues. The resulting image is exactly the same with and without the TensorRT unet enabled.
Do you think there would be major work needed before this can be merged?
I am getting an error when trying to convert a lora to trt:
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
[E] ONNX-Runtime is not installed, so constant folding may be suboptimal or not work at all. Consider installing ONNX-Runtime: I:\sd.webui\system\python\python.exe -m pip install onnxruntime
[!] Module: 'onnxruntime.tools.symbolic_shape_infer' is required but could not be imported. Note: Error was: No module named 'onnxruntime'
You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
[W] Inference failed. You may want to try enabling partitioning to see better results. Note: Error was: No module named 'onnxruntime'
[... the same warnings repeat twice more ...]
Exported to ONNX.
Traceback (most recent call last):
  File "I:\sd.webui\system\python\lib\site-packages\gradio\routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "I:\sd.webui\system\python\lib\site-packages\gradio\blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "I:\sd.webui\system\python\lib\site-packages\gradio\blocks.py", line 1103, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "I:\sd.webui\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "I:\sd.webui\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "I:\sd.webui\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "I:\sd.webui\system\python\lib\site-packages\gradio\utils.py", line 707, in wrapper
    response = f(*args, **kwargs)
  File "I:\sd.webui\webui\extensions\Stable-Diffusion-WebUI-TensorRT\ui_trt.py", line 247, in export_lora_to_trt
    if len(available_trt_unet[base_name]) == 0:
KeyError: 'revAnimated_v122'
Hi, So I've been testing out your Lora_v2 branch and while inpainting I've been getting an error about a missing attribute as highlighted below.
*** Error running before_process: /content/stable-diffusion-webui/extensions/Stable-Diffusion-WebUI-TensorR/scripts/trt.py
Traceback (most recent call last):
File "/content/stable-diffusion-webui/modules/scripts.py", line 615, in before_process
script.before_process(p, *script_args)
File "/content/stable-diffusion-webui/extensions/Stable-Diffusion-WebUI-TensorR/scripts/trt.py", line 129, in before_process
if p.enable_hr:
AttributeError: 'StableDiffusionProcessingImg2Img' object has no attribute 'enable_hr'
I have seen this play out with a couple of other extensions previously. You may want to consider using something like getattr() to check whether the attribute exists and fall back to a default value if it is not set. You could do something like the following where you previously accessed it:
def before_process(self, p, *args):
    # Check divisibility
    if p.width % 64 or p.height % 64:
        gr.Error("Target resolution must be divisible by 64 in both dimensions.")

    # img2img processing objects have no enable_hr attribute, so default to False
    enable_hr = getattr(p, 'enable_hr', False)
    if enable_hr:
        hr_w = int(p.width * p.hr_scale)
        hr_h = int(p.height * p.hr_scale)
        if hr_w % 64 or hr_h % 64:
            gr.Error(
                "HIRES Fix resolution must be divisible by 64 in both dimensions. Please change the upscale factor or disable HIRES Fix."
            )
That should achieve the same results as you were intending.
I pushed my PoC to lora_v2 in case you dare to test it.
So to make sure, is lora_v2 basically the dev branch?
And do you have any tips to avoid needing 30 engines of 2 GB each for every resolution/aspect-ratio combination known to man? xD Currently I've got 30 different ones; from the gist I gathered that a more dynamic-resolution engine will allow more resolutions, but won't be as fast.
So to make sure, is lora_v2 basically dev branch?
lora_v2 has commits on top of the dev branch which allow loading multiple LoRAs and changing LoRA weights. You have to convert the LoRA to TensorRT but do not need to specify a resolution.
And do you have any tips to avoid needing 30 engines of 2 GB each for every resolution/aspect-ratio combination known to man? xD Currently I've got 30 different ones; from the gist I gathered that a more dynamic-resolution engine will allow more resolutions, but won't be as fast.
I would create one dynamic engine if you use multiple resolutions and aspect ratios. Here's an example for SD 1.5 for generations from 512 to 768 (any aspect ratio) with hires fix up to 2x.
I do not know what the optimal field does, so I set it to match the max (if anyone has an explanation, please share). It still gives a big speed improvement over not using TensorRT.
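For context on those min/opt/max fields: the UNet works on latents at 1/8 of the image resolution, so a 512-768 profile with 2x hires fix implies the latent ranges below. A small sketch of the arithmetic (TensorRT optimization profiles take a (min, opt, max) shape triple per input; setting opt to max mirrors the choice above, though opt is normally the shape the engine is tuned for):

```python
def unet_latent_profile(min_px=512, max_px=768, hires_scale=2.0, latent_factor=8):
    """Compute latent height/width bounds for a dynamic-shape profile."""
    lo = min_px // latent_factor                      # 512 px  -> 64 latents
    hi = int(max_px * hires_scale) // latent_factor   # 1536 px -> 192 latents
    return {"min": lo, "opt": hi, "max": hi}          # opt pinned to max, as above

profile = unet_latent_profile()
# profile -> {"min": 64, "opt": 192, "max": 192}
```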
How do I indicate on the TensorRT tab that I want an "ONNX export only" for a given LoRA?
Moreover, what's the obvious thing to check if I've:
- checked out the lora_v2 branch,
- selected Automatic in the SD Unet selector in the UI,
- added <lora:X> to the prompt,
and the LoRAs are not being applied at inference? Is it that I need to delete the ONNX LoRAs?
I am getting an error when trying to convert a lora to trt:
[same ONNX-Runtime warnings and KeyError: 'revAnimated_v122' traceback as quoted above]
Just a quick question about this log: is installing ONNX-Runtime necessary or beneficial, i.e. does it remove that error or give faster compile times?
When I switched to the lora_v2 branch and converted a LoRA to TensorRT, I encountered the following error while using it.
Error running process: J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py
Traceback (most recent call last):
File "J:\x\stable-diffusion-webui\modules\scripts.py", line 710, in process
script.process(p, *script_args)
File "J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 191, in process
self.get_loras(p)
File "J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\trt.py", line 238, in get_loras
refit_dict = apply_loras(base_path, lora_pathes, lora_scales)
File "J:\x\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\scripts\lora.py", line 32, in apply_loras
add_to_map(refit_dict, name, n.outputs[0].values)
AttributeError: 'Variable' object has no attribute 'values'
Hello everybody,
Do you know if it's normal that it creates a single TRT file per LoRA and per checkpoint? Each generated TRT file inside the models\Unet-trt folder is 1.6 GB. Is there any optimization I can do?
Because if I have 15 LoRAs and 3 checkpoints, it's no problem to take the time to generate them, but they will take too much space on the hard drive...
Thank you
I'm working on improving this. Currently, we require an ONNX model containing the applied weights to refit the engine just in time, and in the short term I don't see a workaround for this.
For now, there are two options:
- Use the default LoRA embeddings and export the ONNX model JIT. This takes approx. 40 s for an SD 1.5 model.
- Extend the current LoRA exporter to support multiple LoRAs and strengths.
I found an issue with LoRAs when switching engines: a LoRA exported against model A's ONNX can't be fitted into model B's engine because of mismatching channel shapes. Any solution or advice about it?
Hello everybody,
Do you know if it's normal that it creates a single TRT file per LoRA and per checkpoint? Each generated TRT file inside the models\Unet-trt folder is 1.6 GB. Is there any optimization I can do?
Because if I have 15 LoRAs and 3 checkpoints, it's no problem to take the time to generate them, but they will take too much space on the hard drive...
Thank you
Perhaps it's normal. If you don't care about inference time, you could perhaps calculate the LoRA at runtime.
Hi guys! Would you mind sharing the method for using one LoRA on different base models, the way PyTorch allows?
So I noticed that we can add a single LoRA to the TRT model. Problems here are:
This seems to be a major downside currently, as it heavily limits what I can do with any given model. As long as I am just using baseline SD checkpoints this is no problem, but as soon as I introduce a LoRA to the generation process, things become messy and, depending on the LoRA, unusable. As soon as I want multiple LoRAs, it basically becomes impossible.