elliotgsy opened 1 year ago
Hello. May I ask how you got it to run on Windows? Were the OSS plugins required?
I'm curious whether this would affect the actual generated image, since it says it optimizes CLIP, which can really change how a generation turns out depending on the value. Would still love to have it as an option though. 60 steps in under a second is crazy.
Note that a working version of inference with TensorRT on windows is shown here: https://github.com/ddPn08/Lsmith
Another note is that these plan files are built specifically for your GPU and the libraries in use, so users need to give each model they want to use ~10 minutes to build and compile into a .plan file.
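The build step described above (roughly: ONNX → optimized, serialized engine) might look something like the sketch below. This is a hedged illustration using TensorRT's Python bindings, not anything from this repo; paths and flags are assumptions, and it requires the `tensorrt` package plus an NVIDIA GPU to actually run.

```python
# Hypothetical sketch of compiling an exported ONNX model (e.g. the UNet)
# into a GPU-specific TensorRT ".plan" engine. Requires the `tensorrt`
# package and an NVIDIA GPU; names and paths here are illustrative only.
try:
    import tensorrt as trt
    TRT_AVAILABLE = True
except ImportError:
    TRT_AVAILABLE = False

def build_plan(onnx_path: str, plan_path: str) -> None:
    """Parse an ONNX file and serialize an optimized TensorRT engine.

    This is the step that can take ~10 minutes per model, and the
    resulting .plan file is only valid for the GPU / TensorRT version
    it was built on.
    """
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse ONNX model")
    config = builder.create_builder_config()
    # FP16 is where much of the speedup typically comes from on supported GPUs
    config.set_flag(trt.BuilderFlag.FP16)
    engine_bytes = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(engine_bytes)
```

Because the optimizer specializes kernels for the exact GPU and library versions, the output file is not portable, which is why each user would pay the build cost locally.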
I installed the Docker release of Lsmith and I can confirm it works well. It lacks most of the features AUTOMATIC1111's UI has, but BOY is it fast. If I knew how to help implement this, I would; if there are tasks I could do to help, I'd get on them in my free time. When running batches on A1111 I got ~15 it/s; on Lsmith, running single image generations, I got 35 it/s. This would be a big improvement if it got implemented. For now I won't be using Lsmith because it's quite raw: there are resolution limits (1024x1024), and no hires fix or face restore. https://i.imgur.com/5FDD7Yb.png
Haven't tried the Windows native install yet, but will as soon as I get time.
Still interested in this? I definitely am.
Yes, I am trying it, and it mostly provides a ~2x speedup on my laptop's 3070.
@AUTOMATIC1111 Would it be possible to add this to the current code?
Has anyone compared TensorRT and Olive (DirectML)? They advertise similar performance.
Is there an existing issue for this?
What would your feature do ?
As shown in the SDA: Node repository, supported NVIDIA systems can achieve inference speeds of up to 4x over native PyTorch by utilising NVIDIA TensorRT.
Their demodiffusion.py and text-to-image (t2i.py) files provide a good example of how this is used. This is mainly based on NVIDIA's diffusion demo folder.
This speedup comes from compiling the model into a highly optimised version that can be run on NVIDIA GPUs, e.g.
https://huggingface.co/tensorrt/Anything-V3/tree/main
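Fetching such prebuilt engines could be sketched with `huggingface_hub` as below. The filenames are my assumptions about the repo layout, not something I've verified, and (per the note above) prebuilt engines only work if they happen to match your GPU and TensorRT version.

```python
# Hypothetical sketch: fetching prebuilt .plan engines from the Hugging Face
# repo linked above. Filenames are assumptions about the repo layout; note
# that prebuilt engines are GPU/TensorRT-version specific.
try:
    from huggingface_hub import hf_hub_download
    HUB_AVAILABLE = True
except ImportError:
    HUB_AVAILABLE = False

def fetch_engines(repo_id: str = "tensorrt/Anything-V3") -> dict:
    """Download the CLIP/UNet/VAE engine files (names are illustrative)."""
    paths = {}
    for name in ("clip.plan", "unet.plan", "vae.plan"):  # assumed filenames
        paths[name] = hf_hub_download(repo_id=repo_id, filename=name)
    return paths
```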
This splits out the CLIP, UNet and VAE into `.plan` files; these are serialized TensorRT engine files which contain the parameters of the optimized model. Running these through the TensorRT runtime imposes additional restrictions on resolution and batch size, as discussed in the SDA: Node README.

Usage depends on the `tensorrt` package, available for Linux (https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html).

Examples of implementations:
I've managed to build `sda-node` for Linux and test TensorRT on Windows, and can confirm around a ~3x speedup on my own system compared to inference in AUTOMATIC1111.

Implementation would depend on loading model `.plan` files into a runtime with the TensorRT engine and plugins loaded, including synchronizing CUDA and PyTorch, etc.

I'm unsure if this would be in scope for this project. Potentially too much would not run on this runtime, meaning it'd make more sense to just use a separate UI when wanting to utilise NVIDIA TensorRT models. People may find that a number of features they'd expect to work no longer work on the TensorRT runtime. It's also another mode of operation that is hard to support and will need to be maintained. Just proposing it in a cleaner way before anyone else does; it could be this is already in the works :-)
Proposed workflow
As I'm unsure of this project's structure and goals, I'm unsure of the viability or implementation path. Roughly: TensorRT inference enabled by a possible flag, say `--tensorrt`, with `.plan` engine files loaded from a models directory (`/models/` or `/inbuilt-extensions/tensorrt/models`).
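The opt-in flag idea above could be wired up along these lines. The flag name and directory layout are assumptions from this proposal, not anything the webui actually implements; falling back to PyTorch when no engines exist keeps the flag safe to pass unconditionally.

```python
# Hypothetical sketch of the proposed opt-in flag: a --tensorrt switch that
# selects the TensorRT backend only when serialized engines are present.
# Flag name and directory layout are assumptions from this proposal.
import argparse
import os

def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--tensorrt", action="store_true",
                        help="run inference through serialized TensorRT engines")
    parser.add_argument("--tensorrt-dir",
                        default="inbuilt-extensions/tensorrt/models",
                        help="directory holding the .plan files (assumed layout)")
    return parser

def select_backend(args) -> str:
    """Fall back to PyTorch if the engines haven't been built yet."""
    if args.tensorrt and os.path.isdir(args.tensorrt_dir):
        return "tensorrt"
    return "pytorch"

args = make_parser().parse_args(["--tensorrt"])
print(select_backend(args))  # "pytorch" unless the engine directory exists
```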
Additional information
I'll likely be messing around with this a bit on my own, but I have limited knowledge of PyTorch/Stable Diffusion/CUDA, so I wouldn't expect a PR from myself.