Open wingmrc opened 5 months ago
There is a PoC level extension.
Also try https://github.com/gameltb/ComfyUI_stable_fast
The StableFastUnet acceleration works well and gives a noticeable speed boost. It has some TensorRT support too, but that is actually slower in my experience (loading the compiled TensorRT engine seems pretty slow and causes latency).
Thanks for suggesting.
Already tried Apply TensorRT Unet/Unet_Block in https://github.com/gameltb/ComfyUI_stable_fast for a few days.
It requires a preview version of tensorrt to work correctly (not really a problem) and took 10+ minutes to compile the cache and the engine for an SD1-5/SDXL model. It used 40 GB+ of RAM (with a memory leak?) during compile and operation. Getting a x1.2~x1.4 speed-up after engine compile (RTX 4070), and it takes a significant time to reload upon LoRA or resolution changes (cudagraph off).
Seems not worth it currently unless running a simple static ComfyUI workflow. But still a great attempt.
Therefore, just wondering if there is a better implementation possible for tensorrt (or at least using part of tensorrt acceleration) that might be suitable for the dynamic nature of workflow in ComfyUI.
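To put the trade-off from the comment above into rough numbers, here is a minimal sketch of a break-even estimate: how many images have to be generated before a one-time engine compile pays for itself. The 10-minute compile time and the x1.2~x1.4 speed-up come from the comment; the 10 seconds per image baseline and the function name `break_even_images` are assumptions made up for illustration only.

```python
def break_even_images(compile_seconds: float,
                      baseline_seconds_per_image: float,
                      speedup: float) -> float:
    """Number of images after which the compile time has been recovered.

    Each accelerated image saves baseline * (1 - 1/speedup) seconds, so the
    break-even count is compile_seconds divided by that saving.
    """
    if speedup <= 1.0:
        # No per-image saving, so the compile cost is never recovered.
        return float("inf")
    saved_per_image = baseline_seconds_per_image * (1.0 - 1.0 / speedup)
    return compile_seconds / saved_per_image


if __name__ == "__main__":
    compile_seconds = 10 * 60           # "10+ mins" reported in the thread
    baseline_seconds_per_image = 10.0   # hypothetical, not from the thread
    for speedup in (1.2, 1.4):          # range reported in the thread
        n = break_even_images(compile_seconds,
                              baseline_seconds_per_image,
                              speedup)
        print(f"x{speedup}: roughly {round(n)} images to break even")
```

This estimate only covers a single compile. Every LoRA or resolution change that triggers a reload adds to the cost side, which is why a static workflow is the case where the numbers are most likely to work out.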
@wingmrc did you try the StableFast node? That works much better and is about as fast as TensorRT, at least with my 3060.
Had trouble getting the Apply StableFast Unet node to work with my existing workflow in the NGC pytorch:23.08-py3 container.
Kept getting an AttributeError: module 'cv2.ocl' has no attribute 'setUseOpenCL' error (installed opencv-contrib-python==4.8.0.74). Not really familiar with OpenCV on CUDA.
Eventually pulled a separate container and did a torch 2.1.2 + xformers cu121 only install just to get the node working.
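The traceback above suggests that some code calls cv2.ocl.setUseOpenCL on an OpenCV build that does not expose it. Which package makes that call is not stated in the thread, so the following is only a diagnostic sketch: it checks whether a given cv2 module provides the function and calls it only when it is present. The helper names `opencl_toggle_available` and `disable_opencl_if_possible` are hypothetical and not part of OpenCV or ComfyUI.

```python
def opencl_toggle_available(cv2_module) -> bool:
    """Return True if cv2_module.ocl exposes a callable setUseOpenCL."""
    ocl = getattr(cv2_module, "ocl", None)
    return callable(getattr(ocl, "setUseOpenCL", None))


def disable_opencl_if_possible(cv2_module) -> bool:
    """Call cv2.ocl.setUseOpenCL(False) only when the build provides it.

    Returns True if the call was made, False if it was skipped.
    """
    if not opencl_toggle_available(cv2_module):
        return False
    cv2_module.ocl.setUseOpenCL(False)
    return True


if __name__ == "__main__":
    try:
        import cv2  # only needed for the real check, not for the helpers
    except ImportError:
        print("cv2 is not installed in this environment")
    else:
        print("cv2 version:", cv2.__version__)
        print("setUseOpenCL available:", opencl_toggle_available(cv2))
```

Running this inside the container that shows the error would at least confirm whether the installed cv2 build is the one missing the attribute, for example when a preinstalled OpenCV shadows the pip package.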
Can see a great 1.35x~1.45x speed improvement with cudagraph enabled on an RTX 4070.
But it gets slower (cudagraph off) or even slower still (cudagraph on) when models are too large and get moved out of VRAM.
Tested working with SD1-5 and SDXL in fp16, bf16, fp8_e4m3fn and fp8_e5m2.
Acceleration works smoothly with controlnet and loras as long as the main checkpoint is mostly kept inside VRAM (--highvram) with cudagraph enabled.
Can't wait to see future improvements.
Hi there, I know some Large Language Model code can be accelerated with torch 2.1.x + torch.compile and with the extra torch-tensorrt package. Just wondering if this kind of optimization works for the stable diffusion pipeline. If so, looking forward to seeing it implemented in ComfyUI.