comfyanonymous / ComfyUI

The most powerful and modular stable diffusion GUI, api and backend with a graph/nodes interface.
GNU General Public License v3.0

[feature] native and reliable TensorRT acceleration with torch-tensorrt module, possible? #2485

Open wingmrc opened 5 months ago

wingmrc commented 5 months ago

Hi there. I know that some Large Language Model code can be accelerated with torch.compile in torch 2.1.x+ together with the extra torch-tensorrt package. I'm wondering whether this kind of optimization also works for the Stable Diffusion pipeline. If so, I look forward to seeing it implemented in ComfyUI.
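For readers unfamiliar with the combination mentioned above, here is a minimal sketch of wrapping a module with `torch.compile` and preferring the Torch-TensorRT backend when that package is installed. The helper names `pick_backend` and `compile_module` are hypothetical and not part of ComfyUI; falling back to the default "inductor" backend is an assumption made for illustration, and nothing here shows that a Stable Diffusion UNet compiles cleanly this way.

```python
import importlib.util


def pick_backend() -> str:
    """Return "torch_tensorrt" if that package is installed, else "inductor".

    Hypothetical helper: it only checks whether the package can be found,
    not whether a usable TensorRT runtime or GPU is present.
    """
    if importlib.util.find_spec("torch_tensorrt") is not None:
        return "torch_tensorrt"
    return "inductor"  # torch.compile's default backend


def compile_module(module):
    """Wrap a torch.nn.Module with torch.compile using the picked backend.

    Compilation is lazy: the real work happens on the first forward pass,
    and a change in input shape can trigger a recompile.
    """
    import torch  # imported lazily so this file loads without torch

    backend = pick_backend()
    if backend == "torch_tensorrt":
        # Importing the package registers its torch.compile backend.
        import torch_tensorrt  # noqa: F401
    return torch.compile(module, backend=backend)
```

Usage would look like `unet = compile_module(unet)`, where `unet` stands for whatever `torch.nn.Module` is to be accelerated.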

ltdrdata commented 5 months ago

There is a PoC level extension.

https://github.com/phineas-pta/comfy-trt-test

asagi4 commented 5 months ago

Also try https://github.com/gameltb/ComfyUI_stable_fast

The StableFastUnet acceleration works well and gives a noticeable speed boost. It has some TensorRT support too, but that is actually slower in my experience (loading the compiled TensorRT engine seems pretty slow and adds latency).

wingmrc commented 5 months ago

Thanks for the suggestion. I have already been trying Apply TensorRT Unet/Unet_Block from https://github.com/gameltb/ComfyUI_stable_fast for a few days.

It requires a preview version of TensorRT to work correctly (not really a problem) and took 10+ minutes to compile the SD1-5/SDXL model cache and the engine. It used 40 GB+ of RAM (possibly a memory leak?) during compilation and operation. I get a 1.2x~1.4x speed-up after the engine is compiled (RTX 4070), but reloading takes a significant amount of time whenever the LoRA or resolution changes (cudagraph off). It doesn't seem worth it currently unless you are running a simple static ComfyUI workflow. Still, it's a great attempt.
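To put the trade-off above into numbers, here is a rough break-even sketch: how many images must be generated before a one-off compile cost is repaid by a given speed-up. The 10-minute compile time and the 1.2x to 1.4x range come from this comment; the 10 seconds per image baseline is an assumed value for illustration only, and `breakeven_images` is a hypothetical helper name.

```python
import math


def breakeven_images(compile_s: float, baseline_s: float, speedup: float) -> int:
    """Number of images after which compiling has paid for itself.

    compile_s:  one-off engine compile time in seconds
    baseline_s: seconds per image without acceleration (assumed value)
    speedup:    throughput ratio after compiling, e.g. 1.2 for 1.2x
    """
    if speedup <= 1.0:
        raise ValueError("a speedup of 1.0 or less never breaks even")
    # Seconds saved on every image once the engine is compiled.
    saved_per_image = baseline_s * (1.0 - 1.0 / speedup)
    return math.ceil(compile_s / saved_per_image)


if __name__ == "__main__":
    for s in (1.2, 1.4):
        n = breakeven_images(compile_s=600, baseline_s=10, speedup=s)
        print(f"{s}x speed-up: roughly {n} images to break even")
```

Every LoRA or resolution change that forces a rebuild restarts this count, which is why a static workflow benefits most.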

So I'm wondering whether a better TensorRT implementation is possible (or at least one that uses part of TensorRT's acceleration) that would suit the dynamic nature of ComfyUI workflows.

asagi4 commented 5 months ago

@wingmrc did you try the StableFast node? That works much better and is about as fast as TensorRT, at least with my 3060.

wingmrc commented 5 months ago

> @wingmrc did you try the StableFast node? That works much better and is about as fast as TensorRT, at least with my 3060.

I had trouble getting the Apply StableFast Unet node to work with my existing workflow in the NGC pytorch:23.08-py3 container. I kept getting the error AttributeError: module 'cv2.ocl' has no attribute 'setUseOpenCL' (with opencv-contrib-python==4.8.0.74 installed). I'm not really familiar with OpenCV on CUDA.

Eventually I pulled a separate container and did a torch 2.1.2 + xformers cu121 only install, just to get the node working.
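For anyone hitting the same AttributeError, a small diagnostic sketch like the following can show which OpenCV build Python actually imports and whether it exposes `cv2.ocl.setUseOpenCL`. It only reports; it does not fix anything, and `opencv_ocl_report` is a hypothetical helper name.

```python
import importlib
import importlib.util


def opencv_ocl_report():
    """Describe the importable OpenCV build, or return None if there is none."""
    if importlib.util.find_spec("cv2") is None:
        return None
    try:
        cv2 = importlib.import_module("cv2")
    except ImportError:
        # The package is present but fails to load (e.g. a missing shared library).
        return None
    ocl = getattr(cv2, "ocl", None)
    return {
        "version": getattr(cv2, "__version__", "unknown"),
        "path": getattr(cv2, "__file__", "unknown"),
        "has_ocl": ocl is not None,
        "has_setUseOpenCL": hasattr(ocl, "setUseOpenCL"),
    }


if __name__ == "__main__":
    print(opencv_ocl_report())
```

If the reported path points somewhere other than the pip-installed package, two OpenCV builds may be shadowing each other in the container.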

I can see a great 1.35x~1.45x speed improvement with cudagraph enabled on an RTX 4070. However, it gets slower with cudagraph off, and slower still with cudagraph on, when models are too large and get moved out of VRAM.
Tested and working with SD1-5 and SDXL in fp16, bf16, fp8_e4m3fn, and fp8_e5m2. Acceleration works smoothly with ControlNet and LoRAs as long as the main checkpoint mostly stays in VRAM (--highvram) with cudagraph enabled.

Can't wait to see future improvements.