comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Suggestion: Enable CUDA graph to speed up stable diffusion #1379

Open kostum123 opened 1 year ago

kostum123 commented 1 year ago

Hi, I am a fan of ComfyUI and I love your stable diffusion GUI. It is amazing how you can create realistic images with a graph interface and no coding.

I have a suggestion for improving the performance of ComfyUI. Have you heard of CUDA Graphs? It is a CUDA feature, exposed through PyTorch, that lets you capture a sequence of CUDA operations once and replay it later with a single launch, cutting CPU overhead. It can make GPU workloads faster, especially ones that launch many short kernels or repeat the same sequence many times.

I think CUDA Graphs could help ComfyUI run faster and more efficiently. Stable diffusion involves many repeated GPU operations per image, such as sampling, denoising, and attention. By capturing each sampling step as a graph, ComfyUI could amortize the per-kernel launch overhead and generate images faster.
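For reference, the capture-and-replay flow looks roughly like this in PyTorch (a minimal sketch, assuming a fixed input shape; the tiny `Linear` model is a placeholder for illustration, not ComfyUI's actual UNet call):

```python
import torch

def make_graphed_step(model, example_input):
    """Capture one forward pass into a CUDA graph and return a replay function."""
    static_input = example_input.clone()
    # Warm up on a side stream so cuDNN/cuBLAS workspaces are allocated
    # before capture (new allocations during capture are not allowed).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model(static_input)

    def run(new_input):
        # Copy into the captured buffer, replay the recorded kernel
        # sequence with a single launch, then read the captured output.
        static_input.copy_(new_input)
        graph.replay()
        return static_output.clone()

    return run

if torch.cuda.is_available():
    net = torch.nn.Linear(64, 64).cuda().eval()
    step = make_graphed_step(net, torch.randn(8, 64, device="cuda"))
    out = step(torch.randn(8, 64, device="cuda"))
```

The catch is the same one noted for graph APIs generally: shapes and control flow must be static between replays, which is why it fits the fixed inner sampling loop best. PyTorch also ships a higher-level helper, `torch.cuda.make_graphed_callables`, that automates some of this.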

Could you please try to support CUDA graph in ComfyUI and see how it affects the speed and quality of image generation? I would really appreciate it if you could explore this possibility and share your results.

Thank you for your great work on ComfyUI. I hope you will consider my idea and let me know what you think.

NeedsMoar commented 1 year ago

Just pointing out that DirectML has the same graph API, although hopefully AMD just gets off their butts and finishes MIOpen.

NeedsMoar commented 1 year ago

I disassembled the torch_directml .pyd file to track down where it puts the interfaces to the things it implements; it's basically the C files in the include subdirectory of torch, and the graph stuff is there. DirectML.dll has more extensive functionality that can't be accessed directly unless I missed something. The full functionality is in a module not on PyPI, called pydirectml, which must be built from the current NuGet DML package and the source on GitHub; that module calls into DirectML.dll at a low level, so it's more like using DX12 (which is what it is).

The bigger problem with the graph API is that it takes fairly extensive setup to create a graph that will optimize well: every tensor operation has to be added to the graph object as either an input or an output, and the tensors shouldn't be created indirectly the way most code does it (they use the Tensor class instead). I'm not sure how important that part is; it might be needed to guarantee nothing else references the object. The tracing JIT with optimization, which may or may not already be on, is probably easier to implement.
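The tracing-JIT route mentioned above can be sketched like this (a toy module for illustration, assuming torch is installed; not ComfyUI code):

```python
import torch

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyBlock().eval()
example = torch.randn(4, 16)

# torch.jit.trace records the ops executed for this example input and
# freezes them into a TorchScript graph. Control flow that depends on
# tensor values gets baked in, so like the graph APIs it fits
# fixed-shape inner loops best, but it needs no manual graph building.
traced = torch.jit.trace(model, example)

with torch.no_grad():
    out = traced(torch.randn(4, 16))
```

Compared with hand-building a DirectML or CUDA graph, tracing gets you the frozen op sequence essentially for free, at the cost of less control over how it optimizes.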

While I was looking into that I found that MS (despite having access to the whole DX12 API from DirectML) uses https://github.com/intel/gpgmm/ to get free and total memory on the GPU for pydirectml. I don't know why, but that module would probably solve the issue of not being able to query total and free memory on AMD cards.
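For comparison, on NVIDIA hardware PyTorch already exposes this query directly via `torch.cuda.mem_get_info` (a thin wrapper over `cudaMemGetInfo`); it's the DirectML/AMD side that lacks an equivalent, which is the gap a gpgmm-based approach would fill. A small sketch:

```python
import torch

def gpu_memory_info():
    """Return (free_bytes, total_bytes) for the current GPU, or None without CUDA.

    torch.cuda.mem_get_info only covers CUDA devices; there is no
    torch-level equivalent for DirectML backends today.
    """
    if torch.cuda.is_available():
        return torch.cuda.mem_get_info()
    return None

info = gpu_memory_info()
```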

M1kep commented 1 year ago

How does this compare to AITemplate?
Technology: https://github.com/facebookincubator/AITemplate/issues
In Comfy: https://github.com/FizzleDorf/AIT