Open chengzeyi opened 11 months ago
Thanks for your interest.
I built this node for testing. Hope it will be compatible with freeu even animatediff node. I'm not familiar with TorchScript. There are some issues when switching models or offloading to the CPU. Hope we can fix these issues and make it really usable.
Thanks for your interest.
I built this node for testing. Hope it will be compatible with freeu even animatediff node. I'm not familiar with TorchScript. There are some issues when switching models or offloading to the CPU. Hope we can fix these issues and make it really usable.
Model offloading sounds incompatible with CUDA Graph and switching models seems to need recompilation. I would help if detailed usage examples can be provided😀.
for a simple workflow like this
If you run with one model and then switch to another, pytorch will fail at here This is because the mempool in sfast.cuda.graphs._per_device_execution_envs references graphs has been released. Should we make it per model?
for a simple workflow like this
If you run with one model and then switch to another, pytorch will fail at here This is because the mempool in sfast.cuda.graphs._per_device_execution_envs references graphs has been released. Should we make it per model?
Yes, I have written the following test to reproduce the bug. I think I would find a way to fix it soon. The initial purpose of this design is to reuse the memory pool for intermediate buffers allocated by all graphs on the same device, to save GPU VRAM.
import pytest
import gc
import torch
from sfast.cuda.graphs import (simple_make_graphed_callable, get_per_device_graph_execution_env)
def test_simple_make_graphed_callable():
device = torch.device('cuda')
def add(x, y):
return x + y
x = torch.randn(3, device=device)
y = torch.randn(3, device=device)
graphed_add = simple_make_graphed_callable(
add, example_inputs=(x, y)
)
expected_output = add(x, y)
assert torch.allclose(graphed_add(x, y), expected_output)
execution_env = get_per_device_graph_execution_env(device.index)
tmp_graph = torch.cuda.CUDAGraph()
with torch.cuda.device(execution_env.device), torch.cuda.stream(execution_env.stream):
with torch.cuda.graph(tmp_graph, pool=execution_env.mempool, stream=execution_env.stream):
x = torch.randn(3, device=device)
# Hold a reference to a tensor allocated from the pool
# so that it doesn't get cleared from graph_pools_freeble in CUDACachingAllocator
# if its use_count becomes 0.
# To reproduce the following runtime error because of all graphs using the same pool getting freed:
# > super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
# E RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":2103, please report a bug to PyTorch.
del graphed_add, tmp_graph
gc.collect()
graphed_add = simple_make_graphed_callable(
add, example_kwarg_inputs={'x': x, 'y': y}
)
expected_output = add(x, y)
assert torch.allclose(graphed_add(x=x, y=y), expected_output)
I think I managed to install this correctly (no errors with installation, also Import successful during ComfyUI launch) but where do I find this node: "Apply StableFast Unet" ?
Ok I found it under Loaders: https://github.com/gameltb/ComfyUI_stable_fast/blob/a222c9491f001f093ebc117fee311cecb488395a/node.py#L74 Had to restart ComfyUI twice.
Did not work until I added this flag: --disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)
Did not work until I added this flag:
--disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)
Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
Did not work until I added this flag:
--disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I would inspect and fix it.
By the way, when CUDA graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet. So enabling cudaMallocAsync won't bring any significant performance improvement in this case.
Did not work until I added this flag:
--disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I would inspect and fix it.
By the way, when CUDA graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet. So enabling cudaMallocAsync won't bring any significant performance improvement in this case.
Thank you. I would like to ask if when cuda_graph and jit_freeze are disabled, can the weights of compiled module dependencies be moved from GPU to CPU? And move back to the GPU when needed.
Did not work until I added this flag:
--disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I would inspect and fix it.
By the way, when CUDA graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet. So enabling cudaMallocAsync won't bring any significant performance improvement in this case.
Thank you. I would like to ask if when cuda_graph and jit_freeze are disabled, can the weights of compiled module dependencies be moved from GPU to CPU? And move back to the GPU when needed.
It should be legal but should not save memory or speed up the inference. Because the compiled graph still holds references to those tensors.
In fact, tensors are not movable. By "moving" tensors from GPU to CPU, the system creates copies of the original tensors on another device. The original tensor still exists unless there is no reference to it and the GC collects it.
Did not work until I added this flag:
--disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I would inspect and fix it.
By the way, when CUDA graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet. So enabling cudaMallocAsync won't bring any significant performance improvement in this case.
Thank you. I would like to ask if when cuda_graph and jit_freeze are disabled, can the weights of compiled module dependencies be moved from GPU to CPU? And move back to the GPU when needed.
It should be legal but should not save memory or speed up the inference. Because the compiled graph still holds references to those tensors.
In fact, tensors are not movable. By "moving" tensors from GPU to CPU, the system creates copies of the original tensors on another device. The original tensor still exists unless there is no reference to it and the GC collects it.
Thanks for the explanation. While it won't make inference faster, it's still useful under comfy, for example, when a large enough picture is inferred using the unet model, the remaining memory may not be enough for vae to decode it, so the memory used by unet needs to be released, if the memory it uses can only be freed by deleting it, then the module will have to be recompiled next time unet inference, which will make sfast meaningless in this case.
Did not work until I added this flag:
--disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I would inspect and fix it.
By the way, when CUDA graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet. So enabling cudaMallocAsync won't bring any significant performance improvement in this case.
Thank you. I would like to ask if when cuda_graph and jit_freeze are disabled, can the weights of compiled module dependencies be moved from GPU to CPU? And move back to the GPU when needed.
It should be legal but should not save memory or speed up the inference. Because the compiled graph still holds references to those tensors. In fact, tensors are not movable. By "moving" tensors from GPU to CPU, the system creates copies of the original tensors on another device. The original tensor still exists unless there is no reference to it and the GC collects it.
Thanks for the explanation. While it won't make inference faster, it's still useful under comfy, for example, when a large enough picture is inferred using the unet model, the remaining memory may not be enough for vae to decode it, so the memory used by unet needs to be released, if the memory it uses can only be freed by deleting it, then the module will have to be recompiled next time unet inference, which will make sfast meaningless in this case.
I think in most cases, weights only occupy a small ratio of the total memory consumed by inference and most GPU memory are consumed by intermediate tensors during the inference. And if CUDA Graph is not used, the UNet and VAE could share a global memory pool, and VAE can reuse the memory previously consumed by UNet. Only if CUDA Graph is enabled, a dedicated memory pool would be created for UNet to allocate its intermediate tensors and that cannot be reused by VAE (unless you also capture VAE into CUDA Graph and shares the same memory pool).
Did not work until I added this flag:
--disable-cuda-malloc
. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)
I have fixed that problem to make stable-fast
's CUDA Graph compatible with cudaMallocAsync
in the nightly release.
It is expected to have better performance improvement when Triton
is enabled @Soumyajit @gameltb. But I sometimes experience weird illegal memory access
or invalid argument
bug with Triton
or CUDA Graph
on WSL2
. When I switch to pure Linux, everything just works fine. I am not sure if this is related with some CUDA driver bugs on WSL
or Windows
, sadly.
Thanks for the info. I will try to update with the nightly release and re-check. A few comments from my side:
import triton
also works without problems.--index-url
wheel but I think that my PyTorch is with cu121Let me know if I missed out / made errors on something.
Hi, friend. I am the maintainer of stable-fast.
I occasionally found this project and tried it. Sadly it does not work out of the box because stable-fast is incompatible with it. I have adjusted my implementation to make it work and seen a noticeable performance improvement of 30% with batch size 4, 512x512 on SD 1.5. Is is not fine-tuned so there is definitely large room to improve the speed.
Hope we can work together to better integrate it.