gameltb / ComfyUI_stable_fast

Experimental usage of stable-fast and TensorRT.
MIT License

Good job! I have helped make stable-fast work with it. #1

Open chengzeyi opened 11 months ago

chengzeyi commented 11 months ago

Hi, friend. I am the maintainer of stable-fast.

I came across this project and tried it. Sadly, it does not work out of the box because stable-fast is incompatible with it. I have adjusted my implementation to make it work and have seen a noticeable performance improvement of 30% at 512x512 with batch size 4 on SD 1.5. It is not fine-tuned yet, so there is definitely room to improve the speed further.

Hope we can work together to better integrate it.

gameltb commented 11 months ago

Thanks for your interest.

I built this node for testing. I hope it will be compatible with FreeU and even the AnimateDiff node. I'm not familiar with TorchScript, and there are some issues when switching models or offloading to the CPU. Hope we can fix these issues and make it really usable.

chengzeyi commented 11 months ago

> Thanks for your interest.
>
> I built this node for testing. I hope it will be compatible with FreeU and even the AnimateDiff node. I'm not familiar with TorchScript, and there are some issues when switching models or offloading to the CPU. Hope we can fix these issues and make it really usable.

Model offloading sounds incompatible with CUDA Graph, and switching models seems to require recompilation. It would help if detailed usage examples could be provided 😀.

gameltb commented 11 months ago

For a simple workflow like this (workflow image attached):

If you run with one model and then switch to another, PyTorch will fail here. This is because the graphs referenced by the mempool in sfast.cuda.graphs._per_device_execution_envs have been released. Should we make it per model?
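
For illustration, a minimal sketch of what "per model" keying could look like (hypothetical code; only the _per_device_execution_envs name above comes from stable-fast, everything else is made up for this example):

```python
import dataclasses

import torch


@dataclasses.dataclass
class GraphExecutionEnv:
    device: torch.device
    stream: torch.cuda.Stream
    mempool: tuple  # opaque handle from torch.cuda.graph_pool_handle()


# Keyed by (device index, id(model)) instead of device index alone, so each
# compiled model gets its own mempool and releasing one model's graphs cannot
# invalidate a pool that another model's graphs still use.
_per_model_execution_envs: dict = {}


def get_per_model_graph_execution_env(device_index: int,
                                      model: torch.nn.Module) -> GraphExecutionEnv:
    key = (device_index, id(model))
    if key not in _per_model_execution_envs:
        device = torch.device('cuda', device_index)
        _per_model_execution_envs[key] = GraphExecutionEnv(
            device=device,
            stream=torch.cuda.Stream(device=device),
            mempool=torch.cuda.graph_pool_handle(),
        )
    return _per_model_execution_envs[key]
```

The trade-off is that separate pools give up the cross-graph memory reuse that the per-device design was meant to provide.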

chengzeyi commented 11 months ago

> For a simple workflow like this (workflow image attached):
>
> If you run with one model and then switch to another, PyTorch will fail here. This is because the graphs referenced by the mempool in sfast.cuda.graphs._per_device_execution_envs have been released. Should we make it per model?

Yes, I have written the following test to reproduce the bug. I think I will find a way to fix it soon. The initial purpose of this design is to reuse the memory pool for intermediate buffers allocated by all graphs on the same device, to save GPU VRAM.

```python
import pytest

import gc
import torch
from sfast.cuda.graphs import (simple_make_graphed_callable,
                               get_per_device_graph_execution_env)


def test_simple_make_graphed_callable():
    device = torch.device('cuda')

    def add(x, y):
        return x + y

    x = torch.randn(3, device=device)
    y = torch.randn(3, device=device)
    graphed_add = simple_make_graphed_callable(
        add, example_inputs=(x, y)
    )
    expected_output = add(x, y)
    assert torch.allclose(graphed_add(x, y), expected_output)

    execution_env = get_per_device_graph_execution_env(device.index)
    tmp_graph = torch.cuda.CUDAGraph()
    with torch.cuda.device(execution_env.device), torch.cuda.stream(execution_env.stream):
        with torch.cuda.graph(tmp_graph, pool=execution_env.mempool, stream=execution_env.stream):
            x = torch.randn(3, device=device)
    # Hold a reference to a tensor allocated from the pool
    # so that it doesn't get cleared from graph_pools_freeable in CUDACachingAllocator
    # if its use_count becomes 0.
    # To reproduce the following runtime error caused by all graphs using the same pool getting freed:
    # >       super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
    # E       RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":2103, please report a bug to PyTorch.
    del graphed_add, tmp_graph
    gc.collect()

    graphed_add = simple_make_graphed_callable(
        add, example_kwarg_inputs={'x': x, 'y': y}
    )
    expected_output = add(x, y)
    assert torch.allclose(graphed_add(x=x, y=y), expected_output)
```
Soumyajit commented 11 months ago

I think I managed to install this correctly (no errors during installation, and the import succeeds when ComfyUI launches), but where do I find this node: "Apply StableFast Unet"?

Soumyajit commented 11 months ago

OK, I found it under Loaders: https://github.com/gameltb/ComfyUI_stable_fast/blob/a222c9491f001f093ebc117fee311cecb488395a/node.py#L74. I had to restart ComfyUI twice.

Soumyajit commented 11 months ago

It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)

gameltb commented 11 months ago

> It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)

Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.

chengzeyi commented 11 months ago

> It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)
>
> Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.

Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I will inspect and fix it.

By the way, when CUDA Graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet, so enabling cudaMallocAsync won't bring any significant performance improvement in this case.
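
For context, a minimal capture/replay sketch in plain PyTorch (not stable-fast's own wrapper; the toy model is just a stand-in): all intermediate buffers are allocated once, during capture, into the graph's private pool, and replay() reuses them, so no allocator calls happen per inference.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 64)).cuda().eval()
static_input = torch.randn(8, 64, device='cuda')

# Warm up on a side stream so capture does not record one-time lazy init work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)  # buffers allocated once, inside the graph's pool

# "Inference": copy new data into the static input buffer and replay the graph.
static_input.copy_(torch.randn(8, 64, device='cuda'))
graph.replay()                  # no new GPU allocations happen here
result = static_output.clone()  # copy out before the next replay overwrites it
```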

gameltb commented 11 months ago

> It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)
>
> Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
>
> Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I will inspect and fix it.
>
> By the way, when CUDA Graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet, so enabling cudaMallocAsync won't bring any significant performance improvement in this case.

Thank you. I would like to ask: when cuda_graph and jit_freeze are disabled, can the weights that the compiled module depends on be moved from the GPU to the CPU, and moved back to the GPU when needed?

chengzeyi commented 11 months ago

> It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)
>
> Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
>
> Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I will inspect and fix it.
>
> By the way, when CUDA Graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet, so enabling cudaMallocAsync won't bring any significant performance improvement in this case.
>
> Thank you. I would like to ask: when cuda_graph and jit_freeze are disabled, can the weights that the compiled module depends on be moved from the GPU to the CPU, and moved back to the GPU when needed?

It should be legal, but it will not save memory or speed up inference, because the compiled graph still holds references to those tensors.

In fact, tensors are not movable: "moving" a tensor from GPU to CPU creates a copy of the original tensor on the other device. The original tensor still exists until there are no references to it and the GC collects it.
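
A small illustration of that point (generic PyTorch, not tied to this node): Tensor.to('cpu') returns a copy, and the original CUDA tensor keeps its VRAM until every reference to it is gone.

```python
import torch

w_gpu = torch.randn(1024, 1024, device='cuda')
held_by_graph = w_gpu                  # stand-in for a compiled graph holding a reference
w_cpu = w_gpu.to('cpu')                # "moving" actually creates a CPU copy
print(w_gpu.device, w_cpu.device)      # cuda:0 cpu -- both tensors exist now
print(torch.cuda.memory_allocated())   # the 4 MiB CUDA tensor is still allocated

del w_gpu, held_by_graph               # only when every reference is dropped...
print(torch.cuda.memory_allocated())   # ...is the allocation actually released
```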

gameltb commented 11 months ago

> It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)
>
> Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
>
> Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I will inspect and fix it.
>
> By the way, when CUDA Graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet, so enabling cudaMallocAsync won't bring any significant performance improvement in this case.
>
> Thank you. I would like to ask: when cuda_graph and jit_freeze are disabled, can the weights that the compiled module depends on be moved from the GPU to the CPU, and moved back to the GPU when needed?
>
> It should be legal, but it will not save memory or speed up inference, because the compiled graph still holds references to those tensors.
>
> In fact, tensors are not movable: "moving" a tensor from GPU to CPU creates a copy of the original tensor on the other device. The original tensor still exists until there are no references to it and the GC collects it.

Thanks for the explanation. While it won't make inference faster, it's still useful under ComfyUI. For example, when a large enough image is generated with the UNet, the remaining memory may not be enough for the VAE to decode it, so the memory used by the UNet needs to be released. If that memory can only be freed by deleting the module, then the module has to be recompiled the next time the UNet runs, which would make stable-fast pointless in this case.

chengzeyi commented 11 months ago

> It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)
>
> Looks like a recent change in stable-fast has made it no longer work with cudaMallocAsync.
>
> Seems like the recent change to save memory has made CUDA Graph incompatible with cudaMallocAsync. I will inspect and fix it.
>
> By the way, when CUDA Graph is enabled, there is no need to dynamically allocate GPU memory when running the UNet, so enabling cudaMallocAsync won't bring any significant performance improvement in this case.
>
> Thank you. I would like to ask: when cuda_graph and jit_freeze are disabled, can the weights that the compiled module depends on be moved from the GPU to the CPU, and moved back to the GPU when needed?
>
> It should be legal, but it will not save memory or speed up inference, because the compiled graph still holds references to those tensors. In fact, tensors are not movable: "moving" a tensor from GPU to CPU creates a copy of the original tensor on the other device. The original tensor still exists until there are no references to it and the GC collects it.
>
> Thanks for the explanation. While it won't make inference faster, it's still useful under ComfyUI. For example, when a large enough image is generated with the UNet, the remaining memory may not be enough for the VAE to decode it, so the memory used by the UNet needs to be released. If that memory can only be freed by deleting the module, then the module has to be recompiled the next time the UNet runs, which would make stable-fast pointless in this case.

I think in most cases the weights occupy only a small fraction of the total memory consumed by inference; most GPU memory is taken by intermediate tensors. If CUDA Graph is not used, the UNet and VAE can share a global memory pool, and the VAE can reuse the memory previously consumed by the UNet. Only when CUDA Graph is enabled is a dedicated memory pool created for the UNet's intermediate tensors, and that pool cannot be reused by the VAE (unless you also capture the VAE into a CUDA Graph that shares the same memory pool).
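
A sketch of that last option in plain PyTorch (hypothetical; tiny modules stand in for the UNet and VAE): two graphs captured with the same pool handle can reuse each other's intermediate memory, with the usual restriction that graphs sharing a pool are replayed in the order they were captured and never concurrently.

```python
import torch

unet_like = torch.nn.Linear(64, 64).cuda().eval()  # toy stand-ins, not the real models
vae_like = torch.nn.Linear(64, 64).cuda().eval()
x = torch.randn(8, 64, device='cuda')

# Warm up on a side stream so capture does not record one-time init work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    vae_like(unet_like(x))
torch.cuda.current_stream().wait_stream(s)

g_unet = torch.cuda.CUDAGraph()
with torch.cuda.graph(g_unet), torch.no_grad():
    y_unet = unet_like(x)

# Capture the second graph into the *same* pool so it can reuse the first
# graph's intermediate memory instead of allocating its own.
g_vae = torch.cuda.CUDAGraph()
with torch.cuda.graph(g_vae, pool=g_unet.pool()), torch.no_grad():
    y_vae = vae_like(y_unet)

g_unet.replay()
g_vae.replay()
print(y_vae.shape)  # torch.Size([8, 64])
```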

chengzeyi commented 11 months ago

> It did not work until I added this flag: --disable-cuda-malloc. It was failing in the compilation step with a CUDA runtime error. I have 16 GB of VRAM (4060 Ti). I get around a 12% speedup with SDXL at 768x768 in WSL2 :)

I have fixed that problem in the nightly release, so stable-fast's CUDA Graph is now compatible with cudaMallocAsync.

You can expect a larger performance improvement when Triton is enabled, @Soumyajit @gameltb. However, I sometimes hit weird illegal-memory-access or invalid-argument errors with Triton or CUDA Graph on WSL2; when I switch to pure Linux, everything works fine. Sadly, I am not sure whether this is related to CUDA driver bugs on WSL or Windows.

Soumyajit commented 11 months ago

Thanks for the info. I will update to the nightly release and re-check. A few comments from my side:

  1. I am not sure whether Triton is enabled in my case (is some flag required?). I definitely installed Triton into the virtual environment (without errors), and import triton also works without problems.
  2. I installed CUDA 12.1 on Windows but did not specifically install it in WSL, assuming it would automatically be loaded from Windows.
  3. I installed PyTorch with pip install, without the --index-url wheel flag, but I think my PyTorch build is cu121.
  4. I installed stable_fast-0.0.9+torch210cu121-cp310-cp310-manylinux2014_x86_64.whl for WSL.

Let me know if I missed something or made an error somewhere.
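
For points 1 and 3, a quick sanity check that can be run inside the ComfyUI virtual environment (a hypothetical snippet, not part of this repo):

```python
import torch

# Confirm the CUDA build of PyTorch and that Triton actually imports in this venv.
print(torch.__version__, torch.version.cuda)  # expect something like 2.1.0+cu121 and 12.1
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))
try:
    import triton
    print('triton', triton.__version__)
except ImportError:
    print('triton is not installed in this environment')
```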