comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Can transition from KSampler to VAE Decode be optimized? #1147

Open okolenmi opened 1 year ago

okolenmi commented 1 year ago

I have a weak 4GB GPU, but it looks like that's almost enough to generate big images with this UI (A1111's UI can't even start processing something like this). After KSampler finishes 100% of a 1920*1080 image, I get the error messages below. [This is the latest (today's) test version of this UI.]

With --normalvram:

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
!!! Exception during processing !!!
...
CUDA out of memory. Tried to allocate 1.98 GiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Of the allocated memory 2.79 GiB is allocated by PyTorch, and 556.89 MiB is reserved by PyTorch but unallocated.
...
CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.11 GiB is allocated by PyTorch, and 236.83 MiB is reserved by PyTorch but unallocated. 

With --lowvram (queue of 3 elements):

100%|██████████████████████████████████████████████████████████████████████████████████| 22/22 [07:05<00:00, 19.35s/it]
!!! Exception during processing !!!
Traceback (most recent call last):
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\nodes.py", line 241, in decode
    return (vae.decode(samples["samples"]), )
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\sd.py", line 626, in decode
    pixel_samples[x:x+batch_number] = torch.clamp((self.first_stage_model.decode(samples) + 1.0) / 2.0, min=0.0, max=1.0).cpu().float()
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\ldm\models\autoencoder.py", line 94, in decode
    dec = self.decoder(z)
          ^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\ldm\modules\diffusionmodules\model.py", line 734, in forward
    h = nonlinearity(h)
        ^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\ldm\modules\diffusionmodules\model.py", line 40, in nonlinearity
    return x*torch.sigmoid(x)
             ^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Prompt executed in 448.38 seconds
Exception in thread Thread-1 (prompt_worker):
Traceback (most recent call last):
  File "threading.py", line 1038, in _bootstrap_inner
  File "threading.py", line 975, in run
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\main.py", line 88, in prompt_worker
    comfy.model_management.soft_empty_cache()
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\model_management.py", line 554, in soft_empty_cache
    torch.cuda.empty_cache()
  File "D:\ComfyUI_windows_portable_nightly_pytorch\python_embeded\Lib\site-packages\torch\cuda\memory.py", line 164, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

About --normalvram mode: I don't know exactly how the image is processed in the VAE decoder, but if some memory could be freed after KSampler finishes, everyone could generate larger images than usual.

The reasons for the failure seem to differ between the normal and low memory modes, so I can't say anything about --lowvram mode.
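To illustrate what I mean, here is a rough sketch of a pass-through custom node that clears the cached VRAM between KSampler and VAE Decode. The node and class names here are made up for illustration; soft_empty_cache() is the helper that appears in the traceback above, so I assume it is the intended way to release the cache:

```python
import comfy.model_management

class FreeVRAMPassthrough:
    """Hypothetical node: pass latents through unchanged, clearing the CUDA
    allocator cache first so VAE Decode starts with as much free VRAM as possible."""

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"samples": ("LATENT",)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "free"
    CATEGORY = "_for_testing"

    def free(self, samples):
        # Same helper seen in the traceback; it wraps torch.cuda.empty_cache().
        comfy.model_management.soft_empty_cache()
        return (samples,)

NODE_CLASS_MAPPINGS = {"FreeVRAMPassthrough": FreeVRAMPassthrough}
```

Wired in between KSampler and VAE Decode, this would at least return cached-but-unused blocks to the driver before decoding starts.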

fuami commented 1 year ago

Have you tried the newer Tiled VAE Decoder? I've had great success decoding much larger images with it on 4GB of VRAM.

okolenmi commented 1 year ago
As far as I can see, in --normalvram mode there was an attempt to fall back to the tiled VAE decoder, but at that point the memory from the failed attempt still hadn't been freed; it just disappeared from the reserve. Or it could be some other consumer that's unknown to me.
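One way to check whether it's the allocator cache or live tensors holding the memory would be to compare allocated vs. reserved bytes around the failure point. These are standard PyTorch calls, nothing ComfyUI-specific:

```python
import torch

def report(tag):
    # allocated = memory held by live tensors; reserved = cached by the allocator
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

report("after KSampler")
torch.cuda.empty_cache()  # hands cached-but-unused blocks back to the driver
report("after empty_cache")
```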


If you are talking about the tiled VAE in the A1111 UI... well, I haven't tried it. That UI is a VRAM sink anyway: I can generate the first 2-4 images at 1000*1000 resolution, but after about 20 generations there isn't enough memory to generate even a 500*500 image. I have to restart the server manually to generate anything again; restarting from the UI doesn't help at all. It looks like a memory leak.

Hmm... the new log in --normalvram mode looks different this time (single element in the queue), after trying to use a custom VAE:

100%|██████████████████████████████████████████████████████████████████████████████████| 22/22 [07:31<00:00, 20.50s/it]
making attention of type 'vanilla-pytorch' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-pytorch' with 512 in_channels
Global Step: 840001
!!! Exception during processing !!!
Traceback (most recent call last):
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\nodes.py", line 241, in decode
    return (vae.decode(samples["samples"]), )
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\sd.py", line 626, in decode
    pixel_samples[x:x+batch_number] = torch.clamp((self.first_stage_model.decode(samples) + 1.0) / 2.0, min=0.0, max=1.0).cpu().float()
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Prompt executed in 473.44 seconds
Exception in thread Thread-1 (prompt_worker):
Traceback (most recent call last):
  File "threading.py", line 1038, in _bootstrap_inner
  File "threading.py", line 975, in run
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\main.py", line 88, in prompt_worker
    comfy.model_management.soft_empty_cache()
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\model_management.py", line 554, in soft_empty_cache
    torch.cuda.empty_cache()
  File "D:\ComfyUI_windows_portable_nightly_pytorch\python_embeded\Lib\site-packages\torch\cuda\memory.py", line 164, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
fuami commented 1 year ago

Sorry for the confusion; I was talking about the node "VAEDecodeTiled" ( https://github.com/comfyanonymous/ComfyUI/blob/master/nodes.py#L243 ).
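Conceptually, it decodes the latent one tile at a time so that only a single tile's decoder activations sit in VRAM. A rough sketch of the idea (the tile size is illustrative, the assumed vae.decode() maps a latent tile to an 8x-upscaled image tile, and the real node also blends overlapping tiles to hide seams rather than overwriting them):

```python
import torch

def decode_tiled(vae, latent, tile=64, overlap=16):
    # latent: (B, 4, H/8, W/8) -> image: (B, 3, H, W), accumulated on CPU
    b, c, h, w = latent.shape
    out = torch.zeros(b, 3, h * 8, w * 8)
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            # only this tile's decoder activations occupy VRAM
            decoded = vae.decode(latent[:, :, y:y + tile, x:x + tile])
            out[:, :, y * 8:(y + tile) * 8, x * 8:(x + tile) * 8] = decoded.cpu()
    return out
```

Peak memory is bounded by the tile size instead of the full image, which is why it copes with resolutions the regular decode cannot.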

okolenmi commented 1 year ago

Thank you! This one works very nicely (tested in --lowvram mode). This VAE decoder should be recommended for low-VRAM users.

waltercool commented 1 year ago

Just a friendly reminder: 4GB is already below the minimum requirements. Track your memory usage with nvidia-smi or rocm-smi; sometimes the checkpoint, LoRAs, and other assets can take up too much of your VRAM.

Sometimes even I face random problems queuing images with ESRGAN x4 on under 10GB of RAM.

ErebusAngelo commented 1 year ago

> Thank you! This one works very nicely (tested in --lowvram mode). This VAE decoder should be recommended for low-VRAM users.

Could you tell me how to swap the VAE decoder for the tiled version? I'm new to this; I've tried everything and can't get it to work. Thank you in advance!