Vchitect / VEnhancer

Official codes of VEnhancer: Generative Space-Time Enhancement for Video Generation
https://vchitect.github.io/VEnhancer-project/

OutOfMemoryError: CUDA out of memory with 24GB. Is it possible to apply torchao quantization? #20

Open loretoparisi opened 3 weeks ago

loretoparisi commented 3 weeks ago

I'm getting an OOM when running the following command

python enhance_a_video.py \
--version v2 \
--up_scale 4 --target_fps 24 --noise_aug 250 \
--solver_mode 'fast' --steps 15 \
--input_path 'prompts/' \
--prompt_path 'prompts/text_prompts.txt' \
--save_dir 'results/' \
--model_path 'ckpts/venhancer_v2.pt'

using the following GPU:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      On  | 00000000:35:00.0 Off |                    0 |
| N/A   47C    P0              20W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.13 GiB (GPU 0; 21.96 GiB total capacity; 18.17 GiB already allocated; 837.06 MiB free; 20.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
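The allocator hint in the message (max_split_size_mb) can be tried as a first, partial mitigation; it only addresses fragmentation, not a real capacity shortfall. A minimal sketch, assuming 128 MB as an example value, set before anything allocates on the GPU:

import os

# Must be set before the first CUDA allocation (i.e. before the model is built).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"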

Is it possible to apply BF16 quantization? My approach to run CogVideoX in 24 GB is tiled VAE decoding and sliced VAE decoding, plus CPU offload, running the pipeline in BF16. To quantize the model I use torchao:

from torchao.quantization import quantize_, int8_weight_only
from torchao.float8.inference import ActivationCasting, QuantConfig, quantize_to_float8
def quantize_model(part, quantization_scheme):
    if quantization_scheme == "int8":
        quantize_(part, int8_weight_only())
    elif quantization_scheme == "fp8":
        quantize_to_float8(part, QuantConfig(ActivationCasting.DYNAMIC))
    return part
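For reference, a minimal sketch of the CogVideoX memory-saving setup mentioned above (BF16, CPU offload, tiled and sliced VAE decoding), assuming a recent diffusers release that ships CogVideoXPipeline:

import torch
from diffusers import CogVideoXPipeline

# Load the pipeline in BF16 and enable the memory savers described above.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # move submodules to the GPU only while they are used
pipe.vae.enable_tiling()         # tiled VAE decoding
pipe.vae.enable_slicing()        # sliced VAE decoding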

I'm not sure whether this can be applied to your model too.

hejingwenhejingwen commented 3 weeks ago

It is already fp16, and we have also used tiled and sliced VAE decoding. You can use multi-GPU inference if you have multiple GPUs, but parallel inference for VAE decoding is not supported yet, so 24GB may still not be enough.
We will work on parallel inference for VAE decoding in the future.

loretoparisi commented 3 weeks ago

It is already fp16, and we have also used tiled and sliced VAE decoding. You can use multi-GPU inference if you have multiple GPUs, but parallel inference for VAE decoding is not supported yet, so 24GB may still not be enough. We will work on parallel inference for VAE decoding in the future.

Okay, thank you. If I use 4x 24 GB GPUs, should it work with slicing enabled, etc.?

hejingwenhejingwen commented 3 weeks ago

Not sure. I think the VAE part is more likely to cause OOM; you can reduce the tile size (f, h, w) in VEnhancer/video_to_video/video_to_video_model_parallel.py, lines 172-174.

[Screenshot 2024-09-17 23:40:39]

You can also change the chunk size (the frame length for one chunk). The chunk size is currently set to 32; you can use 24 or lower. Note that for frame lengths below 32 we only use a single chunk; there is a restriction enforcing this in VEnhancer/video_to_video/video_to_video_model_parallel.py, please comment it out.

[Screenshot 2024-09-17 23:50:14]

To change the chunk size, please go here: https://github.com/Vchitect/VEnhancer/blob/80ffaa33988c583b129b730ce9d559b114de2d8c/video_to_video/utils/util.py#L31

It's quite annoying; I will make these visible to users by exposing more configuration parameters in the command script.

loretoparisi commented 3 weeks ago

Thank you. While trying to adjust the chunk size, I also ran a multi-GPU test:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   48C    P0             216W / 300W |  20644MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   44C    P0             220W / 300W |  20644MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   43C    P0             223W / 300W |  20644MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   44C    P0             215W / 300W |  20644MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I got an OOM, but only after some processing:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.81 GiB (GPU 3; 22.19 GiB total capacity; 17.28 GiB already allocated; 2.65 GiB free; 19.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2751 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2752 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2753 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 2754) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
enhance_a_video_MultiGPU.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-17_16:15:25
  host      : 
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2754)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The whole stack trace was:

vae/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 609/609 [00:00<00:00, 7.14MB/s]
diffusion_pytorch_model.fp16.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 196M/196M [00:00<00:00, 385MB/s]
2024-09-17 16:09:34,646 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:34,646 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:34,693 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:34,693 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:34,693 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:34,733 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:34,733 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:34,733 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:34,733 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:34,781 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:09:35,113 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:35,113 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:35,143 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:35,143 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:35,143 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:35,143 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:35,143 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:35,172 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:35,172 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:35,172 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:35,182 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:35,182 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:35,182 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:35,182 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:35,189 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:09:35,211 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:35,212 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:35,212 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:35,212 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:35,218 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:09:35,425 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:35,425 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:35,454 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:35,454 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:35,454 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:35,493 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:35,493 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:35,494 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:35,494 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:35,500 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:10:00,041 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,863 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:00,871 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,887 - video_to_video - INFO - step: 0
2024-09-17 16:10:01,277 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:13,071 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:10:24,348 - video_to_video - INFO - step: 1
2024-09-17 16:10:24,348 - video_to_video - INFO - step: 1
2024-09-17 16:10:24,348 - video_to_video - INFO - step: 1
2024-09-17 16:10:24,348 - video_to_video - INFO - step: 1
2024-09-17 16:10:24,367 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:24,367 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:24,367 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:24,367 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:24,367 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:24,367 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:24,367 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:10:35,714 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:35,714 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:35,714 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:35,714 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:35,714 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:35,714 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:35,714 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:10:46,982 - video_to_video - INFO - step: 2
2024-09-17 16:10:46,982 - video_to_video - INFO - step: 2
2024-09-17 16:10:46,982 - video_to_video - INFO - step: 2
2024-09-17 16:10:46,982 - video_to_video - INFO - step: 2
2024-09-17 16:10:47,012 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:47,012 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:47,012 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:47,012 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:47,012 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:47,012 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:47,013 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:10:58,292 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:58,292 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:58,292 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:58,292 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:58,292 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:58,292 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:58,293 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:11:09,568 - video_to_video - INFO - step: 3
2024-09-17 16:11:09,568 - video_to_video - INFO - step: 3
2024-09-17 16:11:09,568 - video_to_video - INFO - step: 3
2024-09-17 16:11:09,568 - video_to_video - INFO - step: 3
2024-09-17 16:11:09,600 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:11:09,600 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:11:09,600 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:11:09,600 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:11:09,600 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:11:09,600 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:11:09,600 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:11:20,867 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:11:20,868 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:11:20,868 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:11:20,868 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:11:20,868 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:11:20,868 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:11:20,868 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:11:32,138 - video_to_video - INFO - step: 4
2024-09-17 16:11:32,138 - video_to_video - INFO - step: 4
2024-09-17 16:11:32,138 - video_to_video - INFO - step: 4
2024-09-17 16:11:32,138 - video_to_video - INFO - step: 4
2024-09-17 16:11:32,165 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:11:32,166 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:11:32,166 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:11:32,166 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:11:32,166 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:11:32,166 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:11:32,166 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:11:43,418 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:11:43,418 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:11:43,419 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:11:43,419 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:11:43,419 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:11:43,419 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:11:43,419 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:11:54,695 - video_to_video - INFO - step: 5
2024-09-17 16:11:54,695 - video_to_video - INFO - step: 5
2024-09-17 16:11:54,695 - video_to_video - INFO - step: 5
2024-09-17 16:11:54,695 - video_to_video - INFO - step: 5
2024-09-17 16:11:54,719 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:11:54,719 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:11:54,719 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:11:54,719 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:11:54,719 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:11:54,719 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:11:54,719 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:12:05,980 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:12:05,980 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:12:05,980 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:12:05,980 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:12:05,980 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:12:05,980 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:12:05,980 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:12:17,250 - video_to_video - INFO - step: 6
2024-09-17 16:12:17,250 - video_to_video - INFO - step: 6
2024-09-17 16:12:17,250 - video_to_video - INFO - step: 6
2024-09-17 16:12:17,250 - video_to_video - INFO - step: 6
2024-09-17 16:12:17,275 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:12:17,275 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:12:17,275 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:12:17,275 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:12:17,275 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:12:17,275 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:12:17,275 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:12:28,542 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:12:28,543 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:12:28,543 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:12:28,543 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:12:28,543 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:12:28,543 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:12:28,543 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:12:39,823 - video_to_video - INFO - step: 7
2024-09-17 16:12:39,823 - video_to_video - INFO - step: 7
2024-09-17 16:12:39,823 - video_to_video - INFO - step: 7
2024-09-17 16:12:39,823 - video_to_video - INFO - step: 7
2024-09-17 16:12:39,848 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:12:39,848 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:12:39,848 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:12:39,848 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:12:39,848 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:12:39,848 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:12:39,849 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:12:51,114 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:12:51,114 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:12:51,114 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:12:51,114 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:12:51,114 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:12:51,114 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:12:51,114 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:13:02,392 - video_to_video - INFO - step: 8
2024-09-17 16:13:02,392 - video_to_video - INFO - step: 8
2024-09-17 16:13:02,392 - video_to_video - INFO - step: 8
2024-09-17 16:13:02,392 - video_to_video - INFO - step: 8
2024-09-17 16:13:02,422 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:13:02,422 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:13:02,422 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:13:02,422 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:13:02,422 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:13:02,422 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:13:02,422 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:13:13,689 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:13:13,689 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:13:13,689 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:13:13,689 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:13:13,689 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:13:13,689 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:13:13,689 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:13:24,969 - video_to_video - INFO - step: 9
2024-09-17 16:13:24,970 - video_to_video - INFO - step: 9
2024-09-17 16:13:24,970 - video_to_video - INFO - step: 9
2024-09-17 16:13:24,970 - video_to_video - INFO - step: 9
2024-09-17 16:13:24,996 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:13:24,996 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:13:24,996 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:13:24,996 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:13:24,996 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:13:24,996 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:13:24,996 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:13:36,275 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:13:36,275 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:13:36,275 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:13:36,275 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:13:36,275 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:13:36,275 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:13:36,275 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:13:47,554 - video_to_video - INFO - step: 10
2024-09-17 16:13:47,554 - video_to_video - INFO - step: 10
2024-09-17 16:13:47,554 - video_to_video - INFO - step: 10
2024-09-17 16:13:47,554 - video_to_video - INFO - step: 10
2024-09-17 16:13:47,580 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:13:47,580 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:13:47,580 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:13:47,581 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:13:47,581 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:13:47,581 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:13:47,581 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:13:58,843 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:13:58,843 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:13:58,843 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:13:58,843 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:13:58,843 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:13:58,843 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:13:58,843 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:14:10,115 - video_to_video - INFO - step: 11
2024-09-17 16:14:10,115 - video_to_video - INFO - step: 11
2024-09-17 16:14:10,115 - video_to_video - INFO - step: 11
2024-09-17 16:14:10,115 - video_to_video - INFO - step: 11
2024-09-17 16:14:10,139 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:14:10,139 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:14:10,139 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:14:10,139 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:14:10,139 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:14:10,139 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:14:10,139 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:14:21,410 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:14:21,410 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:14:21,410 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:14:21,410 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:14:21,410 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:14:21,410 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:14:21,410 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:14:32,685 - video_to_video - INFO - step: 12
2024-09-17 16:14:32,685 - video_to_video - INFO - step: 12
2024-09-17 16:14:32,685 - video_to_video - INFO - step: 12
2024-09-17 16:14:32,685 - video_to_video - INFO - step: 12
2024-09-17 16:14:32,712 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:14:32,712 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:14:32,712 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:14:32,712 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:14:32,712 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:14:32,712 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:14:32,712 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:14:43,984 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:14:43,984 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:14:43,984 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:14:43,984 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:14:43,984 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:14:43,984 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:14:43,985 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:14:55,262 - video_to_video - INFO - step: 13
2024-09-17 16:14:55,262 - video_to_video - INFO - step: 13
2024-09-17 16:14:55,262 - video_to_video - INFO - step: 13
2024-09-17 16:14:55,262 - video_to_video - INFO - step: 13
2024-09-17 16:14:55,277 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:14:55,277 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:14:55,277 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:14:55,277 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:14:55,277 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:14:55,277 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:14:55,277 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:15:06,545 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:15:06,545 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:15:06,545 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:15:06,545 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:15:06,545 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:15:06,545 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:15:06,545 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:15:17,860 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,935 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,935 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,937 - video_to_video - INFO - sampling, finished.

hejingwenhejingwen commented 3 weeks ago

It seems that you have already finished sampling for the diffusion part, so the OOM is caused by VAE decoding. Please go here: https://github.com/Vchitect/VEnhancer/blob/80ffaa33988c583b129b730ce9d559b114de2d8c/video_to_video/video_to_video_model_parallel.py#L172 For example, you can make these modifications: self.frame_chunk_size = 3, self.tile_img_height = 576, self.tile_img_width = 768.

SamitM1 commented 3 weeks ago

Would those modifications reduce the quality of the output, or just slow down processing?

hejingwenhejingwen commented 3 weeks ago

Would those modifications reduce the quality of the output, or just slow down processing?

I don't see obvious quality loss, but I've just tested several samples.

loretoparisi commented 3 weeks ago

@hejingwenhejingwen 🥇 it worked!

https://github.com/user-attachments/assets/27d6707d-285c-4dc4-8874-c59085302308

https://github.com/user-attachments/assets/54df614b-88fc-4512-b972-8210497047c7

It seems that you have already finished sampling for the diffusion part, so the OOM is caused by VAE decoding. Please go here:

https://github.com/Vchitect/VEnhancer/blob/80ffaa33988c583b129b730ce9d559b114de2d8c/video_to_video/video_to_video_model_parallel.py#L172

For example, you can make these modifications: self.frame_chunk_size = 3, self.tile_img_height = 576, self.tile_img_width = 768.

Thank you, so I have applied these modifications:

I passed max_chunk_len as a parameter to make_chunks:

video_to_video/utils/util.py

def make_chunks(f_num, interp_f_num, chunk_overlap_ratio=0.5, max_chunk_len=32):
    # Maximum overlap between consecutive chunks, in frames.
    MAX_O_LEN = max_chunk_len * chunk_overlap_ratio
    # Round the chunk length and overlap down to a whole number of
    # interpolation groups (interp_f_num + 1 frames), plus one boundary frame.
    chunk_len = int((max_chunk_len - 1) // (1 + interp_f_num) * (interp_f_num + 1) + 1)
    o_len = int((MAX_O_LEN - 1) // (1 + interp_f_num) * (interp_f_num + 1) + 1)
    # Sliding-window start/end indices covering all f_num frames.
    chunk_inds = sliding_windows_1d(f_num, chunk_len, o_len)
    return chunk_inds

in order to adjust it, for example to 24, in video_to_video_model.py:

max_chunk_len = 24  # default was 32
torch.cuda.empty_cache()
chunk_inds = make_chunks(frames_num, interp_f_num, max_chunk_len=max_chunk_len)

In the same way, as you suggested, I passed those parameters to tiled_chunked_decode

(video_to_video_model_parallel.py)

logger.info(f"sampling, finished.")
frame_chunk_size = 3
tile_img_height = 576
tile_img_width = 768 
gen_video = self.tiled_chunked_decode(gen_vid, 
        frame_chunk_size=frame_chunk_size, 
        tile_img_height=tile_img_height, 
        tile_img_width=tile_img_width)

NOTES: I have also added pip install accelerate. To solve the ImportError: libGL.so.1: cannot open shared object file: No such file or directory that may happen on Ubuntu (this can also happen with CogVideoX on some platforms), I ran sudo apt-get update && sudo apt-get install ffmpeg libsm6 libxext6 -y.

loretoparisi commented 3 weeks ago

@hejingwenhejingwen some more tests. I'm trying CogVideoX generation now, and the OOM is caused, as in your detailed description above, by the higher number of frames (CogVideoX defaults to 49 frames):

2024-09-18 09:56:18,427 - video_to_video - INFO - checkpoint_path: ckpts/venhancer_v2.pt
2024-09-18 09:56:30,356 - video_to_video - INFO - Build encoder with FrozenOpenCLIPEmbedder
2024-09-18 09:56:49,183 - video_to_video - INFO - Load model path ckpts/venhancer_v2.pt, with local status <All keys matched successfully>
2024-09-18 09:56:49,184 - video_to_video - INFO - Build diffusion with GaussianDiffusion
2024-09-18 09:56:49,966 - video_to_video - INFO - Load model path ckpts/venhancer_v2.pt, with local status <All keys matched successfully>
2024-09-18 09:56:49,967 - video_to_video - INFO - Build diffusion with GaussianDiffusion

and the image frames info:

2024-09-18 09:56:50,232 - video_to_video - INFO - input frames length: 49
2024-09-18 09:56:50,232 - video_to_video - INFO - input fps: 12.0
2024-09-18 09:56:50,248 - video_to_video - INFO - target_fps: 24.0
2024-09-18 09:56:50,503 - video_to_video - INFO - input resolution: (480, 720)
2024-09-18 09:56:50,503 - video_to_video - INFO - target resolution: (1254, 1880)
2024-09-18 09:56:50,503 - video_to_video - INFO - noise augmentation: 250
2024-09-18 09:56:50,503 - video_to_video - INFO - scale s is set to: 4.0
2024-09-18 09:56:50,535 - video_to_video - INFO - video_data shape: torch.Size([97, 3, 1254, 1880])

In this case I get the OOM, I would say in the VAE encode step:

video_data_feature = self.vae_encode(video_data)

stacktrace:

  File "/home/coder/.local/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_temporal_decoder.py", line 334, in encode
    h = self.encoder(x)
  File "/home/coder/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/vae.py", line 175, in forward
    sample = self.mid_block(sample)
  File "/home/coder/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 738, in forward
    hidden_states = attn(hidden_states, temb=temb)
  File "/home/coder/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 490, in forward
    return self.processor(
  File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 2216, in __call__
    hidden_states = F.scaled_dot_product_attention(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.63 GiB (GPU 2; 22.19 GiB total capacity; 18.40 GiB already allocated; 2.31 GiB free; 19.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 77142) of binary: /usr/bin/python3

In this case I tried to handle the 49 frames on 4x 24 GB; the setup used chunks of 12 frames in make_chunks, without success:

max_chunk_len = 12
torch.cuda.empty_cache()
chunk_inds = make_chunks(frames_num, interp_f_num, max_chunk_len=max_chunk_len)

I have kept the setup for tiled_chunked_decode, but in fact that code is never reached here:

frame_chunk_size = 3
tile_img_height = 576
tile_img_width = 768 
gen_video = self.tiled_chunked_decode(gen_vid, 
        frame_chunk_size=frame_chunk_size, 
        tile_img_height=tile_img_height, 
        tile_img_width=tile_img_width)

So, is there a rule of thumb to compute the VRAM requirements from the image H and W, given FPS and STEPS, in order to set max_chunk_len in advance? Thank you!

hejingwenhejingwen commented 3 weeks ago

It is because of VAE encoding; I will add sliced encoding to avoid this.

loretoparisi commented 3 weeks ago

It is because of VAE encoding; I will add sliced encoding to avoid this.

Great, in fact I was looking at this implementation for the AutoencoderKLCogVideoX class. Interestingly, the authors wrote down notes on the VRAM requirements:

# Rough memory assessment:
#   - In CogVideoX-2B, there are a total of 24 CausalConv3d layers.
#   - The biggest intermediate dimensions are: [1, 128, 9, 480, 720].
#   - Assume fp16 (2 bytes per value).
# Memory required: 1 * 128 * 9 * 480 * 720 * 24 * 2 / 1024**3 = 17.8 GB
#
# Memory assessment when using tiling:
#   - Assume everything as above but now HxW is 240x360 by tiling in half
# Memory required: 1 * 128 * 9 * 240 * 360 * 24 * 2 / 1024**3 = 4.5 GB
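As a rough take on the rule-of-thumb question above, a tiny helper that reproduces the same back-of-envelope arithmetic as this note (the constants are the ones quoted in the comment; it gives an estimate, not a measurement):

def rough_vae_memory_gb(batch, channels, frames, height, width,
                        num_layers=24, bytes_per_value=2):
    """Back-of-envelope activation-memory estimate in GB, following the note above."""
    return batch * channels * frames * height * width * num_layers * bytes_per_value / 1024**3

print(rough_vae_memory_gb(1, 128, 9, 480, 720))  # ~17.8 GB, full resolution
print(rough_vae_memory_gb(1, 128, 9, 240, 360))  # ~4.5 GB, tiled to half H and W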

hejingwenhejingwen commented 3 weeks ago

Thanks. It is actually okay to process 31 frames, but OOM with 97 frames, so the problem is too many frames. We already encode the frames one by one, but it still OOMs. Besides sliced and tiled VAE encoding, you can make chunks of these frames and process each chunk separately for both VAE encoding and all sampling steps. The existing code only makes chunks for the sampling steps; that is, all frames are split before denoising and then merged together after denoising.
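A minimal sketch of that chunk-then-process idea; enhance_chunk is a hypothetical stand-in for the full per-chunk pipeline (VAE encode, all sampling steps, VAE decode), and a real implementation would blend the overlapping frames instead of concatenating them naively:

import torch

def process_in_chunks(frames, chunk_len, overlap, enhance_chunk):
    """Split frames [F, C, H, W] along time, run the whole pipeline on each
    chunk separately, then stitch the per-chunk results back together."""
    outputs, start = [], 0
    while start < frames.shape[0]:
        end = min(start + chunk_len, frames.shape[0])
        outputs.append(enhance_chunk(frames[start:end]))
        if end == frames.shape[0]:
            break
        start = end - overlap  # overlap chunks so the seams can be blended later
    return torch.cat(outputs, dim=0)  # naive concatenation of the chunk outputs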

loretoparisi commented 2 weeks ago

Thanks. It is actually okay to process 31 frames, but OOM with 97 frames, so the problem is too many frames. We already encode the frames one by one, but it still OOMs. Besides sliced and tiled VAE encoding, you can make chunks of these frames and process each chunk separately for both VAE encoding and all sampling steps. The existing code only makes chunks for the sampling steps; that is, all frames are split before denoising and then merged together after denoising.

OK, thank you very much @hejingwenhejingwen! For slicing, I was trying the encode part, which should look like this:

    @apply_forward_hook
    def encode(
        self, x: torch.Tensor, return_dict: bool = True
    ) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
        """
        Encode a batch of images into latents.

        Args:
            x (`torch.Tensor`): Input batch of images.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] instead of a plain
                tuple.

        Returns:
                The latent representations of the encoded images. If `return_dict` is True, a
                [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is
                returned.
        """

        # LP: added slicing - see https://github.com/Vchitect/VEnhancer/issues/20
        #h = self.encoder(x)
        if self.use_slicing and x.shape[0] > 1:
            encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
            h = torch.cat(encoded_slices)
        else:
            h = self.encoder(x)

        moments = self.quant_conv(h)
        posterior = DiagonalGaussianDistribution(moments)

        if not return_dict:
            return (posterior,)

        return AutoencoderKLOutput(latent_dist=posterior)

The decode part seems more complicated since you have num_frames there; I'm getting some dimensionality errors if I do something as simple as:

    @apply_forward_hook
    def decode(
        self,
        z: torch.Tensor,
        num_frames: int,
        return_dict: bool = True,
    ) -> Union[DecoderOutput, torch.Tensor]:
        """
        Decode a batch of images.

        Args:
            z (`torch.Tensor`): Input batch of latent vectors.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.

        Returns:
            [`~models.vae.DecoderOutput`] or `tuple`:
                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
                returned.

        """
        batch_size = z.shape[0] // num_frames
        image_only_indicator = torch.zeros(batch_size, num_frames, dtype=z.dtype, device=z.device)

        # LP: added slicing - see https://github.com/Vchitect/VEnhancer/issues/20
        #decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)

        if self.use_slicing and z.shape[0] > 1:
            decoded_slices = [self.decoder(z_slice, num_frames=num_frames, image_only_indicator=image_only_indicator) for z_slice in z.split(1)]
            decoded = torch.cat(decoded_slices)
        else:
            decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)

        if not return_dict:
            return (decoded,)

        return DecoderOutput(sample=decoded)

So I think my error was due to the fact that I have to consider num_frames in the split... trying this now!

loretoparisi commented 2 weeks ago

Okay, the decode part with slicing is done! 🥇 It works without issues and completes the interpolation correctly.

    @apply_forward_hook
    def decode(
        self,
        z: torch.Tensor,
        num_frames: int,
        return_dict: bool = True,
    ) -> Union[DecoderOutput, torch.Tensor]:
        """
        Decode a batch of images.

        Args:
            z (`torch.Tensor`): Input batch of latent vectors.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.

        Returns:
            [`~models.vae.DecoderOutput`] or `tuple`:
                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
                returned.
        """
        batch_size = z.shape[0] // num_frames
        image_only_indicator = torch.zeros(batch_size, num_frames, dtype=z.dtype, device=z.device)

        if self.use_slicing and z.shape[0] > 1:
            # Split the tensor based on the number of frames, not into individual slices
            z_slices = torch.split(z, num_frames)
            decoded_slices = [self.decoder(z_slice, num_frames=num_frames, image_only_indicator=image_only_indicator) for z_slice in z_slices]
            decoded = torch.cat(decoded_slices, dim=0)  # Concatenate along the batch dimension
        else:
            decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)

        if not return_dict:
            return (decoded,)

        return DecoderOutput(sample=decoded)

loretoparisi commented 2 weeks ago

@hejingwenhejingwen I realized the encode was not that good, because I was ignoring that the batch size was 1, so I took H and W into consideration.

    @apply_forward_hook
    def encode(
        self, x: torch.Tensor, return_dict: bool = True
    ) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
        """
        Encode a batch of images into latents.

        Args:
            x (`torch.Tensor`): Input batch of images.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] instead of a plain
                tuple.

        Returns:
                The latent representations of the encoded images. If `return_dict` is True, a
                [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is
                returned.
        """

        # x.shape  [1, 3, 1296, 1920] [B, C, W, H]
        w_patch_size=128
        h_patch_size=128
        if self.use_slicing and (x.shape[2] > w_patch_size or x.shape[3] > h_patch_size): 
            h_slices = []
            height_splits = x.split(h_patch_size, dim=2) 
            for h_slice in height_splits:
                width_splits = h_slice.split(w_patch_size, dim=3)
                encoded_width_slices = [self.encoder(w_slice) for w_slice in width_splits]
                h_slices.append(torch.cat(encoded_width_slices, dim=3))
            h = torch.cat(h_slices, dim=2)
        else:
            h = self.encoder(x)

        moments = self.quant_conv(h)
        posterior = DiagonalGaussianDistribution(moments)

        if not return_dict:
            return (posterior,)

        return AutoencoderKLOutput(latent_dist=posterior)

While it runs to the end and finishes the video while keeping memory under control, the quality degrades, as you can see here:

https://github.com/user-attachments/assets/7042e217-bcee-4664-8b7f-ee674e71c74d

hejingwenhejingwen commented 2 weeks ago

@hejingwenhejingwen I realized the encode was not that good, because I was ignoring that the batch size was 1, so I took H and W into consideration.

    @apply_forward_hook
    def encode(
        self, x: torch.Tensor, return_dict: bool = True
    ) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
        """
        Encode a batch of images into latents.

        Args:
            x (`torch.Tensor`): Input batch of images.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] instead of a plain
                tuple.

        Returns:
                The latent representations of the encoded images. If `return_dict` is True, a
                [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is
                returned.
        """

        # x.shape  [1, 3, 1296, 1920] [B, C, W, H]
        w_patch_size=128
        h_patch_size=128
        if self.use_slicing and (x.shape[2] > w_patch_size or x.shape[3] > h_patch_size): 
            h_slices = []
            height_splits = x.split(h_patch_size, dim=2) 
            for h_slice in height_splits:
                width_splits = h_slice.split(w_patch_size, dim=3)
                encoded_width_slices = [self.encoder(w_slice) for w_slice in width_splits]
                h_slices.append(torch.cat(encoded_width_slices, dim=3))
            h = torch.cat(h_slices, dim=2)
        else:
            h = self.encoder(x)

        moments = self.quant_conv(h)
        posterior = DiagonalGaussianDistribution(moments)

        if not return_dict:
            return (posterior,)

        return AutoencoderKLOutput(latent_dist=posterior)

While it runs to the end and finishes the video while keeping memory under control, the quality degrades, as you can see here:

iron_man.2.mp4

The patch size should be larger (e.g., 512x512) and have an overlap (e.g., 128).
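A minimal sketch of that overlapped-tile idea for the encoder, using the suggested 512x512 patches with a 128-pixel overlap; overlapping latent regions are simply averaged here (diffusers' tiled VAE uses a smoother linear blend), and the fixed downscale factor of 8 is an assumption:

import torch

@torch.no_grad()
def encode_in_overlapping_tiles(encoder, x, tile=512, overlap=128, downscale=8):
    """Encode x [B, C, H, W] tile by tile with spatial overlap, averaging the
    overlapped latent regions to reduce seams. encoder is any module mapping a
    [B, C, h, w] image tile to a [B, C_lat, h//downscale, w//downscale] latent."""
    b, _, h, w = x.shape
    stride = tile - overlap
    # Tile origins; make sure the last tile reaches the image border.
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    if ys[-1] + tile < h:
        ys.append(h - tile)
    if xs[-1] + tile < w:
        xs.append(w - tile)
    out = weight = None
    for y0 in ys:
        for x0 in xs:
            lat = encoder(x[:, :, y0:y0 + tile, x0:x0 + tile])
            if out is None:  # lazily allocate the latent canvas and weight map
                out = torch.zeros(b, lat.shape[1], h // downscale, w // downscale,
                                  device=lat.device, dtype=lat.dtype)
                weight = torch.zeros_like(out)
            ly, lx = y0 // downscale, x0 // downscale
            out[:, :, ly:ly + lat.shape[2], lx:lx + lat.shape[3]] += lat
            weight[:, :, ly:ly + lat.shape[2], lx:lx + lat.shape[3]] += 1
    return out / weight.clamp(min=1)  # average where tiles overlap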