loretoparisi opened 3 weeks ago
It is already fp16. We have also used tiled and sliced VAE decoding.
You can use multi-GPU inference if you have multiple GPUs. However, parallel inference for VAE decoding is not supported yet, so 24 GB may still not be enough.
We will work on parallel inference for VAE decoding in the future.
Okay, thank you. If I use 4x 24 GB GPUs, should it work, considering slicing is enabled, etc.?
Not sure. I think the VAE part is more likely to induce OOM; you can decrease the tile size (f, h, w) in VEnhancer/video_to_video/video_to_video_model_parallel.py, lines 172~174.
You can also change the chunk size (the frame length for one chunk). The chunk size is currently set to 32; you can use 24 or lower. Note that for a frame length of less than 32 we only use one chunk. There are some restrictions in VEnhancer/video_to_video/video_to_video_model_parallel.py; please comment them out.
To change the chunk size, please go here: https://github.com/Vchitect/VEnhancer/blob/80ffaa33988c583b129b730ce9d559b114de2d8c/video_to_video/utils/util.py#L31
It's quite inconvenient; I will make these visible to users by providing more configuration parameters in the command script.
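For reference, that change amounts to lowering those fields in place. For example (smaller values reduce peak VRAM at the cost of speed; these are the values confirmed to fit on 24 GB further down in this thread):

self.frame_chunk_size = 3   # frames decoded per VAE chunk
self.tile_img_height = 576  # spatial tile height in pixels
self.tile_img_width = 768   # spatial tile width in pixels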
Thank you! While trying to adjust the chunk size, I ran a multi-GPU test:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 48C P0 216W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 44C P0 220W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 43C P0 223W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 44C P0 215W / 300W | 20644MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
I got an OOM, but only after some processing:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.81 GiB (GPU 3; 22.19 GiB total capacity; 17.28 GiB already allocated; 2.65 GiB free; 19.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2751 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2752 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2753 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 2754) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/coder/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
enhance_a_video_MultiGPU.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-17_16:15:25
host :
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2754)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The whole stack trace was:
vae/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 609/609 [00:00<00:00, 7.14MB/s]
diffusion_pytorch_model.fp16.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 196M/196M [00:00<00:00, 385MB/s]
2024-09-17 16:09:34,646 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:34,646 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:34,693 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:34,693 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:34,693 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:34,733 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:34,733 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:34,733 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:34,733 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:34,781 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:09:35,113 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:35,113 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:35,143 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:35,143 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:35,143 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:35,143 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:35,143 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:35,172 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:35,172 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:35,172 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:35,182 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:35,182 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:35,182 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:35,182 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:35,189 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:09:35,211 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:35,212 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:35,212 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:35,212 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:35,218 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:09:35,425 - video_to_video - INFO - processing video 0, file_path: prompts/astronaut.mp4
2024-09-17 16:09:35,425 - video_to_video - INFO - text: An astronaut flying in space, featuring a steady and smooth perspective
2024-09-17 16:09:35,454 - video_to_video - INFO - input frames length: 16
2024-09-17 16:09:35,454 - video_to_video - INFO - input fps: 10.0
2024-09-17 16:09:35,454 - video_to_video - INFO - target_fps: 20.0
2024-09-17 16:09:35,493 - video_to_video - INFO - input resolution: (320, 512)
2024-09-17 16:09:35,493 - video_to_video - INFO - target resolution: (1214, 1942)
2024-09-17 16:09:35,494 - video_to_video - INFO - noise augmentation: 250
2024-09-17 16:09:35,494 - video_to_video - INFO - scale s is set to: 8
2024-09-17 16:09:35,500 - video_to_video - INFO - video_data shape: torch.Size([31, 3, 1214, 1942])
2024-09-17 16:10:00,041 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,863 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:00,865 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:00,871 - video_to_video - INFO - step: 0
2024-09-17 16:10:00,887 - video_to_video - INFO - step: 0
2024-09-17 16:10:01,277 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - complete input shape: torch.Size([1, 4, 31, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - sharded input shape: torch.Size([1, 4, 8, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - hint shape: torch.Size([1, 4, 16, 154, 248])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - t_hint shape: torch.Size([1])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - mask_cond shape: torch.Size([1, 31])
2024-09-17 16:10:13,070 - video_to_video.modules.unet_v2v_parallel - INFO - s_cond shape: torch.Size([1])
2024-09-17 16:10:13,071 - video_to_video.modules.unet_v2v_parallel - INFO - complete f: 31
[... steps 1 through 13 repeat the same per-rank shape logs as step 0, one step roughly every 22 seconds ...]
2024-09-17 16:15:17,860 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,935 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,935 - video_to_video - INFO - sampling, finished.
2024-09-17 16:15:17,937 - video_to_video - INFO - sampling, finished.
It seems that you have already finished sampling for the diffusion part, so the OOM is caused by VAE decoding.
Please go here:
https://github.com/Vchitect/VEnhancer/blob/80ffaa33988c583b129b730ce9d559b114de2d8c/video_to_video/video_to_video_model_parallel.py#L172
For example, you can make some modifications:
self.frame_chunk_size = 3
self.tile_img_height = 576
self.tile_img_width = 768
Would those modifications reduce the quality of the output, or just slow down processing?
I don't see obvious quality loss, but I've only tested a few samples.
@hejingwenhejingwen 🥇 it worked!
https://github.com/user-attachments/assets/27d6707d-285c-4dc4-8874-c59085302308
https://github.com/user-attachments/assets/54df614b-88fc-4512-b972-8210497047c7
Thank you, so I have applied these modifications:
I passed max_chunk_len as a parameter to make_chunks in video_to_video/utils/util.py:
def make_chunks(f_num, interp_f_num, chunk_overlap_ratio=0.5, max_chunk_len=32):
    # Maximum overlap between consecutive chunks, in frames.
    MAX_O_LEN = max_chunk_len * chunk_overlap_ratio
    # Snap chunk and overlap lengths to the form k * (interp_f_num + 1) + 1
    # so chunk boundaries align with the interpolated frames.
    chunk_len = int((max_chunk_len - 1) // (1 + interp_f_num) * (interp_f_num + 1) + 1)
    o_len = int((MAX_O_LEN - 1) // (1 + interp_f_num) * (interp_f_num + 1) + 1)
    chunk_inds = sliding_windows_1d(f_num, chunk_len, o_len)
    return chunk_inds
in order to set it to 24, for example, in video_to_video_model.py:
max_chunk_len = 24  # default was 32
torch.cuda.empty_cache()
chunk_inds = make_chunks(frames_num, interp_f_num, max_chunk_len=max_chunk_len)
In the same way, as you suggested, I passed those params to tiled_chunked_decode (in video_to_video_model_parallel.py):
logger.info(f"sampling, finished.")
frame_chunk_size = 3
tile_img_height = 576
tile_img_width = 768
gen_video = self.tiled_chunked_decode(gen_vid,
frame_chunk_size=frame_chunk_size,
tile_img_height=tile_img_height,
tile_img_width=tile_img_width)
NOTES.
I have also added pip install accelerate, and, to solve the ImportError: libGL.so.1: cannot open shared object file: No such file or directory that may happen on Ubuntu (this can also happen with CogVideoX on some platforms), I ran:
sudo apt-get update && sudo apt-get install ffmpeg libsm6 libxext6 -y
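As a further note, the OOM message above also suggests tuning the allocator. A minimal sketch, untested for this particular case (the variable must be set before the first CUDA allocation):

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # reduce fragmentation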
@hejingwenhejingwen More tests. I'm trying CogVideoX generation now, and, per your detailed description above, the OOM is caused by the higher number of frames (CogVideoX defaults to 49 frames).
2024-09-18 09:56:18,427 - video_to_video - INFO - checkpoint_path: ckpts/venhancer_v2.pt
2024-09-18 09:56:30,356 - video_to_video - INFO - Build encoder with FrozenOpenCLIPEmbedder
2024-09-18 09:56:49,183 - video_to_video - INFO - Load model path ckpts/venhancer_v2.pt, with local status <All keys matched successfully>
2024-09-18 09:56:49,184 - video_to_video - INFO - Build diffusion with GaussianDiffusion
2024-09-18 09:56:49,966 - video_to_video - INFO - Load model path ckpts/venhancer_v2.pt, with local status <All keys matched successfully>
2024-09-18 09:56:49,967 - video_to_video - INFO - Build diffusion with GaussianDiffusion
and the frame info:
2024-09-18 09:56:50,232 - video_to_video - INFO - input frames length: 49
2024-09-18 09:56:50,232 - video_to_video - INFO - input fps: 12.0
2024-09-18 09:56:50,248 - video_to_video - INFO - target_fps: 24.0
2024-09-18 09:56:50,503 - video_to_video - INFO - input resolution: (480, 720)
2024-09-18 09:56:50,503 - video_to_video - INFO - target resolution: (1254, 1880)
2024-09-18 09:56:50,503 - video_to_video - INFO - noise augmentation: 250
2024-09-18 09:56:50,503 - video_to_video - INFO - scale s is set to: 4.0
2024-09-18 09:56:50,535 - video_to_video - INFO - video_data shape: torch.Size([97, 3, 1254, 1880])
In this case I get the OOM, I would say in the VAE encode step:
video_data_feature = self.vae_encode(video_data)
stacktrace:
File "/home/coder/.local/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_temporal_decoder.py", line 334, in encode
h = self.encoder(x)
File "/home/coder/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/autoencoders/vae.py", line 175, in forward
sample = self.mid_block(sample)
File "/home/coder/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 738, in forward
hidden_states = attn(hidden_states, temb=temb)
File "/home/coder/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 490, in forward
return self.processor(
File "/home/coder/.local/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 2216, in __call__
hidden_states = F.scaled_dot_product_attention(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.63 GiB (GPU 2; 22.19 GiB total capacity; 18.40 GiB already allocated; 2.31 GiB free; 19.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 77142) of binary: /usr/bin/python3
In this case I tried to handle the 49 frames on 4x 24 GB; the setup used a chunking of 4/12 frames per chunk in make_chunks, without success:
max_chunk_len = 12
torch.cuda.empty_cache()
chunk_inds = make_chunks(frames_num, interp_f_num, max_chunk_len=max_chunk_len)
I kept the setup for tiled_chunked_decode, but in fact that code is never reached:
frame_chunk_size = 3
tile_img_height = 576
tile_img_width = 768
gen_video = self.tiled_chunked_decode(gen_vid,
                                      frame_chunk_size=frame_chunk_size,
                                      tile_img_height=tile_img_height,
                                      tile_img_width=tile_img_width)
So, is there a rule of thumb for calculating the VRAM requirements from the image H and W, given the FPS and steps, so that max_chunk_len can be set in advance?
Thank you!
It is because of VAE encoding; I will add sliced encoding to avoid this.
Great. In fact, I was looking at this implementation for the AutoencoderKLCogVideoX class. Interestingly, the authors wrote down VRAM-needs notes:
# Rough memory assessment:
# - In CogVideoX-2B, there are a total of 24 CausalConv3d layers.
# - The biggest intermediate dimensions are: [1, 128, 9, 480, 720].
# - Assume fp16 (2 bytes per value).
# Memory required: 1 * 128 * 9 * 480 * 720 * 24 * 2 / 1024**3 = 17.8 GB
#
# Memory assessment when using tiling:
# - Assume everything as above but now HxW is 240x360 by tiling in half
# Memory required: 1 * 128 * 9 * 240 * 360 * 24 * 2 / 1024**3 = 4.5 GB
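The same style of arithmetic gives a first-order answer to my rule-of-thumb question above (my own back-of-envelope, not an official formula from either repo): estimate the fp16 size of the largest tensors you expect and leave generous headroom for intermediate activations.

def video_tensor_gib(frames, height, width, channels=3, bytes_per_val=2):
    # fp16 size of one [F, C, H, W] tensor, in GiB
    return frames * channels * height * width * bytes_per_val / 1024**3

print(video_tensor_gib(97, 1254, 1880))  # ~1.28 GiB for the 97-frame raw frames alone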
Thanks. Actually, it is okay to process 31 frames, but OOM occurs with 97 frames, so the problem is too many frames. We actually already encode the frames one by one, but it still OOMs. Besides sliced and tiled VAE encoding, you can make chunks of these frames and process each chunk separately for both VAE encoding and all sampling steps. The existing code can only make chunks for each sampling step; that is, all frames are split before denoising and then merged back together after denoising.
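A minimal sketch of that whole-pipeline chunking (my own structure; make_chunks, vae_encode, and tiled_chunked_decode are from this thread, while sample and blend_overlaps are hypothetical placeholders):

chunk_inds = make_chunks(frames_num, interp_f_num, max_chunk_len=24)
outputs = []
for start, end in chunk_inds:
    chunk = video_data[start:end]                # frames for this chunk only
    latents = vae_encode(chunk)                  # VAE encoding per chunk
    denoised = sample(latents)                   # all sampling steps per chunk
    outputs.append(tiled_chunked_decode(denoised))
gen_video = blend_overlaps(outputs, chunk_inds)  # merge, blending the overlapped frames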
OK, thank you very much @hejingwenhejingwen! For slicing, I tried the encode part, which should look like:
@apply_forward_hook
def encode(
    self, x: torch.Tensor, return_dict: bool = True
) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
    """
    Encode a batch of images into latents.

    Args:
        x (`torch.Tensor`): Input batch of images.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] instead of a plain
            tuple.

    Returns:
        The latent representations of the encoded images. If `return_dict` is True, a
        [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is
        returned.
    """
    # LP: added slicing - see https://github.com/Vchitect/VEnhancer/issues/20
    # h = self.encoder(x)
    if self.use_slicing and x.shape[0] > 1:
        # Encode one batch element at a time to bound peak activation memory.
        encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
        h = torch.cat(encoded_slices)
    else:
        h = self.encoder(x)
    moments = self.quant_conv(h)
    posterior = DiagonalGaussianDistribution(moments)
    if not return_dict:
        return (posterior,)
    return AutoencoderKLOutput(latent_dist=posterior)
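For this branch to trigger, use_slicing has to be set; on diffusers autoencoders that is normally done with the enable_slicing() toggle (assuming this VAE class exposes the same flag):

vae.enable_slicing()  # sets self.use_slicing = True, so encode() takes the sliced path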
The decode part seems more complicated, since it has that num_frames argument; I'm getting some dimensionality errors if I do something as simple as:
@apply_forward_hook
def decode(
    self,
    z: torch.Tensor,
    num_frames: int,
    return_dict: bool = True,
) -> Union[DecoderOutput, torch.Tensor]:
    """
    Decode a batch of images.

    Args:
        z (`torch.Tensor`): Input batch of latent vectors.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.

    Returns:
        [`~models.vae.DecoderOutput`] or `tuple`:
            If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
            returned.
    """
    batch_size = z.shape[0] // num_frames
    image_only_indicator = torch.zeros(batch_size, num_frames, dtype=z.dtype, device=z.device)
    # LP: added slicing - see https://github.com/Vchitect/VEnhancer/issues/20
    # decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
    if self.use_slicing and z.shape[0] > 1:
        # (this naive per-element split is the attempt that fails; see below)
        decoded_slices = [self.decoder(z_slice, num_frames=num_frames, image_only_indicator=image_only_indicator) for z_slice in z.split(1)]
        decoded = torch.cat(decoded_slices)
    else:
        decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
    if not return_dict:
        return (decoded,)
    return DecoderOutput(sample=decoded)
So I think my error was due to the fact that I have to take num_frames into account in the split... trying this now!
Okay, the decode part with slicing is done! 🥇 It works without issues, completing the interpolation correctly.
@apply_forward_hook
def decode(
    self,
    z: torch.Tensor,
    num_frames: int,
    return_dict: bool = True,
) -> Union[DecoderOutput, torch.Tensor]:
    """
    Decode a batch of images.

    Args:
        z (`torch.Tensor`): Input batch of latent vectors.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.

    Returns:
        [`~models.vae.DecoderOutput`] or `tuple`:
            If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
            returned.
    """
    batch_size = z.shape[0] // num_frames
    image_only_indicator = torch.zeros(batch_size, num_frames, dtype=z.dtype, device=z.device)
    if self.use_slicing and z.shape[0] > 1:
        # Split the tensor based on the number of frames, not into individual slices
        z_slices = torch.split(z, num_frames)
        decoded_slices = [self.decoder(z_slice, num_frames=num_frames, image_only_indicator=image_only_indicator) for z_slice in z_slices]
        decoded = torch.cat(decoded_slices, dim=0)  # Concatenate along the batch dimension
    else:
        decoded = self.decoder(z, num_frames=num_frames, image_only_indicator=image_only_indicator)
    if not return_dict:
        return (decoded,)
    return DecoderOutput(sample=decoded)
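A toy shape check of the corrected split (my own example numbers, shapes only):

import torch
z = torch.randn(62, 4, 152, 240)  # e.g. two 31-frame clips stacked on dim 0
slices = torch.split(z, 31)       # -> two tensors of shape [31, 4, 152, 240]
print([tuple(s.shape) for s in slices])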
@hejingwenhejingwen I realized the encode was not that good, because I was ignoring that the batch size was 1, so I took H and W into consideration instead.
@apply_forward_hook
def encode(
    self, x: torch.Tensor, return_dict: bool = True
) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
    """
    Encode a batch of images into latents.

    Args:
        x (`torch.Tensor`): Input batch of images.
        return_dict (`bool`, *optional*, defaults to `True`):
            Whether to return a [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] instead of a plain
            tuple.

    Returns:
        The latent representations of the encoded images. If `return_dict` is True, a
        [`~models.autoencoders.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is
        returned.
    """
    # x.shape [1, 3, 1296, 1920] [B, C, H, W]
    w_patch_size = 128
    h_patch_size = 128
    if self.use_slicing and (x.shape[2] > h_patch_size or x.shape[3] > w_patch_size):
        h_slices = []
        # Split along height (dim=2), then along width (dim=3), encoding each patch.
        height_splits = x.split(h_patch_size, dim=2)
        for h_slice in height_splits:
            width_splits = h_slice.split(w_patch_size, dim=3)
            encoded_width_slices = [self.encoder(w_slice) for w_slice in width_splits]
            h_slices.append(torch.cat(encoded_width_slices, dim=3))
        h = torch.cat(h_slices, dim=2)
    else:
        h = self.encoder(x)
    moments = self.quant_conv(h)
    posterior = DiagonalGaussianDistribution(moments)
    if not return_dict:
        return (posterior,)
    return AutoencoderKLOutput(latent_dist=posterior)
While it runs to completion and keeps memory under control, the quality degrades, as you can see here:
https://github.com/user-attachments/assets/7042e217-bcee-4664-8b7f-ee674e71c74d
The patch size should be larger (e.g., 512x512) and have overlap (e.g., 128).
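A minimal sketch of what such overlapped tiling could look like (my own illustration, not VEnhancer's implementation; the fixed downsampling factor and plain averaging of overlaps are assumptions):

import torch

def encode_tiled(encoder, x, tile=512, overlap=128, down=8):
    # Encode overlapping spatial tiles and average the overlapped latent regions.
    stride = tile - overlap
    _, _, H, W = x.shape
    out = weight = None
    for top in range(0, max(H - overlap, 1), stride):
        for left in range(0, max(W - overlap, 1), stride):
            z = encoder(x[:, :, top:top + tile, left:left + tile])
            if out is None:
                out = x.new_zeros((x.shape[0], z.shape[1], H // down, W // down))
                weight = torch.zeros_like(out)
            t, l = top // down, left // down
            out[:, :, t:t + z.shape[2], l:l + z.shape[3]] += z
            weight[:, :, t:t + z.shape[2], l:l + z.shape[3]] += 1
    return out / weight.clamp(min=1)

Averaging the overlapped latents is what removes the hard seams that disjoint 128-pixel patches produce.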
I'm getting an OOM running the
using the
Error
Is it possible to apply BF16 quantization? My approach to running CogVideoX in 24 GB is tiled VAE decoding and sliced VAE decoding, plus CPU offload, and running the pipe in BF16. To quantize the model I use torchao. Not sure whether this can be applied to your model too.
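For reference, a sketch of that CogVideoX setup with diffusers (my configuration, not something from VEnhancer; the torchao quantization step is omitted here):

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # CPU offload
pipe.vae.enable_tiling()         # tiled VAE decoding
pipe.vae.enable_slicing()        # sliced VAE decoding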