Vchitect / VEnhancer

Official code of VEnhancer: Generative Space-Time Enhancement for Video Generation
https://vchitect.github.io/VEnhancer-project/

Multi-GPU Inference Support or Video Splitting for Long Video Processing #6

Open zRzRzRzRzRzRzR opened 1 month ago

zRzRzRzRzRzRzR commented 1 month ago

We are working with videos that range from 6 to 10 seconds in length, which obviously leads to Out Of Memory (OOM) errors during processing. We have access to high-performance hardware, such as multiple A100 GPUs.

  1. Is there a way to implement multi-GPU inference to handle these longer videos? If so, could you provide guidance on how to set it up?
  2. If multi-GPU inference is not supported, is there a method to split the video into smaller segments for processing? We are concerned that splitting the video might degrade the final output quality. Could you suggest the best practices to minimize quality loss in this scenario?
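For question 2, a minimal chunk-planning sketch (purely illustrative, not part of VEnhancer; the chunk and overlap lengths are arbitrary assumptions) of how a long clip could be split into overlapping segments whose shared frames can later be blended to hide seams:

```python
# Plan overlapping frame ranges so each chunk fits in GPU memory; the overlap
# region can be cross-faded or averaged when the enhanced chunks are stitched back.
def plan_chunks(n_frames: int, chunk_len: int = 32, overlap: int = 8):
    chunks, start = [], 0
    while start < n_frames:
        end = min(start + chunk_len, n_frames)
        chunks.append((start, end))
        if end == n_frames:
            break
        start = end - overlap
    return chunks

print(plan_chunks(145))  # [(0, 32), (24, 56), (48, 80), (72, 104), (96, 128), (120, 145)]
```

Larger overlaps generally hide seams better, at the cost of redundant computation on the shared frames.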
hejingwenhejingwen commented 4 weeks ago

I am working on processing arbitrary long videos. The update will be released in two days.

hejingwenhejingwen commented 4 weeks ago

Hi, please check the results here: https://github.com/Vchitect/VEnhancer/issues/8

zRzRzRzRzRzRzR commented 4 weeks ago

Sure, I'll check this ASAP, thanks!

zRzRzRzRzRzRzR commented 4 weeks ago

Has any ckpt changed? I found that it now needs to load the laion2b_s32b_b79k model.

hejingwenhejingwen commented 4 weeks ago

The ckpts are the same as the previous ones. The laion2b_s32b_b79k model is: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/resolve/main/open_clip_pytorch_model.bin
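If the automatic download of that file is slow or blocked, it can be fetched ahead of time; a small sketch assuming the `huggingface_hub` package (the target directory is an arbitrary choice, adjust it to wherever your setup looks for the weights):

```python
# Pre-download the OpenCLIP weights referenced above so they are available locally.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
    filename="open_clip_pytorch_model.bin",
    local_dir="./ckpts",  # assumption: pick any directory your config points to
)
print("saved to", path)
```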

zRzRzRzRzRzRzR commented 4 weeks ago

```
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
2024-08-20 13:25:17,553 - video_to_video - INFO - checkpoint_path: ./ckpts/venhancer_paper.pt
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/open_clip/factory.py:88: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=map_location)
2024-08-20 13:25:37,486 - video_to_video - INFO - Build encoder with FrozenOpenCLIPEmbedder
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:35: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  load_dict = torch.load(cfg.model_path, map_location='cpu')
2024-08-20 13:25:55,391 - video_to_video - INFO - Load model path ./ckpts/venhancer_paper.pt, with local status <All keys matched successfully>
2024-08-20 13:25:55,392 - video_to_video - INFO - Build diffusion with GaussianDiffusion
2024-08-20 13:26:16,092 - video_to_video - INFO - input video path: inputs/000000.mp4
2024-08-20 13:26:16,093 - video_to_video - INFO - text: Wide-angle aerial shot at dawn,soft morning light casting long shadows,an elderly man walking his dog through a quiet,foggy park,trees and benches in the background,peaceful and serene atmosphere
2024-08-20 13:26:16,156 - video_to_video - INFO - input frames length: 49
2024-08-20 13:26:16,156 - video_to_video - INFO - input fps: 8.0
2024-08-20 13:26:16,156 - video_to_video - INFO - target_fps: 24.0
2024-08-20 13:26:16,311 - video_to_video - INFO - input resolution: (480, 720)
2024-08-20 13:26:16,312 - video_to_video - INFO - target resolution: (1320, 1982)
2024-08-20 13:26:16,312 - video_to_video - INFO - noise augmentation: 250
2024-08-20 13:26:16,312 - video_to_video - INFO - scale s is set to: 8
2024-08-20 13:26:16,399 - video_to_video - INFO - video_data shape: torch.Size([145, 3, 1320, 1982])
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:108: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=True):
2024-08-20 13:27:19,605 - video_to_video - INFO - step: 0
2024-08-20 13:30:12,020 - video_to_video - INFO - step: 1
2024-08-20 13:33:04,956 - video_to_video - INFO - step: 2
2024-08-20 13:35:58,691 - video_to_video - INFO - step: 3
2024-08-20 13:38:51,254 - video_to_video - INFO - step: 4
2024-08-20 13:41:44,150 - video_to_video - INFO - step: 5
2024-08-20 13:44:37,017 - video_to_video - INFO - step: 6
2024-08-20 13:47:30,037 - video_to_video - INFO - step: 7
2024-08-20 13:50:22,838 - video_to_video - INFO - step: 8
2024-08-20 13:53:15,844 - video_to_video - INFO - step: 9
2024-08-20 13:56:08,657 - video_to_video - INFO - step: 10
2024-08-20 13:59:01,648 - video_to_video - INFO - step: 11
2024-08-20 14:01:54,541 - video_to_video - INFO - step: 12
2024-08-20 14:04:47,488 - video_to_video - INFO - step: 13
2024-08-20 14:10:13,637 - video_to_video - INFO - sampling, finished.
```

So slow, is this normal? Running on a single A100.

hejingwenhejingwen commented 4 weeks ago

Sadly, it is normal. It makes sense because you are processing high-resolution and high-frame-rate videos. Multi-GPU inference may help, but don't expect too much :(
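For reference, the timestamps in the log above work out to roughly three minutes per diffusion step; a quick sanity-check sketch (the 15-step count comes from the "fast" solver mode discussed later in this thread):

```python
# Rough estimate from the log: "step: 0" at 13:27:19, "sampling, finished." at 14:10:13,
# and the "fast" solver runs a fixed 15 steps.
from datetime import datetime

start = datetime.strptime("13:27:19", "%H:%M:%S")
end = datetime.strptime("14:10:13", "%H:%M:%S")
total_s = (end - start).total_seconds()
print(f"{total_s / 60:.0f} min total, ~{total_s / 15:.0f} s per step")  # ~43 min, ~172 s per step
```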

zRzRzRzRzRzRzR commented 4 weeks ago

> Sadly, it is normal. It makes sense because you are processing high-resolution and high-frame-rate videos. Multi-GPU inference may help, but don't expect too much :(

How do I configure that? I didn't see it in the README. And by the way, it's necessary to set the prompt to be the same as the one used to generate the video in CogVideoX, right?

hejingwenhejingwen commented 4 weeks ago

Multi-GPU inference is not supported right now, but we are working on it. VEnhancer is trained mostly with short captions, so I'm not sure it can understand long captions. It may generate unpleasant textures (not sure) if you provide too many words. More importantly, the CLIP text encoder we use has a limit of 77 tokens.
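Since the 77-token limit is the hard constraint, it is easy to check how much of a prompt the text encoder will actually keep; a sketch assuming the `open_clip` tokenizer that the log above loads (the prompt is just the example from the log):

```python
# Count how many of the 77 CLIP token slots a prompt uses; anything beyond is truncated.
import open_clip

prompt = ("Wide-angle aerial shot at dawn, soft morning light casting long shadows, "
          "an elderly man walking his dog through a quiet, foggy park")
tokenizer = open_clip.get_tokenizer("ViT-H-14")
tokens = tokenizer([prompt])       # LongTensor of shape (1, 77), zero-padded
used = int((tokens != 0).sum())    # non-padding tokens, including start/end markers
print(f"{used} of 77 token slots used")
```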

zRzRzRzRzRzRzR commented 4 weeks ago

Oh, that’s an issue, because CogVideoX supports long text prompts, typically exceeding 77 words and usually around 150-220 words.

I’d like to know how to reproduce your rendered video. How should the prompt be written, given that the original video prompt is longer than 77 words?

hejingwenhejingwen commented 4 weeks ago

I only use the first sentence. For example: "The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope."

The results that I present in the README were not produced with the released VEnhancer checkpoint. The released VEnhancer has powerful generative ability and is better suited to lower-quality, lower-resolution AIGC videos. But CogVideoX can already produce good videos, so I used another checkpoint just for enhancing temporal consistency and removing unpleasant textures.
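A trivial way to follow the "first sentence only" advice for a long CogVideoX prompt (a sketch; the long prompt here is just an illustration and the splitting rule is a simple heuristic):

```python
# Keep only the first sentence of a long prompt before passing it to VEnhancer.
import re

long_prompt = (
    "The camera follows behind a white vintage SUV with a black roof rack as it speeds up "
    "a steep dirt road surrounded by pine trees on a steep mountain slope. Dust kicks up "
    "from its tires as it accelerates along the winding path."
)
first_sentence = re.split(r"(?<=[.!?])\s+", long_prompt.strip())[0]
print(first_sentence)
```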

zRzRzRzRzRzRzR commented 4 weeks ago

So, with the released version, it’s possible to reproduce the results if I only use the first sentence of the prompt? I’m currently writing the quick start guide for this and preparing to post it in the CogVideoX community, so I need to confirm this :)

hejingwenhejingwen commented 4 weeks ago

The released ckpt, with up_scale=3, noise_aug=200, target_fps=24, and prompt="A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea".

It will produce results like this:

https://github.com/user-attachments/assets/f3289e4f-9ba3-48a4-985b-f0c6eff3dfea

If you are happy with this, you can use the above parameters. Actually, up_scale can be set to 2 if you cannot wait, but the quality will degrade. Besides, fps >= 16 is already very smooth, so you can also lower target_fps to 16. noise_aug controls the refinement strength; it depends on the user's preference.
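On how target_fps drives the cost: the numbers in the earlier log are consistent with plain frame interpolation, as in this small arithmetic sketch (not VEnhancer's actual code):

```python
# 49 input frames at 8 fps interpolated to 24 fps inserts 2 frames between each pair,
# i.e. (49 - 1) * 3 + 1 = 145 frames, matching the logged video_data shape.
def interpolated_frame_count(n_frames: int, input_fps: float, target_fps: float) -> int:
    ratio = round(target_fps / input_fps)    # 24 / 8 -> 3
    return (n_frames - 1) * ratio + 1

print(interpolated_frame_count(49, 8.0, 24.0))  # 145
print(interpolated_frame_count(49, 8.0, 16.0))  # 97: target_fps=16 means fewer frames to denoise
```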

zRzRzRzRzRzRzR commented 4 weeks ago

https://github.com/THUDM/CogVideo/pull/143/files#diff-9e657cda0980a4aee4b86550d3640347df4f55f3ac3a827132471681fdc7f52c

Does this guide work? (I tested it and it works for me.) If it's OK, I will push it.

hejingwenhejingwen commented 4 weeks ago

> https://github.com/THUDM/CogVideo/pull/143/files#diff-9e657cda0980a4aee4b86550d3640347df4f55f3ac3a827132471681fdc7f52c
>
> Does this guide work? (I tested it and it works for me.) If it's OK, I will push it.

- up_scale is recommended to be 3 or 4, or 2 if the resolution of the input video is already high. The target resolution is limited to around 2k and below.
- The noise_aug value depends on the input video quality. Lower quality needs higher noise levels, which correspond to stronger refinement. 250~300 is for very low-quality videos; good videos: <= 200.
- If you want fewer steps, please change solver_mode to "normal" first, then lower the number of steps. The "fast" solver_mode has a fixed number of steps (15).

These are my comments. Thanks for your work!
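For what it's worth, the recommendations above boil down to something like the following helper; all thresholds are taken from the comment, while the function itself is only an illustration, not part of the repo:

```python
# Map the rules of thumb above to concrete parameter choices.
def recommend_params(input_is_high_res: bool, input_quality: str) -> dict:
    up_scale = 2 if input_is_high_res else 3             # 3-4 normally; 2 if input is already high-res
    noise_aug = {"very_low": 300, "low": 250}.get(input_quality, 200)  # good videos: <= 200
    solver_mode = "fast"                                  # fixed 15 steps; use "normal" to reduce steps
    return {"up_scale": up_scale, "noise_aug": noise_aug, "solver_mode": solver_mode}

print(recommend_params(input_is_high_res=False, input_quality="good"))
```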