THUDM / CogVideo

Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Slow model loading and long delays in image-to-video generation #316

Closed: Neethan54 closed this issue 20 hours ago

Neethan54 commented 1 month ago

Hi,

I am facing an issue with slow model loading and with the time it takes to generate a video from an image. Currently it takes 8 minutes for an 8-second video. I have 48 GB of VRAM, but it is still very slow.

Please let me know if there is any way to solve this.

This is the code I'm using:

import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video, load_image
print('loading I2V model...')
pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    transformer=CogVideoXTransformer3DModel.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16
).to("cuda")

import random
seed = random.randint(0, 2**8 - 1)
print('loading image..')
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
    )
prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."

negative_prompt ="The video is not of a high quality, it has a low resolution. Strange motion trajectory. Flickering, Blurriness, Face restore.Deformation, anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured "
video_pt = pipe_image(
            image=image,
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=50,
            num_videos_per_prompt=1,
            use_dynamic_cfg=True,
            output_type="pt",
            guidance_scale=7.0,
            num_frames=49,
            generator=torch.Generator(device="cuda").manual_seed(seed),
        ).frames

from diffusers.image_processor import VaeImageProcessor

# Convert each generated video (a tensor of frames) into a list of PIL images.
batch_video_frames = []
batch_size = video_pt.shape[0]
for batch_idx in range(batch_size):
    pt_image = video_pt[batch_idx]
    pt_image = torch.stack([pt_image[i] for i in range(pt_image.shape[0])])

    image_np = VaeImageProcessor.pt_to_numpy(pt_image)
    image_pil = VaeImageProcessor.numpy_to_pil(image_np)
    batch_video_frames.append(image_pil)

export_to_video(batch_video_frames[0], "videos/output.mp4", fps=8)

Thanks in advance.

zRzRzRzRzRzRzR commented 1 month ago

What GPU are you using? It shouldn't be this slow. Also, the video should be 6 seconds long; can you calculate how long the average step takes?
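A minimal sketch for timing this, assuming the pipe_image call from the code above (the per-step figure includes VAE decoding, so it slightly overestimates the step cost):

import time

num_steps = 50
start = time.time()
frames = pipe_image(
    image=image,
    prompt=prompt,
    num_inference_steps=num_steps,
    num_frames=49,
).frames
elapsed = time.time() - start
print(f"total: {elapsed:.1f}s, avg per step: {elapsed / num_steps:.2f}s")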

Neethan54 commented 1 month ago

The GPU details are shown below: ![image](https://github.com/user-attachments/assets/1a92da51-ebdd-42c6-90e8-2d42413ae2d6)

Neethan54 commented 1 month ago

Yes, the video is 6 seconds long.

zRzRzRzRzRzRzR commented 1 month ago

This speed is clearly incorrect. However, for hardware like yours, I suggest operating according to this plan:

[image]
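The screenshot most likely shows the memory-saving settings from the repository README, along these lines (an assumption, since only the image was posted):

pipe_image.enable_sequential_cpu_offload()  # stream weights to the GPU layer by layer to save VRAM
pipe_image.vae.enable_slicing()             # decode the latents in slices to lower peak memory
pipe_image.vae.enable_tiling()              # decode each frame in tiles, further reducing VRAM use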

This will significantly increase the speed.

Neethan54 commented 1 month ago

Hi @zRzRzRzRzRzRzR ,

I tried your suggestion, but now it is taking 14 minutes for a 6-second video. Below is the code I'm using:

pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    transformer=CogVideoXTransformer3DModel.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16
) 

pipe_image.enable_sequential_cpu_offload()

seed = random.randint(0, 2**8 - 1)
prompt='A worker talking to his supervisor in an construction site. High quality, masterpiece, best quality, highres, ultra-detailed, fantastic.'
img_path='images/image_3.png'
from PIL import Image
# Resize to the expected 720x480 input and pass the resized image to the pipeline.
image = Image.open(img_path).convert("RGB").resize((720, 480))
negative_prompt ="The video is not of a high quality, it has a low resolution. Strange motion trajectory. Flickering, Blurriness, Face restore.Deformation, anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured "
video_pt = pipe_image(
            image=image,
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=50,
            num_videos_per_prompt=1,
            use_dynamic_cfg=True,
            output_type="pt",
            guidance_scale=7.0,
            num_frames=49,
            generator=torch.Generator(device="cuda").manual_seed(seed),
        ).frames

Please let me know if I'm doing anything wrong.

zRzRzRzRzRzRzR commented 1 month ago

This code is correct; I don't see any errors.

video_pt = pipe_image(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    num_videos_per_prompt=1,
    use_dynamic_cfg=True,
    output_type="pt",
    guidance_scale=7.0,
    num_frames=49,
    generator=torch.Generator(device="cuda").manual_seed(seed),
).frames[0]

Did this call take 14 minutes? Our speed test measures only this step.

zRzRzRzRzRzRzR commented 1 month ago

This is clearly not A6000-level performance; even a T4 is faster than this.

Neethan54 commented 1 month ago

Yes, surprisingly, it is taking 14 minutes.

Neethan54 commented 1 month ago

Hi @zRzRzRzRzRzRzR

How much time does it take for you to generate a 6-second video?

zRzRzRzRzRzRzR commented 1 month ago

On an A100 it takes about 180 seconds with the 5B model.

Neethan54 commented 1 month ago

Can you please share the code? I want to test it on the A6000.

Shiroha-Key commented 1 month ago

I used a 3090 with the default cli_demo; it takes 12 minutes for a 6-second video and uses very little VRAM. Is this the correct speed? @zRzRzRzRzRzRzR

Enchante503 commented 1 month ago

Same for me. With I2V it takes about 10 minutes on an RTX 4090, and only about 3 GB of VRAM is used. I added the following code:

pipe_image.enable_sequential_cpu_offload()
pipe_image.vae.enable_tiling()

It takes time, but since there is plenty of VRAM available, it seems performance could be pushed further by increasing the resolution and length. Please continue with the development. Also, would it be difficult to generate a preview during inference?

If generation takes a long time, it is a problem that you cannot predict the result until the video is complete. It would be good to be able to see intermediate results, even at a low resolution and frame rate.

zRzRzRzRzRzRzR commented 1 month ago

For a 4090, you can completely remove

pipe_image.enable_sequential_cpu_offload()

and just use pipe.to("cuda"); that should work. Currently, there is indeed no way to visualize intermediate results.
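A sketch of that setup for a card like the 4090, based on the advice above (sequential CPU offload removed, weights kept resident on the GPU):

import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
# Keep all weights on the GPU; sequential offload saves VRAM but makes
# every denoising step wait on CPU-to-GPU transfers.
pipe_image.to("cuda")
# VAE tiling only affects the decode step and is cheap to keep enabled.
pipe_image.vae.enable_tiling()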

Neethan54 commented 1 month ago

@zRzRzRzRzRzRzR

I'm using the torch and CUDA versions below; is this correct?

CUDA 12.1

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

zRzRzRzRzRzRzR commented 1 month ago

This should be fine; PyTorch 2.4.0 ships builds compiled against CUDA 12.1.
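A quick sanity check that the CUDA build is actually active (a minimal sketch):

import torch

print(torch.__version__)          # should end in +cu121
print(torch.version.cuda)         # should print 12.1
print(torch.cuda.is_available())  # must be True, otherwise inference runs on the CPU
print(torch.cuda.get_device_name(0))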

Neethan54 commented 1 month ago

@zRzRzRzRzRzRzR
Can you please share the code which you are running in A100

zRzRzRzRzRzRzR commented 1 month ago

https://github.com/THUDM/CogVideo/blob/main/inference/cli_demo.py follow this and remove the pipe_image.enable_sequential_cpu_offload() and use pipe.to("cuda")

Neethan54 commented 1 month ago

@zRzRzRzRzRzRzR I am using the above code and as you can see it is taking 8-9 minutes for 6 seconds.

[image]

lingyu123-su commented 1 month ago

Hello! Any progress here? I have the same problem.

xijiu9 commented 1 month ago

I think the main reason is that you need to add pipe = pipe.to("cuda") when copying the code from the Colab notebook.
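A quick way to confirm where the weights actually live (a minimal sketch; if this prints "cpu" without offloading enabled, every denoising step runs on the CPU, which would explain the slowdown):

# Note: with enable_sequential_cpu_offload(), "cpu" here is expected,
# since the weights are streamed to the GPU layer by layer on demand.
print(pipe_image.device)
print(next(pipe_image.transformer.parameters()).device)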

Neethan54 commented 1 month ago

Hi @xijiu9 ,

Check this code: https://github.com/THUDM/CogVideo/issues/316#issue-2537904293.

I have added .to("cuda"), but it still takes a very long time on Windows.