THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

I2V model running in fp16 produces only noise #311

Status: Closed (realisticdreamer114514 closed this issue 2 months ago)

realisticdreamer114514 commented 2 months ago

System Info

diffusers 0.30.3 with CUDA 12.4 and Python 3.11, in a conda environment on Windows

Information

Reproduction

  1. In cli_demo.py, add these optimizations:
    pipe.enable_model_cpu_offload()
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    and reduce the number of frames to 41 (for a 5-second video).

  2. Run cli_demo.py:
    python cli_demo.py --prompt "A female basketball player is standing in a basketball court, her body leaning forward towards the camera. She is wearing a vibrant blue basketball jersey with the number 3 prominently displayed. The person's head is tilted back, and her hands are clasped together in front of her legs. She is reaching for the viewpoint and waving at it closely. The court beneath her is a rich brown color with green wall in the background and a basketball hoop stands against the wall." --model_path "D:\CogVideoX-5b-I2V" --generate_type "i2v" --output_path ./output.mp4 --image_or_video_path "D:\test\process\2.png" --dtype float16
  3. The output video contains only colored noise (screenshot: vlcsnap-2024-09-20).
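
For reference, the memory optimizations from step 1 can be grouped into one helper. This is only a sketch: the function name `apply_memory_optimizations` and its `sequential` flag are hypothetical, and the only pipeline methods it calls are the four listed in step 1. It assumes that the two CPU offload modes are alternative strategies and that a script would normally pick one, whereas step 1 enables both. Nothing here suggests the offload calls cause the noise.

```python
def apply_memory_optimizations(pipe, sequential=True):
    """Apply the memory-saving calls from step 1 to a diffusers pipeline.

    Hypothetical helper, not part of cli_demo.py. `pipe` is assumed to be
    an already loaded CogVideoX pipeline object.
    """
    if sequential:
        # Lowest VRAM use, slowest: offloads at the submodule level.
        pipe.enable_sequential_cpu_offload()
    else:
        # Faster, uses more VRAM: offloads one whole component at a time.
        pipe.enable_model_cpu_offload()

    # Decode the latents in slices and tiles to cap peak VAE memory.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```

In cli_demo.py this would be called once, right after the pipeline is loaded and before generation.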

Expected behavior

A normal video is produced

kijai commented 2 months ago

The I2V model will generally do exactly that if anything but 49 frames is used.

realisticdreamer114514 commented 2 months ago

> The I2V model will generally do exactly that if anything but 49 frames is used.

That was indeed the cause. Thanks for pointing it out.
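
For anyone landing here later, the fix amounts to keeping the I2V frame count at 49. Below is a minimal sketch of a guard that could run before the pipeline is called. The names `check_i2v_num_frames` and `clip_seconds` are hypothetical, the value 49 comes only from the comment above (it is not read from the model config), and the 8 fps default is an assumption about the export frame rate.

```python
# Frame count reported in this thread as the only one that works for I2V.
# Taken from the discussion, not from the model configuration.
I2V_EXPECTED_FRAMES = 49


def check_i2v_num_frames(num_frames, expected=I2V_EXPECTED_FRAMES):
    """Return num_frames if it matches the expected count, else raise."""
    if num_frames != expected:
        raise ValueError(
            f"I2V was reported to produce noise unless num_frames == {expected}; "
            f"got {num_frames}."
        )
    return num_frames


def clip_seconds(num_frames, fps=8):
    """Approximate clip length in seconds; fps=8 is an assumed default."""
    return num_frames / fps


if __name__ == "__main__":
    for n in (41, 49):
        try:
            check_i2v_num_frames(n)
            print(f"{n} frames: ok, about {clip_seconds(n):.2f} s at 8 fps")
        except ValueError as err:
            print(f"{n} frames: rejected ({err})")
```

With these assumptions, 41 frames is rejected and 49 frames passes, so a shorter clip would need to be trimmed after generation rather than requested from the model.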