AvisP opened this issue 2 days ago
Please read the README carefully. I'm not sure whether you changed your frame count to 81, and I noticed that you only used 20 steps instead of the official 50, which has a significant impact on the results. A resolution of 720 × 480 is acceptable.
720 × 480 is suitable for CogVideoX1.5-5B-I2V; for T2V, it should be 1360 × 768.
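As a minimal sketch of how that maps onto the pipeline call (the table just restates the comment above as a rule of thumb; the variable names are illustrative, not an official spec):

# Rule of thumb from the comment above: (height, width) per model variant
RESOLUTIONS = {
    "CogVideoX1.5-5B-I2V": (480, 720),   # image-to-video
    "CogVideoX1.5-5B": (768, 1360),      # text-to-video
}
height, width = RESOLUTIONS["CogVideoX1.5-5B-I2V"]  # pass these as the pipeline's height/width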
Thanks for your comments. So I kept the height at 480 and the width at 720, and this is what I got.
Guidance scale: 6.0, inference steps: 50
https://github.com/user-attachments/assets/6c657695-9a6b-461d-9336-248daa48ef93
Guidance scale: 6.0, inference steps: 20
https://github.com/user-attachments/assets/a9d115e5-34eb-4468-964c-ad472a15eaab
It seems to roughly follow the prompt, changing the camera angle and zooming in, but I don't see a noticeable improvement with more inference steps. If I am missing something, please let me know. The code is below for reference:
import torch
from pathlib import Path
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video

# MODEL_PATH, prompt, image, and output_path are defined earlier in my script
pipe = CogVideoXImageToVideoPipeline.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
use_dynamic_cfg_flag = True

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video_generate = pipe(
    height=480,
    width=720,
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,  # Number of videos to generate per prompt
    num_inference_steps=50,  # Number of inference steps
    num_frames=81,  # Number of frames to generate
    use_dynamic_cfg=use_dynamic_cfg_flag,  # Used with the DPM scheduler; for the DDIM scheduler it should be False
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(42),  # Set the seed for reproducibility
).frames[0]

output_file_path = output_path / 'output.mp4'
export_to_video(video_generate, output_file_path, fps=8)
Change the frame rate from fps=8 to fps=16 in the export_to_video function.
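For example, keeping the same variables as in the snippet above (16 fps matches CogVideoX1.5's native frame rate, if I read the README correctly):

export_to_video(video_generate, output_file_path, fps=16)  # play back at 16 fps instead of 8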
Nevertheless, in this scenario and with this prompt, the result is indeed not ideal, and this issue is recorded as a bad case. Does the problem occur with other prompts as well?
I was wondering if there is any special way to improve the quality of videos generated from static images. I followed the suggestion in the example to refine the prompt with an LLM. For the static image shown below, I provided a text prompt that I got from ChatGPT: "A dense canopy of vibrant green leaves gently sways in the breeze, their edges fluttering and rustling softly. The wind moves through the branches, causing the leaves to tremble and shimmer with a life of their own. As the wind picks up, the rustling grows louder, filling the air with a natural melody. The sunlight filters through the leaves, casting playful shadows on the ground below, while the occasional gust of wind causes a few leaves to break free, twirling and spiraling before landing softly."
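For anyone reproducing this step, here is a minimal sketch of that kind of prompt refinement, assuming an OpenAI-compatible client and a hypothetical system instruction (this is not the repo's official conversion script, just an illustration):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical instruction: expand a short description into the detailed,
# motion-focused prompt style that video models respond to
system = (
    "Rewrite the user's short image description as a detailed video prompt. "
    "Describe motion, camera behavior, and lighting in roughly 100 words."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model works here
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "A dense canopy of green leaves in the wind."},
    ],
)
refined_prompt = response.choices[0].message.content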
The resulting video I got with num_inference_steps at 20, guidance_scale at 6, and num_frames at 49 is the following:
https://github.com/user-attachments/assets/c11d12bf-339d-4f53-88ac-93fb0d4f136e
Can you provide some suggestions on how to improve the quality of the video? I can upscale the video to improve the resolution, but what I am interested in is improving the dynamics of the motion. Thanks!