THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Using latents in CogVideoXPipeline #250

Open · loretoparisi opened this issue 1 week ago

loretoparisi commented 1 week ago

Feature request

I'm trying to pass latents

latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

to the pipeline, where the latents come from frames encoded with the VAE encoder, for example:

# Encode the input frames to VAE latents (helper from the VAE test, see below)
encoded_frames = encode_video(model_path, image_path, dtype, device)

# Pass the encoded latents to the pipeline via the latents argument
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=num_videos_per_prompt,
    num_inference_steps=num_inference_steps,
    num_frames=num_frames,
    use_dynamic_cfg=True,
    guidance_scale=guidance_scale,
    output_type=output_type,
    generator=torch.Generator(device=device).manual_seed(seed),
    latents=encoded_frames,
)

but I'm getting a dimensionality error:

Given groups=1, weight of size [3072, 16, 2, 2], expected input[32, 2, 80, 80] to have 16 channels, but got 2 channels instead
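
For reference, encode_video is not defined in the snippet above; it is presumably the helper tested in issue #249. A minimal sketch of what such a helper might look like, assuming the diffusers AutoencoderKLCogVideoX and a hypothetical load_frames preprocessor (note that the VAE produces latents in [B, C, F, H, W] layout):

import torch
from diffusers import AutoencoderKLCogVideoX

def encode_video(model_path, image_path, dtype, device):
    # Load only the VAE sub-model from the CogVideoX checkpoint.
    vae = AutoencoderKLCogVideoX.from_pretrained(
        model_path, subfolder="vae", torch_dtype=dtype
    ).to(device)
    # load_frames is a hypothetical loader returning a video tensor of
    # shape [B, C, F, H, W], normalized to [-1, 1].
    frames = load_frames(image_path).to(device, dtype)
    with torch.no_grad():
        # encode() returns a latent distribution; sampling it yields
        # latents with 16 channels in [B, C, F, H, W] order.
        return vae.encode(frames).latent_dist.sample()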

Motivation

Add support for the latents parameter in CogVideoXPipeline.

Your contribution

Tested VAE image encoding/decoding in https://github.com/THUDM/CogVideo/issues/249.

zRzRzRzRzRzRzR commented 1 week ago

Because this model does not exist yet, we implemented it in text-to-video form, with 16 channels. In the future, we will support input with 32 channels.

a-r-r-o-w commented 1 week ago

The CogVideoX transformer in diffusers expects latents in the shape [B, F, C, H, W]. The latents parameter is already supported, and I've tested that it works.
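
The reported error is consistent with that layout mismatch: latents kept in the VAE's [B, C, F, H, W] order (e.g. [2, 16, 2, 80, 80] after classifier-free guidance duplication) get flattened batch-first by the patch embedding into [32, 2, 80, 80], so the 2-frame axis is read as the channel axis. A minimal sketch of reordering the latents before calling the pipeline, assuming pipe is a loaded CogVideoXPipeline and frames is a preprocessed [B, C, F, H, W] tensor in [-1, 1]; the scaling step mirrors the inverse scaling applied at decode time and is an assumption here:

import torch

with torch.no_grad():
    # AutoencoderKLCogVideoX returns latents as [B, C, F, H, W] with C == 16.
    latents = pipe.vae.encode(
        frames.to(pipe.vae.device, pipe.vae.dtype)
    ).latent_dist.sample()
    # Assumed scaling, mirroring how the pipeline un-scales before decoding.
    latents = latents * pipe.vae.config.scaling_factor

# The transformer consumes [B, F, C, H, W]: swap the channel and frame axes.
latents = latents.permute(0, 2, 1, 3, 4)

video = pipe(prompt=prompt, latents=latents).frames[0]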