THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Support Style reference, Seed image and Control parameters. #248

Open loretoparisi opened 1 month ago

loretoparisi commented 1 month ago

Feature request

Any plan to add support to style reference, seed image and control parameters?

Motivation

This feature has been speculated about in this article:

Input Processing:
- Text Input: The model accepts textual descriptions as input, likely utilizing advanced tokenization and embedding techniques to convert text into a format suitable for the neural network. This might involve a vocabulary size of 30,000 to 50,000 tokens, with each token embedded into a high-dimensional space (e.g., 768 or 1024 dimensions).
- Potential Additional Inputs: While not confirmed, the model might also accept additional inputs such as style references, seed images, or control parameters to guide the video generation process.
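For reference, the text-encoding half of this speculation matches how the released pipeline works: CogVideoX conditions generation on T5 text embeddings. A minimal sketch of that tokenize-and-embed step, using an illustrative T5 checkpoint (the model name and sizes here are examples, not confirmed CogVideoX internals):

from transformers import T5EncoderModel, T5Tokenizer

# Illustrative checkpoint: CogVideoX pairs its transformer with a T5 text
# encoder, but the vocabulary and embedding sizes quoted in the article
# are speculation.
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-base")
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-base")

inputs = tokenizer("a panda riding a bicycle", return_tensors="pt")
# Shape (batch, sequence_length, hidden_dim): the conditioning the denoiser sees.
embeddings = encoder(**inputs).last_hidden_state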

Other T2V and T2I models already have extensive support for style, image, and control parameters.

Your contribution

-

zRzRzRzRzRzRzR commented 1 month ago

You can set the seed, but there's no way to set the style because the training process hasn't been enhanced in this aspect. We will try to improve this in the future. For seed settings, please refer to cli_demo.

loretoparisi commented 1 month ago

> You can set the seed, but there's no way to set the style because the training process hasn't been enhanced in this aspect. We will try to improve this in the future. For seed settings, please refer to cli_demo.

Thank you. In cli_demo I only see a generator used to seed the RNG, but not a seed image:

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
    num_inference_steps=num_inference_steps,  # Number of inference steps
    num_frames=49,  # Number of frames to generate; changed to 49 for diffusers version `0.31.0` and after
    use_dynamic_cfg=True,  # Used with the DPM scheduler; for the DDIM scheduler, it should be False
    guidance_scale=guidance_scale,  # Guidance scale for classifier-free guidance; can be set to 7 for the DPM scheduler
    generator=torch.Generator().manual_seed(42),  # Set the seed for reproducibility
).frames[0]

zRzRzRzRzRzRzR commented 1 month ago

Oh, the video is indeed generated directly from noise. We didn't write any code to control this part, so the impact should be very small.
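For context, in a diffusers text-to-video pipeline the generator seed only fixes the initial noise latents that the denoising loop starts from; there is no image input at that stage. A minimal sketch of where the seed enters (the latent shape is illustrative, not a confirmed CogVideoX internal):

import torch

# The seed determines the starting noise; given the same prompt and
# scheduler settings, everything after it is deterministic.
generator = torch.Generator().manual_seed(42)

# Illustrative latent shape: (batch, latent_frames, channels, height, width).
latents = torch.randn((1, 13, 16, 60, 90), generator=generator)

# The pipeline denoises `latents` conditioned only on the text embeddings,
# which is why a "seed image" currently has no place to plug in.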

loretoparisi commented 1 month ago

Ah, correct. So in any case, you mean you would need to add an image parameter, as in the Diffusers Image2Image pipeline here: https://huggingface.co/docs/diffusers/v0.30.2/en/api/pipelines/auto_pipeline#diffusers.AutoPipelineForImage2Image
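For reference, the Image2Image pattern being proposed looks like this in diffusers (shown with a Stable Diffusion XL checkpoint and a placeholder input path, since the CogVideoX pipeline does not expose an image parameter at the time of this issue):

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("seed_image.png")  # placeholder path for the requested "seed image"
image = pipe(
    prompt="a panda riding a bicycle",
    image=init_image,  # conditions generation on an existing image
    strength=0.8,      # how far the output may drift from the input image
).images[0]

An analogous image argument on the CogVideoX pipeline would require image-conditioned training, which is exactly the enhancement the maintainer says has not been done yet.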