aigc-apps / CogVideoX-Fun

📹 A more flexible CogVideoX that can generate videos at any resolution and creates videos from images.
Apache License 2.0

Seeking advice on pose control #32

Closed luyvlei closed 1 month ago

luyvlei commented 1 month ago

Hello, I have also been attempting pose control experiments based on cogvideox recently. My approach is similar to yours: I concatenate the VAE-compressed pose latents to the input as additional channels. However, in my experiments, I've found that after training for 8,000 steps (with a batch size of 8), the image quality deteriorates significantly. Have you encountered similar issues during training? Could this be due to insufficient training? @bubbliiiing

https://github.com/user-attachments/assets/f5cf47c3-83a9-4049-94e3-6ebebf6012c1
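The channel-concatenation approach described above can be sketched as widening the input projection and zero-initializing the new channels, so the pose branch starts as a no-op. This is a minimal illustration, not the actual CogVideoX-Fun code; `expand_conv_in` and the channel counts (32 pretrained + 16 pose) are assumptions taken from this thread.

```python
import torch
import torch.nn as nn

def expand_conv_in(conv: nn.Conv2d, extra_in: int) -> nn.Conv2d:
    """Widen a pretrained input projection to accept extra latent channels.

    The new channels are zero-initialized, so the expanded layer initially
    behaves exactly like the pretrained one (the pose branch starts as a no-op).
    """
    new = nn.Conv2d(
        conv.in_channels + extra_in,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, : conv.in_channels] = conv.weight  # keep pretrained weights
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

# Toy example: 32 pretrained input channels + 16 pose-latent channels -> 48.
conv = nn.Conv2d(32, 128, kernel_size=2, stride=2)
wide = expand_conv_in(conv, extra_in=16)
x = torch.randn(1, 32, 16, 16)
pose = torch.randn(1, 16, 16, 16)
out_old = conv(x)
out_new = wide(torch.cat([x, pose], dim=1))
assert torch.allclose(out_old, out_new, atol=1e-5)  # zero-init => same output
```

Zero-initializing the added input channels is a common trick (e.g. in ControlNet-style fine-tuning) so that training starts from the pretrained model's behavior rather than from noise.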

bubbliiiing commented 1 month ago

This looks like more than a decline in quality; the generated results have collapsed outright. I haven't run into a case where generation breaks down like this.

Can I see your training parameters?

luyvlei commented 1 month ago

Thank you for your response. I conducted the training with a batch size of 8 and a learning rate of 1e-5 for 12,000 steps. Additionally, I made modifications to the code and fine-tuned the model based on Cogvideo5B-I2V.

bubbliiiing commented 1 month ago

You'd better truncate the "conv_in" module of the I2V model to its first 16 input channels, because the subsequent channels carry I2V conditioning information.

bubbliiiing commented 1 month ago

The first 16 channels are used for t2v.

luyvlei commented 1 month ago

Excuse me, does "truncate" here mean truncating the gradients? The default input channel count for I2V is 32, which I expanded to 48 for the control latents, following your code.

bubbliiiing commented 1 month ago

Sorry, the truncation I mentioned does not apply to the Cogvideo5B-I2V model. What I do is take the first 16 of the 33 input channels from CogVideoX-Fun-5B-InP; those 16 channels are used for text-to-video and contain no image-to-video information. Start training the pose model from there.
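The truncation described here amounts to slicing the pretrained conv_in weight along its input-channel dimension. A minimal sketch, assuming a Conv2d-style conv_in; `truncate_conv_in` is an illustrative helper, and the 33/16 split comes from this thread, not from the actual repository code.

```python
import torch
import torch.nn as nn

def truncate_conv_in(conv: nn.Conv2d, keep_in: int = 16) -> nn.Conv2d:
    """Keep only the first `keep_in` input channels of a pretrained conv_in.

    Slicing dim=1 of the weight drops the channels that encode image-to-video
    conditioning, leaving the text-to-video part of the projection intact.
    """
    new = nn.Conv2d(
        keep_in,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new.weight.copy_(conv.weight[:, :keep_in])
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

# Toy example: a 33-channel conv_in reduced to its first 16 (T2V) channels.
conv33 = nn.Conv2d(33, 128, kernel_size=2, stride=2)
conv16 = truncate_conv_in(conv33, keep_in=16)
print(conv16.weight.shape)  # torch.Size([128, 16, 2, 2])
```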

luyvlei commented 1 month ago

@bubbliiiing Thanks for your reply. I found that I was concatenating the reference image latents to all frames instead of only the first frame, which caused this phenomenon.
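The bug and the fix described above can be sketched with toy tensors: the buggy version broadcasts the reference image latents into every frame's conditioning channels, while the fixed version places them only in the first frame and zero-pads the rest. Shapes and names are illustrative assumptions, not the actual training code.

```python
import torch

# video_latents: [B, C, F, H, W]; ref_latents: [B, C, 1, H, W] (reference image).
B, C, F, H, W = 1, 16, 9, 8, 8
video_latents = torch.randn(B, C, F, H, W)
ref_latents = torch.randn(B, C, 1, H, W)

# Buggy: broadcast the reference image into every frame's conditioning channels.
cond_all = ref_latents.expand(-1, -1, F, -1, -1)

# Fixed: condition only the first frame; subsequent frames get zeros.
cond_first = torch.cat([ref_latents, torch.zeros(B, C, F - 1, H, W)], dim=2)

model_input_bad = torch.cat([video_latents, cond_all], dim=1)
model_input_ok = torch.cat([video_latents, cond_first], dim=1)
print(model_input_ok.shape)  # torch.Size([1, 32, 9, 8, 8])
```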