aigc-apps / CogVideoX-Fun

📹 A more flexible CogVideoX that can generate videos at any resolution and creates videos from images.
Apache License 2.0

Seeking advice on pose control #32

Open luyvlei opened 4 days ago

luyvlei commented 4 days ago

Hello, I have also been running pose-control experiments based on CogVideoX recently. My approach is similar to yours: I encode the pose image with the VAE and concatenate the resulting latents to the model input as additional channels. However, after training for 8000 steps (with a batch size of 8), the image quality deteriorates significantly. Have you encountered similar issues during training? Could this be due to insufficient training? @bubbliiiing
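For readers following along, the channel-wise conditioning described above can be sketched roughly as follows. This is only an illustration, not code from either poster: the `(batch, frames, channels, height, width)` layout, the tensor sizes, and the helper name `build_model_input` are all assumptions.

```python
import torch

# Hypothetical latent sizes; the (batch, frames, channels, height, width)
# layout is an assumption made for this sketch.
batch, frames, height, width = 1, 4, 8, 8
noisy_latents = torch.randn(batch, frames, 16, height, width)  # denoising target
image_latents = torch.randn(batch, frames, 16, height, width)  # I2V condition
pose_latents = torch.randn(batch, frames, 16, height, width)   # VAE-encoded pose frames


def build_model_input(noisy, image, pose):
    """Stack the three 16-channel latents along the channel axis (dim=2)."""
    return torch.cat([noisy, image, pose], dim=2)


model_input = build_model_input(noisy_latents, image_latents, pose_latents)
print(model_input.shape)  # torch.Size([1, 4, 48, 8, 8])
```

With this layout the first 16 channels are the noisy latents, the next 16 the image condition, and the last 16 the pose condition, which gives the 48 input channels mentioned later in the thread.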

https://github.com/user-attachments/assets/f5cf47c3-83a9-4049-94e3-6ebebf6012c1

bubbliiiing commented 1 day ago

This looks like more than a decline in quality: the generated results are simply broken. I haven't run into a case where generation collapsed outright like this.

Can I see your training parameters?

luyvlei commented 1 day ago

Thank you for your response. I trained with a batch size of 8 and a learning rate of 1e-5 for 12,000 steps. I also modified the code and fine-tuned the model starting from CogVideoX-5B-I2V.

bubbliiiing commented 1 day ago

It would be better to truncate the "conv in" module of the I2V model to its first 16 channels, because the subsequent channels carry information from I2V.

bubbliiiing commented 1 day ago

The first 16 channels are used for T2V.
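One possible reading of this advice is to keep only the weights for the first 16 input channels of the input convolution and drop the rest. The sketch below shows that weight slicing on a stand-in `nn.Conv2d`; the layer sizes and the helper `truncate_in_channels` are hypothetical, and the thread does not confirm that this is exactly what was meant.

```python
import torch
from torch import nn

# Stand-in for the I2V "conv in" layer: 32 input channels (hypothetical small sizes).
embed_dim, patch = 64, 2
i2v_conv_in = nn.Conv2d(32, embed_dim, kernel_size=patch, stride=patch)


def truncate_in_channels(conv, keep):
    """Return a copy of `conv` that only accepts the first `keep` input channels."""
    new_conv = nn.Conv2d(
        keep,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        # Conv2d weights have shape (out_channels, in_channels, kH, kW).
        new_conv.weight.copy_(conv.weight[:, :keep])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


t2v_conv_in = truncate_in_channels(i2v_conv_in, keep=16)
print(t2v_conv_in.weight.shape)  # torch.Size([64, 16, 2, 2])
```

The truncated layer gives the same output as the original whenever the dropped channels of the input are zero, so only the T2V part of the pretrained weights is carried over.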

luyvlei commented 1 day ago

Excuse me, does "truncate" here mean truncating the gradients? The default number of input channels for I2V is 32, which I expanded to 48 for the control latents, following your code.
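For context, expanding an input convolution from 32 to 48 channels might look roughly like the sketch below. The layer is a stand-in `nn.Conv2d` with hypothetical sizes, the helper `expand_in_channels` is illustrative, and zero-initializing the new channels is an assumption of this sketch (a common choice), not something stated in the thread.

```python
import torch
from torch import nn

# Stand-in for the I2V "conv in" layer: 32 input channels (hypothetical small sizes).
embed_dim, patch = 64, 2
i2v_conv_in = nn.Conv2d(32, embed_dim, kernel_size=patch, stride=patch)


def expand_in_channels(conv, new_in):
    """Return a wider copy of `conv`; the extra input channels start at zero."""
    old_in = conv.in_channels
    new_conv = nn.Conv2d(
        new_in,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()                          # assumption: zero-init new channels
        new_conv.weight[:, :old_in].copy_(conv.weight)   # reuse the pretrained weights
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


control_conv_in = expand_in_channels(i2v_conv_in, new_in=48)
print(control_conv_in.weight.shape)  # torch.Size([64, 48, 2, 2])
```

With zero-initialized extra channels, the expanded layer initially ignores the control latents and reproduces the pretrained layer's output, so training starts from the original model's behavior.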