luyvlei closed this issue 1 month ago
This looks like more than a drop in quality: the generated results themselves are broken. I haven't run into a case where the generated output collapses outright like this.
Can I see your training parameters?
Thank you for your response. I conducted the training with a batch size of 8 and a learning rate of 1e-5 for 12,000 steps. Additionally, I made modifications to the code and fine-tuned the model based on Cogvideo5B-I2V.
You'd better truncate the "conv in" module of the I2V model to its first 16 channels, because the subsequent channels carry I2V information.
The first 16 channels are used for t2v.
Excuse me, does "truncate" here mean truncating the gradients? The default number of input channels for I2V is 32, which I expanded to 48 for the control latents, following your code.
Sorry, the truncation I mentioned does not apply to the Cogvideo5B-I2V model. In my case, I take the first 16 of the 33 input channels from CogVideoX-Fun-5B-InP. These 16 channels can be used for text-to-video generation and do not contain image-to-video information. Training of the Pose model starts from there.
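Taking the first 16 of the 33 input channels, as described above, amounts to slicing the "conv in" weight along its input-channel axis. Below is a minimal sketch of that idea, not the actual CogVideoX-Fun code. It assumes the weight has the usual convolution layout `(out_channels, in_channels, kH, kW)`, uses a NumPy array as a stand-in for the real tensor, and assumes the extra control channels are zero-initialized. The helper name `slice_and_expand_conv_in`, the demo shapes, and the target of 32 channels are all hypothetical.

```python
import numpy as np


def slice_and_expand_conv_in(weight, keep_in=16, new_in=32):
    """Keep the first `keep_in` input channels and zero-pad up to `new_in`.

    weight: array of shape (out_channels, in_channels, kH, kW).
    The kept channels are copied unchanged; the added channels start at zero,
    so at initialization the layer ignores the new (control) inputs.
    """
    out_ch, in_ch, k_h, k_w = weight.shape
    if keep_in > in_ch:
        raise ValueError("keep_in exceeds the number of input channels")
    if new_in < keep_in:
        raise ValueError("new_in must be at least keep_in")
    new_weight = np.zeros((out_ch, new_in, k_h, k_w), dtype=weight.dtype)
    new_weight[:, :keep_in] = weight[:, :keep_in]
    return new_weight


# Hypothetical stand-in for a 33-channel InP "conv in" weight
# (tiny out_channels to keep the demo small).
rng = np.random.default_rng(0)
inp_weight = rng.standard_normal((8, 33, 2, 2)).astype(np.float32)

pose_weight = slice_and_expand_conv_in(inp_weight, keep_in=16, new_in=32)
print(pose_weight.shape)
```

With PyTorch tensors the same indexing (`weight[:, :16]`) applies. Zero-initializing the new channels is a common choice for added conditioning inputs, but it is an assumption here rather than something stated in this thread.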
@bubbliiiing Thanks for your reply. I found that I was concatenating the reference image latents to all frames instead of only the first frame, which caused this problem.
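The fix described above can be sketched as follows, assuming latents laid out as `(batch, frames, channels, height, width)`, with the image condition concatenated to the noisy latents along the channel axis. NumPy arrays stand in for real tensors, and the helper name `build_image_condition` and all shapes are hypothetical.

```python
import numpy as np


def build_image_condition(ref_latent, num_frames):
    """Place the reference latent on frame 0 only; all later frames stay zero.

    ref_latent: array of shape (B, 1, C, H, W).
    Returns an array of shape (B, num_frames, C, H, W).
    """
    b, f, c, h, w = ref_latent.shape
    if f != 1:
        raise ValueError("ref_latent must contain exactly one frame")
    cond = np.zeros((b, num_frames, c, h, w), dtype=ref_latent.dtype)
    cond[:, :1] = ref_latent
    return cond


# Hypothetical shapes: 13 latent frames, 16 latent channels, 4x4 spatial size.
noisy = np.ones((1, 13, 16, 4, 4), dtype=np.float32)
ref = np.full((1, 1, 16, 4, 4), 2.0, dtype=np.float32)

cond = build_image_condition(ref, num_frames=noisy.shape[1])
# Concatenate along the channel axis (axis=2 in this layout).
model_input = np.concatenate([noisy, cond], axis=2)
print(model_input.shape)
```

The mistake reported above corresponds to repeating `ref` across every frame (for example with `np.repeat(ref, 13, axis=1)`) instead of zero-padding the frames after the first.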
Hello, I have also been attempting pose control experiments based on CogVideoX recently. My approach is similar to yours: I add extra input channels and concatenate the VAE-encoded pose image along the channel dimension. However, in my experiments, I've found that after training for 8000 steps (with a batch size of 8), the image quality deteriorates significantly. Have you encountered similar issues during training? Could this be due to insufficient training? @bubbliiiing
https://github.com/user-attachments/assets/f5cf47c3-83a9-4049-94e3-6ebebf6012c1