Open fuchao01 opened 1 month ago
We collected approximately 1.2 million high-quality samples to train CogVideoX-Fun. During training, we resized the videos to match different token lengths. The training process is divided into three phases, each corresponding to one token length: 13312 (for 512x512x49 videos), 29952 (for 768x768x49 videos), and 53248 (for 1024x1024x49 videos).
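For readers wondering where the three token lengths come from, here is a minimal sketch of the arithmetic. It assumes the usual CogVideoX setup, which is not stated in this thread: a VAE with 8x spatial and 4x temporal compression (the first frame kept on its own) and a transformer patch size of 2. The function name `token_length` and its parameters are hypothetical, chosen only for illustration.

```python
def token_length(width, height, frames,
                 spatial_factor=8, temporal_factor=4, patch_size=2):
    """Estimate the transformer sequence length for one video.

    Assumed, not confirmed in this thread: 8x spatial / 4x temporal VAE
    compression with the first frame encoded separately, and 2x2 patches.
    """
    # 49 frames -> (49 - 1) // 4 + 1 = 13 latent frames
    latent_frames = (frames - 1) // temporal_factor + 1
    # Each side shrinks by the VAE factor, then again by the patch size
    patches_h = height // spatial_factor // patch_size
    patches_w = width // spatial_factor // patch_size
    return latent_frames * patches_h * patches_w


for side in (512, 768, 1024):
    print(f"{side}x{side}x49 -> {token_length(side, side, 49)} tokens")
```

Under these assumptions the three resolutions give 13312, 29952, and 53248 tokens, matching the phase names above.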
Taking CogVideoX-Fun-2B as an example: In the 13312 phase, the batch size is 128 with 7k training steps. In the 29952 phase, the batch size is 256 with 6.5k training steps. In the 53248 phase, the batch size is 128 with 5k training steps.
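As a rough way to read these numbers, the sketch below multiplies batch size by step count to estimate how many samples each 2B phase processes. It assumes the quoted batch size is the global batch and that every step consumes exactly one batch (no gradient accumulation on top), neither of which is confirmed here. The `PHASES` table and `samples_seen` helper are hypothetical names.

```python
# (token length, batch size, training steps) for CogVideoX-Fun-2B, as quoted above
PHASES = [
    (13312, 128, 7000),
    (29952, 256, 6500),
    (53248, 128, 5000),
]


def samples_seen(batch_size, steps):
    """Samples processed in a phase, assuming one global batch per step."""
    return batch_size * steps


total = 0
for tokens, batch_size, steps in PHASES:
    seen = samples_seen(batch_size, steps)
    total += seen
    print(f"{tokens}-token phase: {seen:,} samples")

print(f"all phases: {total:,} samples")
```

With those assumptions, the phases cover 896,000, 1,664,000, and 640,000 samples, for 3,200,000 in total, which is a little under three passes over a dataset of about 1.2 million.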
Excellent work! Could you please share some details about the training, including how much training data was used?