Open Olivia-MM7 opened 9 months ago
Hi, in the paper the authors report training for 700,000 iterations with a batch size of 16 for the latent diffusion stage: "For latent diffusion training, we train models from scratch using the same optimizer but with a learning rate of 1e-6 and a batch size of 16 for 700,000 iterations." Given that one epoch at batch size 16 is 2,115 iterations, you need 700,000 / 2,115 ≈ 331 epochs to reproduce the authors' results. For comparison, on my hardware (2× V100 GPUs) one epoch takes on average 34 minutes, so 331 epochs is roughly 8 days. Hope this helps.
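A quick back-of-the-envelope sketch of that arithmetic (the 2,115 iterations per epoch and 34 min per epoch figures are taken from my setup above and will differ on other hardware or dataset sizes):

```python
# Estimate epochs and wall-clock time needed to match the paper's schedule.
total_iterations = 700_000   # from the paper
iters_per_epoch = 2_115      # observed with batch size 16 (dataset-dependent)
minutes_per_epoch = 34       # measured on 2x V100 GPUs (hardware-dependent)

epochs = total_iterations / iters_per_epoch       # ~331 epochs
days = epochs * minutes_per_epoch / (60 * 24)     # ~7.8 days

print(f"epochs: {epochs:.0f}, estimated wall-clock: {days:.1f} days")
```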
I have been training the model for more than a day, and it has reached the 27th epoch without stopping. Does training stop automatically when using trainer.py? If it doesn't stop automatically, how many epochs should I train for? Thank you!