PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.
MIT License
11.25k stars, 1k forks

Training Loss Anomaly Issue #406

Open RanXu2000 opened 3 weeks ago

RanXu2000 commented 3 weeks ago

Hello. Using your open-source v1.1.0 code, I retrained the first stage on about 3.5M samples and then trained the second stage on your 4M-sample (400w) dataset, still keeping the resolution at 65x512x512 in this stage. During training, however, the loss became anomalous and the visualized samples collapsed. Did this happen during your own training, and what might be the cause?

[image]

LinB203 commented 2 weeks ago

Training is unstable. You can increase the batch size (bs) or reduce the learning rate (lr). When the loss suddenly spikes, interrupt training and resume from the most recent checkpoint that was saved while the loss was still normal.
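
The "interrupt and resume" advice can be automated to some degree. The following is only a minimal sketch, not code from Open-Sora-Plan: `LossSpikeDetector`, `last_good_checkpoint`, and the thresholds (`window`, `factor`, `warmup`) are hypothetical names and values chosen for illustration. It flags a step whose loss is non-finite or far above the recent median, then picks the latest checkpoint saved before that step.

```python
import math
from collections import deque
from statistics import median


class LossSpikeDetector:
    """Hypothetical helper: flags a loss far above the recent median loss."""

    def __init__(self, window=100, factor=3.0, warmup=10):
        self.history = deque(maxlen=window)  # recent non-anomalous losses
        self.factor = factor                 # spike threshold multiplier (assumed value)
        self.warmup = warmup                 # losses to collect before judging spikes

    def is_spike(self, loss):
        # NaN or infinite loss is always treated as an anomaly.
        if not math.isfinite(loss):
            return True
        # Once enough history exists, compare against the recent median.
        if len(self.history) >= self.warmup and loss > self.factor * median(self.history):
            return True
        # Only normal losses enter the history, so spikes do not shift the baseline.
        self.history.append(loss)
        return False


def last_good_checkpoint(checkpoint_steps, spike_step):
    """Return the latest checkpoint step strictly before the spike, or None."""
    earlier = [s for s in checkpoint_steps if s < spike_step]
    return max(earlier) if earlier else None
```

In a training loop, one would call `is_spike(loss.item())` every step, stop when it returns `True`, and restart from `last_good_checkpoint(saved_steps, current_step)`, possibly with a lower learning rate or larger batch size as suggested above. A checkpoint saved shortly before the spike may already be degraded, so it can be worth stepping back one more checkpoint if the spike recurs.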