MegEngine / MegDiffusion

MegEngine implementation of Diffusion Models.
Apache License 2.0

Handle checkpoint saving failures #4

Open ChaiByte opened 2 years ago

ChaiByte commented 2 years ago

If training runs on a preemptible machine, the machine may be preempted at any time (or go down for other reasons). If that happens at the exact moment a checkpoint is being saved, the previously saved data will be corrupted. It is therefore reasonable to keep multiple backups locally. Because multiple backups take up disk space, it would be better to also support cloud storage, for example AWS S3.
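As a rough illustration of the local-backup idea, here is a minimal sketch that is not taken from the MegDiffusion code base. It assumes the checkpoint is a picklable Python object and uses the standard library `pickle` as a stand-in for the framework's own save call. The names `save_checkpoint_atomic`, `load_latest_checkpoint`, and `keep_last` are hypothetical. The state is first written to a temporary file and then renamed with `os.replace`, so an interruption during the write leaves the older checkpoints untouched.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_checkpoint_atomic(state, ckpt_dir, step, keep_last=3):
    """Save `state` as ckpt_<step>.pkl and keep only the newest `keep_last` files.

    Hypothetical helper: pickle stands in for the framework's save function.
    Assumes keep_last >= 1.
    """
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"ckpt_{step:08d}.pkl"

    # Write to a temporary file in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to disk
        # The complete file replaces the target in a single step; a crash
        # before this line leaves only a stray .tmp file behind.
        os.replace(tmp_name, final_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise

    # Zero-padded step numbers make lexicographic order match step order.
    checkpoints = sorted(ckpt_dir.glob("ckpt_*.pkl"))
    for old in checkpoints[:-keep_last]:
        old.unlink()
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return the newest checkpoint that can be read, or None if there is none."""
    for path in sorted(Path(ckpt_dir).glob("ckpt_*.pkl"), reverse=True):
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except Exception:
            # Unreadable or truncated file: fall back to the next older one.
            continue
    return None
```

A training loop could call `save_checkpoint_atomic(state, "checkpoints", step)` at each save interval and `load_latest_checkpoint("checkpoints")` on restart. Uploading the finished file to object storage such as S3 would be a separate step after the rename and is not covered by this sketch.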