NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Include format version in distopt checkpoints #1716

Closed by timmoon10 10 months ago

timmoon10 commented 11 months ago

https://github.com/NVIDIA/apex/pull/1704 introduced a bug where the distributed optimizer fails to load old checkpoints saved in the deprecated v1 format. This PR stores the format version in the checkpoint itself. If the distributed optimizer can't find a format version when loading a checkpoint, it falls back to v1. This should help with backward compatibility if we change the format again in the future.
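For readers unfamiliar with the pattern, here is a minimal sketch of version-tagged checkpoints with a v1 fallback. It is not the code in this PR: the key name `format_version`, the constants, and the helper functions are all hypothetical, and the "loaders" just report which path was taken.

```python
from typing import Any, Dict, Tuple

# Hypothetical names for illustration only; the real distributed optimizer
# may use a different key and a different layout.
FORMAT_KEY = "format_version"
LEGACY_FORMAT = 1   # checkpoints written before the version tag existed
CURRENT_FORMAT = 2


def make_checkpoint(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload tagged with the current format version."""
    checkpoint = dict(payload)
    checkpoint[FORMAT_KEY] = CURRENT_FORMAT
    return checkpoint


def detect_format(checkpoint: Dict[str, Any]) -> int:
    """Read the format version, assuming v1 when the tag is missing."""
    return checkpoint.get(FORMAT_KEY, LEGACY_FORMAT)


def load_checkpoint(checkpoint: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Dispatch on the detected version and return (loader name, payload)."""
    version = detect_format(checkpoint)
    # Strip the tag so the loaders only see optimizer data.
    payload = {k: v for k, v in checkpoint.items() if k != FORMAT_KEY}
    if version == LEGACY_FORMAT:
        return "v1", payload
    if version == CURRENT_FORMAT:
        return "v2", payload
    raise ValueError(f"Unsupported checkpoint format version: {version}")


if __name__ == "__main__":
    old = {"state": {"step": 10}}                    # untagged, treated as v1
    new = make_checkpoint({"state": {"step": 10}})   # tagged as v2
    print(load_checkpoint(old)[0])
    print(load_checkpoint(new)[0])
```

Defaulting to the oldest format when the tag is absent keeps every checkpoint written before the tag existed loadable, and raising on an unknown version makes a checkpoint from a newer release fail loudly rather than being misread.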