NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Include format version in distopt checkpoints #1716

Closed timmoon10 closed 1 year ago

timmoon10 commented 1 year ago

https://github.com/NVIDIA/apex/pull/1704 introduced a bug where the distributed optimizer fails to load old checkpoints in the deprecated v1 format. This PR stores the format version inside the checkpoint itself. If the distributed optimizer can't find a format version when loading a checkpoint, it falls back to v1. This should help with backward compatibility if we change the format again in the future.
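For illustration, the fallback described above could look roughly like the sketch below. This is not the code in this PR or the apex API: the key name `format_version`, the version numbers, the checkpoint layout, and the helper functions are all hypothetical, and plain dicts stand in for the real optimizer state.

```python
from typing import Any, Dict

# Hypothetical key and version numbers, chosen for illustration only.
FORMAT_KEY = "format_version"
LEGACY_VERSION = 1
CURRENT_VERSION = 2


def save_state(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap the state with an explicit format version."""
    return {FORMAT_KEY: CURRENT_VERSION, "state": payload}


def detect_version(checkpoint: Dict[str, Any]) -> int:
    """Return the recorded format version, or v1 if none was recorded."""
    return checkpoint.get(FORMAT_KEY, LEGACY_VERSION)


def load_state(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch on the format version and return the stored state."""
    version = detect_version(checkpoint)
    if version == LEGACY_VERSION:
        # Assumption: legacy checkpoints keep the state at the top level.
        return dict(checkpoint)
    if version == CURRENT_VERSION:
        return checkpoint["state"]
    raise ValueError(f"Unsupported checkpoint format version: {version}")
```

Treating a missing version as v1 keeps checkpoints written before the version field existed loadable, while an unrecognized version fails loudly instead of being misread.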