Closed MannyKayy closed 6 years ago
Thank you. Somebody needs to work on this issue to support snapshot extensions. Essentially, ChainerMN has nothing to dump, so it would be easy.
FYI: Unfortunately, Chainer's snapshot extension is known to be buggy for now. We sometimes use the snapshot_object extension instead of the snapshot extension, and resume training by just loading the model parameters dumped by snapshot_object. This approach is feasible with current ChainerMN.
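A minimal sketch of that workaround, in case it helps. The filename template, the helper names (`latest_model_snapshot`, `register_model_snapshot`, `resume_model`), and the choice to write only on rank 0 are assumptions for illustration, not something ChainerMN prescribes. It also assumes the output directory is visible to every process.

```python
import os
import re

# Hypothetical naming scheme; it mirrors the filename template used below.
MODEL_SNAPSHOT_PATTERN = re.compile(r"^model_iter_(\d+)$")


def latest_model_snapshot(out_dir):
    """Return the path of the dump with the highest iteration, or None."""
    if not os.path.isdir(out_dir):
        return None
    best_iter, best_path = -1, None
    for name in os.listdir(out_dir):
        match = MODEL_SNAPSHOT_PATTERN.match(name)
        # Compare iterations numerically, not as strings.
        if match and int(match.group(1)) > best_iter:
            best_iter = int(match.group(1))
            best_path = os.path.join(out_dir, name)
    return best_path


def register_model_snapshot(trainer, model, comm, interval=1000):
    """Dump only the model parameters, and only from rank 0 (assumption)."""
    # Imported lazily so the pure-Python helper above works without Chainer.
    from chainer.training import extensions

    if comm.rank == 0:
        trainer.extend(
            extensions.snapshot_object(model, "model_iter_{.updater.iteration}"),
            trigger=(interval, "iteration"),
        )


def resume_model(model, out_dir):
    """Load the latest dumped parameters into model; True if a dump was found."""
    import chainer

    path = latest_model_snapshot(out_dir)
    if path is None:
        return False
    chainer.serializers.load_npz(path, model)
    return True
```

Each process would call `resume_model(model, args.out)` before `trainer.run()`. Note that this restores the model parameters only: the optimizer state and the iteration counter start from scratch, which is the trade-off of this approach compared with a full trainer snapshot.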
With the checkpointing in 1.1.0, I think you can do most of what you wanted to do. Also, I believe the snapshot extension is stable enough as of today.
I'm closing the issue as the coordinated checkpointing feature of ChainerMN is stable and usable. Please feel free to reopen or create a new one if you have any further issues on this.
Does chainermn currently support some method of creating and resuming from a snapshot object?
I can see the --resume argument for the parser in the example files, but chainermn is unable to create a snapshot object when the snapshot extension is called. Thanks