chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

Creating trainer snapshots? #76

Closed MannyKayy closed 6 years ago

MannyKayy commented 7 years ago

Does chainermn currently support some method of creating and resuming from a snapshot object?

I can see the --resume argument for the parser in the example files, but chainermn is unable to create a snapshot object when the snapshot extension is called.

Thanks

iwiwi commented 7 years ago

Thank you. Someone needs to work on this issue to support the snapshot extension. ChainerMN itself has essentially no state of its own to dump, so it should be easy.

FYI: Unfortunately, Chainer's snapshot extension is known to be buggy for now. We sometimes use the snapshot_object extension instead of the snapshot extension, and resume training by loading only the model parameters dumped by snapshot_object. This approach works with the current ChainerMN.
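
A minimal sketch of this workaround, assuming a standard Chainer Trainer setup. The helper names (attach_model_snapshot, resume_model), the snapshot filename pattern, and the choice to write only from the rank-0 worker are illustrative assumptions, not part of ChainerMN. Only model parameters are restored this way; optimizer state and the iteration counter are not.

```python
def attach_model_snapshot(trainer, model, comm, trigger=(1, 'epoch')):
    """Dump only the model parameters, and only from the rank-0 worker.

    Hypothetical helper: `comm` is assumed to be a ChainerMN communicator
    (anything exposing a `rank` attribute works for this sketch).
    """
    # Imported lazily so the helper can be defined without Chainer installed.
    from chainer.training import extensions

    if comm.rank != 0:
        # Assumption: workers hold identical parameters, so one copy suffices.
        return False
    trainer.extend(
        extensions.snapshot_object(model, 'model_iter_{.updater.iteration}'),
        trigger=trigger)
    return True


def resume_model(path, model):
    """Load parameters written by snapshot_object back into `model`."""
    from chainer import serializers

    serializers.load_npz(path, model)
```

On restart, each worker would call resume_model with the path of the chosen snapshot file before training begins, assuming the file is readable from every node (for example, on a shared filesystem).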

kuenishi commented 6 years ago

With the checkpointing added in 1.1.0, I think you can do most of what you wanted to do. Also, I believe the snapshot extension is stable enough as of today.
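
A hedged sketch of how the checkpointing mentioned here is typically wired into a training script, following the usage pattern in the ChainerMN documentation (create_multi_node_checkpointer, maybe_load, then registering the checkpointer as a trainer extension). The wrapper function attach_checkpointer, the job name, and the trigger interval are assumptions for illustration; verify the calls against the installed ChainerMN version.

```python
def attach_checkpointer(trainer, optimizer, comm,
                        name='my-job', trigger=(1000, 'iteration')):
    """Register ChainerMN's multi-node checkpointer on a trainer.

    Hypothetical wrapper; `name` and `trigger` are placeholder values.
    """
    # Imported lazily so the helper can be defined without ChainerMN installed.
    import chainermn

    checkpointer = chainermn.create_multi_node_checkpointer(
        name=name, comm=comm)
    # Attempt to resume from an earlier checkpoint before training starts.
    checkpointer.maybe_load(trainer, optimizer)
    # Save checkpoints periodically while training runs.
    trainer.extend(checkpointer, trigger=trigger)
    return checkpointer
```

Calling this on every worker after the trainer is built, and rerunning the same script with the same name after a failure, is the intended way to pick training back up in this sketch.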

keisukefukuda commented 6 years ago

I'm closing this issue because the coordinated checkpointing feature of ChainerMN is stable and usable. Please feel free to reopen it or create a new one if you have any further problems with this.