diux-dev / cluster

train on AWS

distributed checkpoint saving #13

Closed yaroslavvb closed 6 years ago

yaroslavvb commented 6 years ago

Why is distributed checkpoint saving turned off? @bearpelican https://github.com/diux-dev/cluster/blob/ae3b12f736c7fa2b27b2e89a2953197a0940b189/pytorch/training/train_imagenet_nv.py#L426

bearpelican commented 6 years ago

@yaroslavvb Oops, that comment isn't very clear. This is actually there so that both distributed and non-distributed checkpointing work.

We only want to save the underlying model and its weights. If you save the distributed wrapper (the top-level module), you need a hack (re-initializing the distributed module) to get it loading again, and even then there's another bug that causes an out-of-memory error: https://github.com/diux-dev/cluster/commit/bec11aae91463a4169263b66194d9a65bc0f83da#diff-75be79d6640e3bb96c7683235830b933R229
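
For reference, here is a minimal sketch of that pattern, assuming plain torch.nn.parallel.DistributedDataParallel rather than the repo's actual setup; the helper names are illustrative and not taken from train_imagenet_nv.py:

```python
import torch
from torch.nn.parallel import DistributedDataParallel

def save_checkpoint(model, path):
    # If the model is wrapped in DistributedDataParallel, save the inner
    # module's weights so the checkpoint is wrapper-agnostic and can be
    # loaded on a single GPU or in a distributed run.
    to_save = model.module if hasattr(model, "module") else model
    torch.save({"state_dict": to_save.state_dict()}, path)

def load_checkpoint(model, path, distributed=False):
    # Load weights into the bare (unwrapped) model first...
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])
    # ...then re-wrap for distributed training instead of trying to restore
    # a saved DistributedDataParallel instance directly. Assumes
    # torch.distributed.init_process_group() has already been called.
    if distributed:
        model = DistributedDataParallel(model)
    return model
```

Saving only the inner module's state_dict is what lets the same checkpoint work both with and without the distributed wrapper.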

I believe the apex checkpointing code doesn't work with distributed training right now either, so this is something we've had to fix on our own.