Closed davidemaz closed 7 years ago
Hello, thank you for the pull request. This looks like a somewhat complicated approach; why not just define another flag, say -model_t7,
and set it appropriately from the checkpoint? It is also easy to dump the optimizer state along with the model, so you don't need to recover the learning rate.
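A minimal sketch of the suggestion above, saving the optimizer state next to the model. This is a guess at how it could look in Torch7, not the actual texture_nets code; names like optimState, net, and the -model_t7 flag are assumptions:

```lua
-- Hypothetical checkpointing sketch (not the actual texture_nets code).
-- Saving optimState alongside the model means the learning rate and
-- Adam moments survive a restart, so nothing needs to be recovered by hand.
local checkpoint = {
  model = net:clearState(),  -- drop cached activations before serializing
  optimState = optimState,   -- learningRate, moments, step count, etc.
  iteration = it,
}
torch.save(params.checkpoint_path .. '/checkpoint.t7', checkpoint)

-- On resume (e.g. guarded by the proposed -model_t7 flag):
if params.model_t7 ~= '' then
  local cp = torch.load(params.model_t7)
  net, optimState, it = cp.model, cp.optimState, cp.iteration
end
```

With this layout a single flag selects the checkpoint, and optim.adam can be handed the restored optimState directly.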
I can merge the "Display frequency parametrized"
part if you create a separate pull request.
@davidemaz Nice pull. I will be using it to resume work on models on my old GPU. It takes too long on it to run up to 50000 iterations. Having the ability to stop and resume is really nice. Thank you for the contribution.
UPDATE:
I get an out of memory error right after the "Optimize" message is displayed:
Setting up texture layer 4 : relu1_2
Setting up texture layer 9 : relu2_2
Setting up texture layer 14 : relu3_2
Setting up content layer 23 : relu4_2
Setting up texture layer 23 : relu4_2
Optimize
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-8567/cutorch/lib/THC/generic/THCStorage.cu line=40 error=2 : out of memory
Hope this can be fixed. Would really like to be able to resume.
Thank you Dmitry for your suggestions! I think it's better to close the pull request now. Maybe I will open a new separate pull request for a single feature at a time. Thanks
Quick update on the out of memory. If I reduce the image size to 450 instead of 512 I can restart the training. The strange thing is that the original model was trained at 512 and saved, but won't load, as it causes an out of memory ;-( Somehow restarting from an imported model requires more memory than just running train.lua.
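One possible explanation for the extra memory on resume (a guess, not verified against this PR): in Torch7, torch.save serializes every module's output and gradInput buffers unless clearState() is called first, so a loaded checkpoint carries those cached tensors on the GPU in addition to the buffers re-allocated at the first forward pass. A sketch of the workaround, assuming the model variable is called net:

```lua
-- Hypothetical workaround: strip cached activations before saving so the
-- checkpoint holds only the parameters, not the intermediate buffers.
net:clearState()
collectgarbage()
torch.save('model.t7', net)
```

If the existing checkpoints were saved without clearState(), that could explain why loading one at 512 runs out of memory while a fresh train.lua run at 512 does not.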
I added two commits: