Closed davidemaz closed 7 years ago
Hello, thank you for the pull request. This looks like a somewhat complicated approach; why not just define another flag, say -model_t7,
and set it appropriately from the checkpoint? It is also easy to dump the optimizer state along with the model, so you don't need to recover the learning rate.
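A minimal sketch of the suggestion above, saving the optimizer state next to the model. This is a guess at how it could look in Torch7, not the actual texture_nets code; names like optimState, net, and the -model_t7 flag are assumptions:

```lua
-- Hypothetical checkpointing sketch (not the actual texture_nets code).
-- Saving optimState alongside the model means the learning rate and
-- Adam moments survive a restart, so nothing needs to be recovered by hand.
local checkpoint = {
  model = net:clearState(),  -- drop cached activations before serializing
  optimState = optimState,   -- learningRate, moments, step count, etc.
  iteration = it,
}
torch.save(params.checkpoint_path .. '/checkpoint.t7', checkpoint)

-- On resume (e.g. guarded by the proposed -model_t7 flag):
if params.model_t7 ~= '' then
  local cp = torch.load(params.model_t7)
  net, optimState, it = cp.model, cp.optimState, cp.iteration
end
```

With this layout a single flag selects the checkpoint, and optim.adam can be handed the restored optimState directly.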
I can merge the "Display frequency parametrized"
part if you create a separate pull request.
@davidemaz Nice pull. I will be using it to resume work on models on my old GPU. It takes too long on it to run up to 50000 iterations. Having the ability to stop and resume is really nice. Thank you for the contribution.
UPDATE:
I get an out of memory error right after the "Optimize" message is displayed:
Setting up texture layer 4 : relu1_2
Setting up texture layer 9 : relu2_2
Setting up texture layer 14 : relu3_2
Setting up content layer 23 : relu4_2
Setting up texture layer 23 : relu4_2
Optimize
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-8567/cutorch/lib/THC/generic/THCStorage.cu line=40 error=2 : out of memory
Hope this can be fixed. Would really like to be able to resume.
Thank you Dmitry for your suggestions! I think it's better to close the pull request now. Maybe I will open a new separate pull request for a single feature at a time. Thanks
Quick update on the out of memory. If I reduce the image size to 450 instead of 512 I can restart the training. The strange thing is that the original model was trained at 512 and saved, but won't load, as it causes an out of memory ;-( Somehow restarting from an imported model requires more memory than just running train.lua.
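One possible explanation for the extra memory on resume (a guess, not verified against this PR): in Torch7, torch.save serializes every module's output and gradInput buffers unless clearState() is called first, so a loaded checkpoint carries those cached tensors on the GPU in addition to the buffers re-allocated at the first forward pass. A sketch of the workaround, assuming the model variable is called net:

```lua
-- Hypothetical workaround: strip cached activations before saving so the
-- checkpoint holds only the parameters, not the intermediate buffers.
net:clearState()
collectgarbage()
torch.save('model.t7', net)
```

If the existing checkpoints were saved without clearState(), that could explain why loading one at 512 runs out of memory while a fresh train.lua run at 512 does not.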
I added two commits: