Closed bkvie closed 6 years ago
Short answer:
Long answer:
`--start-epoch`, which can help you visualize progress better. Also, the decreasing learning rate policy will be modified, so that you don't begin at full power while only trying to finish your already advanced training.
Some advice on 1. and 2.:
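As a sketch of how such an epoch-based decay policy typically behaves (the halving factor and milestone epochs below are hypothetical, not the repository's actual values), passing the true epoch number via `--start-epoch` lets a resumed run pick up the schedule at the right strength instead of restarting at the initial learning rate:

```python
def adjust_learning_rate(param_groups, epoch, lr0=1e-4, milestones=(100, 150, 200)):
    """Halve the learning rate once for every milestone epoch already passed.

    `param_groups` mirrors `optimizer.param_groups`: a list of dicts with
    an 'lr' key. Starting from --start-epoch instead of 0 means a resumed
    run continues the decay where the interrupted one left off.
    """
    lr = lr0 * 0.5 ** sum(epoch >= m for m in milestones)
    for group in param_groups:
        group['lr'] = lr
    return lr
```

For example, resuming at epoch 120 gives half the initial rate, while a fresh run at epoch 0 would (wrongly, for a resumed model) use the full `lr0`.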
Tensorboard is designed in a way that if you have two event files with the same title (the title, not the file name), the graphs will be concatenated. So you can resume training with `--pretrained` and `--start-epoch`, and after that, manually take the two event files you got and put them in the same folder.
Now start tensorboard with `--logdir` pointing at this very folder, and you should see a nice continuous progress plot, at least for validation values.
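Concretely, the merge step is just copying both event files into one folder (all directory and file names below are made up; substitute your own run directories):

```shell
# Simulate two training runs, each having written its own
# TensorBoard event file (names here are hypothetical):
mkdir -p runs/first_training runs/resumed_training merged_logs
touch runs/first_training/events.out.tfevents.1000.host
touch runs/resumed_training/events.out.tfevents.2000.host

# Put both event files in the same folder so the curves are concatenated:
cp runs/first_training/events.out.tfevents.* merged_logs/
cp runs/resumed_training/events.out.tfevents.* merged_logs/

# Then point TensorBoard at that folder:
# tensorboard --logdir merged_logs
ls merged_logs
```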
Nice, thanks a lot! So setting the start epoch manually is meant for:
1. tensorboard
2. adaptive learning rate!
I am running on KITTI, SINTEL and will forward you results if meaningful!
Sweet!
One more thing that I have not implemented: if you want `--pretrained` and `--start-epoch` to be the exact same as resuming training, you might have to save and reload the optimizer state as well, which keeps the 1st and 2nd order momentums (or momenta? :thinking:). When not doing that, you might see some effects at the beginning of training (when they are initialized at 0 instead of the last training's values), but Adam is supposed to be robust to that.
Anyway, it can be a good thing to try if your training resume routine does not act as intended.
How would I reload the optimizer state? For future reference, a resume training command would be:

```
python main.py 'path to train data set' --start-epoch 'start at epoch previously ended' --pretrained 'path to model ... /model_best.pth.tar' -b # -j # -a flownets
```

where the previous model is saved as `model_best.pth.tar`, not `checkpoint.pth.tar`.
Optimizer state reloading is not implemented here, but it works the same as model loading, with the functions `state_dict()` and `load_state_dict()`.
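A minimal sketch of that symmetry, using a toy model and optimizer (not the repository's actual code; assumes PyTorch is installed). The optimizer's `state_dict()` holds the Adam momentum buffers, and `load_state_dict()` restores them:

```python
import torch

# Toy stand-ins for the real network and its Adam optimizer.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step so the optimizer accumulates its
# 1st/2nd order momentum buffers (exp_avg, exp_avg_sq).
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Save both states together, just like a model checkpoint.
torch.save({'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict()},
           'checkpoint.pth.tar')

# Later: rebuild the objects, then restore both states before resuming.
model2 = torch.nn.Linear(4, 2)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
checkpoint = torch.load('checkpoint.pth.tar')
model2.load_state_dict(checkpoint['state_dict'])
optimizer2.load_state_dict(checkpoint['optimizer'])
```

After `load_state_dict`, the reloaded optimizer's momentum buffers match the saved ones, so the next `optimizer2.step()` behaves as if training had never stopped.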
Also be careful with `model_best.pth.tar`: if you need to restore optimizer state, it won't necessarily be the last saved network, and loading a network and an optimizer from 2 different epochs is the same as not loading optimizer state at all, because all the momentums will be off. `checkpoint.pth.tar` is just what you want, since it's the last saved state of the network; it's not necessarily the best, but it is certain to be consistent with the optimizer state (provided you also save it at each epoch).
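The usual way to keep the two files consistent (this mirrors a common PyTorch checkpointing pattern; treat it as a sketch, not this repository's exact code) is to always write `checkpoint.pth.tar` with model and optimizer state captured together, and make `model_best.pth.tar` a plain copy of it, never a separately assembled file:

```python
import shutil

import torch


def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    # Always save the latest epoch's state: model weights AND optimizer
    # state from the SAME epoch, so momentums stay consistent.
    torch.save(state, filename)
    # The best model is just a copy of a consistent checkpoint.
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')


# Hypothetical end-of-epoch usage:
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())
save_checkpoint({'epoch': 1,
                 'state_dict': model.state_dict(),
                 'optimizer': optimizer.state_dict()},
                is_best=True)
```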