ClementPinard / FlowNetPytorch

Pytorch implementation of FlowNet by Dosovitskiy et al.
MIT License

Resume training/Fine tune #31

Closed bkvie closed 6 years ago

bkvie commented 6 years ago
  1. Using --pretrained to continue training on a previously trained dataset seems to overwrite results?
  2. Is it meant to be used with --start-epoch to manually specify the epoch to restart from?
  3. Is it possible to fine-tune weights across datasets? I.e. pretrain on Flying Chairs and then use these weights to fine-tune on KITTI (similar to ImageNet initialization of deep networks)?
ClementPinard commented 6 years ago

Short answer :

  1. No
  2. Not necessarily
  3. Yes

Long answer :

  1. What do you mean by overwrite results ? It does start a new tensorboard graph, but it should not overwrite your pretrained network, nor should it behave as if the network had not been trained at all.
  2. As said above, a new tensorboard session will still be opened, but x-values will begin at --start-epoch, which can help you visualize progress better. The decreasing learning rate policy is also shifted accordingly, so that you don't restart at full learning rate when you are only finishing an already advanced training.
  3. Yes ! You can try Flying Chairs -> KITTI, it works like a charm ! Maybe I should add a little How To on the README for it, along with the results I got (they're good !)
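For reference, such a fine-tuning run could look something like this. The paths and the batch/worker values are placeholders, and the flags are just the ones mentioned in this thread (--pretrained, -a flownets, -b, -j), so double-check them against `main.py --help`:

```shell
# Sketch of a Flying Chairs -> KITTI fine-tuning run.
# Both paths below are placeholders; -b/-j values are examples.
python main.py /path/to/KITTI \
    --pretrained /path/to/flying_chairs_run/checkpoint.pth.tar \
    -a flownets -b 8 -j 8
```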

Some advice on 1. and 2. : Tensorboard is designed so that if you have two event files with the same title (the title, not the file name), the graphs get concatenated. So you can resume training with --pretrained --start-epoch, and afterwards manually take the two event files you got and put them in the same folder. Now start tensorboard with --logdir pointing to that very folder, and you should see a nice continuous progress plot, at least for validation values.
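Concretely, that merge step could look like this (the directory names here are hypothetical):

```shell
# Hypothetical layout: run1/ holds the original training's event file,
# run2/ holds the resumed training's. Copy both into one folder and
# point tensorboard at it so the curves are concatenated.
mkdir -p merged_run
cp run1/events.out.tfevents.* merged_run/
cp run2/events.out.tfevents.* merged_run/
tensorboard --logdir merged_run
```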

bkvie commented 6 years ago

Nice, thanks a lot! So that's what setting the start epoch manually is meant for.

I am running on KITTI and Sintel and will forward you the results if they're meaningful!

ClementPinard commented 6 years ago

Sweet !

One more thing that I have not implemented : if you want --pretrained and --start-epoch to be the exact same as resuming training, you might have to save and reload the optimizer state as well, which keeps the 1st and 2nd order momentums (or momenta ? :thinking: ). When not doing that, you might see some effects at the beginning of training (when they are initialized at 0 instead of the last training's values), but Adam is supposed to be robust to that.

Anyway, it can be a good thing to try if your training resume routine does not act as intended.

bkvie commented 6 years ago

How would I reload the optimizer state? For future reference a resume training command would be:

```
python main.py 'path to train data set' --start-epoch 'start at epoch previously ended' --pretrained 'path to model ... /model_best.pth.tar' -b # -j # -a flownets
```

where the previous model is saved as `model_best.pth.tar`, not `checkpoint.pth.tar`

ClementPinard commented 6 years ago

Optimizer state reloading is not implemented here, but it works the same way as model loading, with the functions `state_dict()` and `load_state_dict()`

Also be careful with model_best.pth.tar if you need to restore optimizer state : it won't necessarily be the last saved network, and loading a network and an optimizer from 2 different epochs is the same as not loading the optimizer state at all, because all the momentums will be off. checkpoint.pth.tar is just what you want since it's the last saved state of the network, which is not necessarily the best, but is certain to be consistent with the optimizer state (provided you also save it at each epoch)
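A minimal sketch of that save/reload, with a toy `nn.Linear` standing in for FlowNet (the checkpoint dictionary keys and the file name below are made up for the example; the repo itself does not save the optimizer state):

```python
import torch

# Toy stand-ins for the real model/optimizer; FlowNet would work the same way.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One dummy step so Adam actually has 1st/2nd order moment buffers to save.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# Save both states together (the key names here are arbitrary).
torch.save({'state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()},
           'checkpoint_with_optim.pth.tar')

# Later, to resume: rebuild the same model/optimizer and reload both states,
# so the moment estimates continue from their saved values instead of 0.
model2 = torch.nn.Linear(4, 2)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
checkpoint = torch.load('checkpoint_with_optim.pth.tar')
model2.load_state_dict(checkpoint['state_dict'])
optimizer2.load_state_dict(checkpoint['optimizer_state_dict'])
```

The important point from above is that both state dicts come from the same save, so the restored momentums stay consistent with the restored weights.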