Issue with resuming training from a checkpoint

dapello commented 4 years ago

I noticed that when resuming training from checkpoint using the CLI, the checkpoint weights are loaded, but the current epoch and learning schedule are not resumed.

Unless I'm missing something, it looks like it's because the checkpoint is not passed to train_model; see https://github.com/MadryLab/robustness/blob/6347646ae47120d35f47220eabd507cebbd6c914/robustness/main.py#L57

Seems like an easy fix, as the infrastructure to receive the checkpoint in train_model exists -- when I pass the checkpoint myself, I only get one other error, from https://github.com/MadryLab/robustness/blob/6347646ae47120d35f47220eabd507cebbd6c914/robustness/train.py#L258 where it looks like the checkpoint doesn't have natural or adversarial scores saved.

Thoughts?

andrewilyas commented 4 years ago

Thanks for pointing these out! I think these issues are actually already patched on the develop branch, which we'll merge into master soon.

A brief summary of the workflow in develop:

There is a resume_optimizer argument, if it is False then things proceed as in master, if it is true it passes the checkpoint on to train (this argument is because for some cases like fine-tuning on other datasets, you might want to only resume the model and not the optimizer).
Right now if train() can't find a best_prec1 key (as is the case for some old checkpoints, like you noticed), it just sets best_prec1 = 0 so that the model is immediately overwritten. We're planning on changing this so that if there's no best_prec1 key the model is immediately re-evaluated (before any training happens), in case it degrades from epoch 0.

Let us know if you have thoughts on this

dapello commented 4 years ago

Thanks for the speedy reply! Both of these developments make sense to me. I've patched my local version so it's working fine for me now, and I'll keep an eye for the latest version.

Also, thanks for making this library for public use, it's an excellent research tool!

MadryLab / robustness

Issue with resuming training from a checkpoint #41