MadryLab / robustness

A library for experimenting with, training and evaluating neural networks, with a focus on adversarial robustness.
MIT License

Issue with resuming training from a checkpoint #41

Closed · dapello closed 4 years ago

dapello commented 4 years ago

I noticed that when resuming training from a checkpoint using the CLI, the checkpoint weights are loaded, but the current epoch and learning-rate schedule are not restored.

Unless I'm missing something, it looks like it's because the checkpoint is not passed to train_model; see https://github.com/MadryLab/robustness/blob/6347646ae47120d35f47220eabd507cebbd6c914/robustness/main.py#L57

Seems like an easy fix, since the infrastructure for receiving the checkpoint in train_model already exists. When I pass the checkpoint myself, I hit only one other error, at https://github.com/MadryLab/robustness/blob/6347646ae47120d35f47220eabd507cebbd6c914/robustness/train.py#L258, where it looks like the checkpoint doesn't have natural or adversarial scores saved.
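For anyone hitting the same error, a defensive resume helper can work around the missing-score case. This is a minimal sketch of the idea, not the library's actual code: the key names `epoch`, `nat_prec1`, and `adv_prec1` are assumptions about the checkpoint dict's layout.

```python
def resume_state(checkpoint):
    """Extract resume info from a checkpoint dict, tolerating missing keys.

    Hypothetical sketch: the keys 'epoch', 'nat_prec1', and 'adv_prec1'
    are assumed names, not necessarily what robustness actually saves.
    """
    # Resume at the epoch after the one that was saved.
    start_epoch = checkpoint.get("epoch", -1) + 1
    # Older checkpoints may predate the score fields; default to -inf so
    # the first post-resume evaluation becomes the new best.
    best_nat = checkpoint.get("nat_prec1", float("-inf"))
    best_adv = checkpoint.get("adv_prec1", float("-inf"))
    return start_epoch, best_nat, best_adv

# Usage: a checkpoint saved before the score fields existed.
old_ckpt = {"epoch": 9, "model": {}}
print(resume_state(old_ckpt))  # -> (10, -inf, -inf)
```

The `.get(..., default)` fallbacks are the whole fix: they let a checkpoint written by an older version resume cleanly instead of raising a KeyError.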

Thoughts?

andrewilyas commented 4 years ago

Thanks for pointing these out! I think these issues are actually already patched on the develop branch, which we'll merge into master soon.

A brief summary of the workflow in develop:

Let us know if you have thoughts on this.

dapello commented 4 years ago

Thanks for the speedy reply! Both of these changes make sense to me. I've patched my local version so it's working fine for me now, and I'll keep an eye out for the latest version.

Also, thanks for making this library for public use, it's an excellent research tool!