MyrtleSoftware / myrtlespeech

Speech recognition

state_dict saving #19

Open julianmack opened 4 years ago

julianmack commented 4 years ago

Small PR updating the seq_to_seq state dict to include the learning-rate scheduler and optimizer state so that crashed training runs can be resumed w/o disruption.
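Conceptually, the checkpoint now bundles everything needed to resume. A minimal sketch of the saving side, assuming hypothetical attribute names (`model`, `optim`, `lr_scheduler`) on the seq_to_seq instance rather than the PR's actual code:

```python
import torch


def save_training_state(seq_to_seq, path: str, step: int) -> None:
    """Save model, optimizer and scheduler state in one checkpoint.

    ``seq_to_seq.model``, ``seq_to_seq.optim`` and
    ``seq_to_seq.lr_scheduler`` are assumed attribute names for
    illustration only.
    """
    torch.save(
        {
            "model": seq_to_seq.model.state_dict(),
            "optim": seq_to_seq.optim.state_dict(),
            "lr_scheduler": seq_to_seq.lr_scheduler.state_dict(),
            "step": step,
        },
        path,
    )
```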

I've also added a helper function that, given a state_dict filepath, loads the dict into a seq_to_seq instance and returns the training state needed by the CallbackHandler. It needs this state so that the Saver callback can keep saving with the naming convention state_dict_x and the TensorBoardLogger can resume from the same step.
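The loading side might look like the sketch below, again with assumed attribute and key names mirroring the hypothetical `save_training_state` above; the returned step is what the CallbackHandler would hand to the Saver and TensorBoardLogger:

```python
import torch


def load_training_state(seq_to_seq, path: str) -> int:
    """Restore a checkpoint into ``seq_to_seq`` and return the step.

    Attribute and key names mirror the hypothetical
    ``save_training_state`` sketch above.
    """
    state = torch.load(path, map_location="cpu")
    seq_to_seq.model.load_state_dict(state["model"])
    seq_to_seq.optim.load_state_dict(state["optim"])
    seq_to_seq.lr_scheduler.load_state_dict(state["lr_scheduler"])
    return state["step"]
```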

I have not added tests for this functionality here, but I have added one in the lr scheduler PR: https://github.com/MyrtleSoftware/myrtlespeech/pull/22 (which is blocked by the rnnt PR). The reason is that state_dict restoration is relatively simple in this case but becomes much more involved once learning-rate warm-up is added.

julianmack commented 4 years ago

I've made all the suggested changes, except that there is already an lr_scheduler in master, so I've kept it and treated it accordingly.

Putting this up for re-review.

julianmack commented 4 years ago

> Is the lr_scheduler in master? If it is, can this PR be rebased on it (I can't see it!)? If it isn't, can the lr_scheduler code be pushed to a different PR?

I think my first comment on the lr_scheduler was unclear @samgd. There is already an lr_scheduler in master, but it was not added by me (Giuseppe added it a while back): https://github.com/MyrtleSoftware/myrtlespeech/blob/d4a0003197f4512eb4ca9b1cda74358456a77990/src/myrtlespeech/builders/task_config.py#L94

master does not have my lr_warmup and polynomial_lr changes: these are in an incoming PR (https://github.com/MyrtleSoftware/myrtlespeech/pull/22) and I agree that they should be considered separately!

I will respond to the rest of the comments above shortly, but I wanted to flag this high-level point now.

samgd commented 4 years ago

> I think my first comment on the lr_scheduler was unclear @samgd. There is already an lr_scheduler in master, but it was not added by me (Giuseppe added it a while back):

Ah, I see now, thanks. What was throwing me off was that the SeqToSeq class neither takes an lr_scheduler parameter in __init__ nor documents one in its docstring - I had forgotten it was being set in the builder. I'll make an issue for this as it should be explicit and obvious!
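For illustration, making the dependency explicit might look something like the sketch below; the class body and attribute names are assumptions, not the repo's actual SeqToSeq:

```python
from typing import Optional

import torch


class SeqToSeq:
    """Wraps a model together with its training components.

    Args:
        model: The network to train.
        optim: The optimizer for ``model``'s parameters.
        lr_scheduler: Optional learning-rate scheduler. Making it an
            explicit ``__init__`` parameter (rather than setting it in
            the builder) keeps the dependency obvious.
    """

    def __init__(
        self,
        model: torch.nn.Module,
        optim: torch.optim.Optimizer,
        lr_scheduler: Optional[
            torch.optim.lr_scheduler._LRScheduler
        ] = None,
    ):
        self.model = model
        self.optim = optim
        self.lr_scheduler = lr_scheduler
```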

julianmack commented 4 years ago

Responded to changes and added a test.

I've added a check_state_dicts_match function in test.utils.utils. Although it is currently used only in the new test, I've placed it there because I want to reuse it in another test in my incoming lr_warmup PR. I've also disabled the builders/test_task_config.py test, as this functionality is necessarily tested at the start of the new Saver test.
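A comparison helper along these lines could work as follows; this is a sketch, not the exact implementation (a full version would also recurse into the nested containers found in optimizer state dicts):

```python
import torch


def check_state_dicts_match(dict_a: dict, dict_b: dict) -> None:
    """Assert two state dicts have the same keys and equal values.

    Tensors are compared with ``torch.equal``; other values with
    ``==``. Sketch only: does not recurse into nested containers.
    """
    assert dict_a.keys() == dict_b.keys()
    for key, val_a in dict_a.items():
        val_b = dict_b[key]
        if isinstance(val_a, torch.Tensor):
            assert torch.equal(val_a, val_b), key
        else:
            assert val_a == val_b, key
```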

julianmack commented 4 years ago

> Ah, I see now, thanks. What was throwing me off was that the SeqToSeq class neither takes an lr_scheduler parameter in __init__ nor documents one in its docstring - I had forgotten it was being set in the builder. I'll make an issue for this as it should be explicit and obvious!

Also, on the above point: in the lr_warmup PR, I've refactored the lr_scheduler creation and added a proper builder with Sphinx docs.
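For a rough idea of the shape of such a builder, here is a hypothetical signature with a Sphinx-style docstring, not the PR's actual code:

```python
import torch


def build_lr_scheduler(
    optimizer: torch.optim.Optimizer, step_size: int, gamma: float = 0.1
) -> torch.optim.lr_scheduler.StepLR:
    """Build a learning-rate scheduler for ``optimizer``.

    Args:
        optimizer: The :py:class:`torch.optim.Optimizer` to schedule.
        step_size: Period, in epochs, of the learning-rate decay.
        gamma: Multiplicative factor of the learning-rate decay.

    Returns:
        A :py:class:`torch.optim.lr_scheduler.StepLR`.
    """
    return torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=step_size, gamma=gamma
    )
```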

julianmack commented 4 years ago

Responded to comments. A few discussions are still outstanding.