Closed: vhewes closed this issue 1 year ago.
Okay, I know the issue. Just to unblock you, can you use:

```python
self.log('learning rate', self.trainer.optimizers[0].state_dict()['param_groups'][0]['lr'])
```
This was fixed in https://github.com/Lightning-AI/lightning/pull/18280. See my full reply on another issue here: https://github.com/Lightning-AI/lightning/issues/17296#issuecomment-1726715614
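For readers landing here, a minimal sketch of where such a workaround call could live, e.g. inside `training_step` of a `LightningModule`. The module below is purely illustrative and not taken from the original thread:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Workaround: read the current learning rate straight from the
        # optimizer's state dict instead of asking the scheduler.
        self.log(
            "learning rate",
            self.trainer.optimizers[0].state_dict()["param_groups"][0]["lr"],
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```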
Bug description
I recently adapted a network architecture to a `LightningModule`, and find that when resuming an in-progress training run from a checkpoint file, the state of the `OneCycleLR` scheduler is not properly restored. I've tested with version 1.8.0 and confirmed that the issue persists.

The example pasted below will run a full end-to-end training of a toy model and dataset, then train the same model halfway to completion, and finally load the checkpoint file from the halfway-trained model and train it the rest of the way. The example plots the learning rate in both cases, demonstrating that in the latter case the learning rate scheduler's internal state is not restored successfully when loading from the checkpoint file.
How to reproduce the bug
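The script attached to the original report is not reproduced in this copy of the issue. The sketch below follows the steps described above: a toy model and dataset with a `OneCycleLR` scheduler stepped per batch, trained once end-to-end, once halfway, and then resumed from the halfway checkpoint. All class names, dataset sizes, and hyperparameters are illustrative, and the learning-rate plot is replaced by a simple print of the recorded values.

```python
# Illustrative reproduction sketch; not the original report's script.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self, total_steps):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(4, 1)
        self.lr_history = []

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Record the learning rate actually applied by the optimizer at this step.
        self.lr_history.append(self.trainer.optimizers[0].param_groups[0]["lr"])
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=0.1, total_steps=self.hparams.total_steps
        )
        # Step the scheduler every batch, as OneCycleLR expects.
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }


def make_loader():
    x = torch.randn(64, 4)
    y = torch.randn(64, 1)
    return DataLoader(TensorDataset(x, y), batch_size=8)


if __name__ == "__main__":
    loader = make_loader()
    max_epochs = 10
    total_steps = len(loader) * max_epochs

    # 1) Full end-to-end training.
    full = ToyModel(total_steps)
    pl.Trainer(max_epochs=max_epochs, enable_progress_bar=False).fit(full, loader)

    # 2) Train halfway to completion, keeping the checkpoint.
    half = ToyModel(total_steps)
    trainer = pl.Trainer(max_epochs=max_epochs // 2, enable_progress_bar=False)
    trainer.fit(half, loader)
    ckpt_path = trainer.checkpoint_callback.best_model_path

    # 3) Resume from the checkpoint and train the rest of the way.
    #    The resumed LR curve should continue where (2) left off and
    #    match the second half of (1) if the scheduler state is restored.
    resumed = ToyModel(total_steps)
    pl.Trainer(max_epochs=max_epochs, enable_progress_bar=False).fit(
        resumed, loader, ckpt_path=ckpt_path
    )

    print("full run LR tail   :", full.lr_history[-3:])
    print("resumed run LR tail:", resumed.lr_history[-3:])
```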
Error messages and logs
No response
Environment
More info
The environment provided above is my local machine, where I constructed the toy example, but I also observe the same issue on an Nvidia GPU cluster and in HPC environments, so it is not localised to a specific architecture.
cc @rohitgr7