NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Learning rate error when continuing training #963

Closed. TtCWH closed this issue 1 month ago.

TtCWH commented 1 month ago

Describe the bug I have completed the first stage of model training. The first-stage settings were lr=3e-4 and min_lr=3e-5; the second-stage settings are lr=3e-5 and min_lr=2e-5. I also enabled the three flags --reset-dataloader --override-opt_param-scheduler --reset-iteration. In the output log, lr and min_lr are indeed overridden to the second-stage values. However, after the first-stage checkpoint is loaded and training starts, lr reverts to 3e-4.

To Reproduce

  1. Train with lr=3e-4 and min_lr=3e-5, and stop the run after the first checkpoint is saved;
  2. Set load_checkpoint_path to that first checkpoint's path, set lr=3e-5 and min_lr=2e-5, enable --reset-dataloader --override-opt_param-scheduler --reset-iteration, disable warmup, then resume training;
  3. After training starts, the learning rate at the first step is still 3e-4 (a quick check of the effective lr is sketched after this list).
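A quick way to confirm which value is actually in effect (not part of the original report, just an illustration) is to print the learning rate held by the optimizer's param groups right after resuming; a toy optimizer is used below so the snippet runs standalone:

```python
import torch

# Illustration only: print the lr each param group will actually use on the next step.
# In Megatron-LM this would be the optimizer built by the training loop; a toy
# AdamW optimizer stands in for it here so the snippet is self-contained.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-5)

for i, group in enumerate(optimizer.param_groups):
    print(f"param group {i}: lr = {group['lr']:.0e}")  # expect 3e-05 in stage 2
```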

Expected behavior In step 3 above, after training starts, the learning rate at the first step should be 3e-5.

Stack trace/logs None

Environment (please complete the following information):

Proposed fix Re-apply the command-line lr/min_lr settings after the checkpoint has been loaded, so the checkpointed scheduler state cannot overwrite them.
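A minimal sketch of that idea, assuming a scheduler that exposes its lr bounds as max_lr/min_lr attributes similar to Megatron's OptimizerParamScheduler (the names below are assumptions, not verified internals):

```python
# Sketch of the proposed fix (assumed names, not verified Megatron-LM internals):
# anything that re-applies the command-line lr must run AFTER the checkpoint load,
# otherwise the restored scheduler state wins.
def resume_with_new_lr(args, model, optimizer, opt_param_scheduler, load_checkpoint):
    iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
    # Re-assert the stage-2 settings on top of whatever the checkpoint restored.
    for group in optimizer.param_groups:
        group['lr'] = args.lr
    opt_param_scheduler.max_lr = args.lr      # assumed attribute name
    opt_param_scheduler.min_lr = args.min_lr  # assumed attribute name
    return iteration
```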

Additional context

TtCWH commented 1 month ago

I have solved this problem: the --no-load-optim flag should be set if you don't want the optimizer to load its state from the checkpoint.
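For context, a simplified illustration (not the actual Megatron-LM loading code) of why that flag helps: when the optimizer and scheduler state in the checkpoint is skipped, the values built from the new command-line args stay in effect:

```python
# Simplified illustration, not the real Megatron-LM code: with --no-load-optim the
# optimizer/scheduler state saved in stage 1 is never restored, so the stage-2
# lr and min_lr produced from the new args remain in effect.
def load_checkpoint_sketch(state_dict, optimizer, opt_param_scheduler, load_optim=True):
    if load_optim:
        # Restores Adam moments AND the stage-1 scheduler state, which is what
        # brings the old 3e-4 schedule back.
        optimizer.load_state_dict(state_dict['optimizer'])
        opt_param_scheduler.load_state_dict(state_dict['opt_param_scheduler'])
    # When load_optim is False (i.e. --no-load-optim), both loads are skipped.
```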

TissueC commented 2 weeks ago

I think this issue should be reconsidered. There should be a way to override the learning-rate scheduler while still loading the optimizer state.
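One possible shape for that, sketched under the assumption that the optimizer state can be loaded first and the lr-related fields overwritten afterwards from the new args (the helper and attribute names are illustrative, not an existing Megatron-LM API):

```python
# Hypothetical helper, not an existing Megatron-LM API: keep the optimizer state
# (Adam moments, step counts) from the checkpoint but force the new lr schedule.
def load_optim_but_override_lr(state_dict, args, optimizer, opt_param_scheduler):
    optimizer.load_state_dict(state_dict['optimizer'])  # keep stage-1 optimizer state
    for group in optimizer.param_groups:
        group['lr'] = args.lr                           # lr used on the next step
    opt_param_scheduler.max_lr = args.lr                # assumed attribute name
    opt_param_scheduler.min_lr = args.min_lr            # assumed attribute name
```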