NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Learning rate error when continuing training #963

Closed. TtCWH closed this issue 1 month ago.

TtCWH commented 1 month ago

Describe the bug I have completed the first stage of model training. The first-stage settings were lr=3e-4 and min_lr=3e-5; the second-stage settings are lr=3e-5 and min_lr=2e-5. I also enabled the three flags --reset-dataloader --override-opt_param-scheduler --reset-iteration. In the output log, lr and min_lr are indeed overridden to the second-stage values. However, after the first-stage checkpoint is loaded and training starts, lr reverts to 3e-4.

To Reproduce

  1. Train with lr=3e-4 and min_lr=3e-5, and stop the run after the first checkpoint is saved;
  2. Set load_checkpoint_path to that first checkpoint's path, set lr=3e-5 and min_lr=2e-5, enable --reset-dataloader --override-opt_param-scheduler --reset-iteration, disable warmup, then resume training;
  3. After training starts, the learning rate at the first step is still 3e-4 (a quick check of the effective lr is sketched after this list).
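A quick way to confirm which value is actually in effect (not part of the original report, just an illustration) is to print the learning rate held by the optimizer's param groups right after resuming; a toy optimizer is used below so the snippet runs standalone:

```python
import torch

# Illustration only: print the lr each param group will actually use on the next step.
# In Megatron-LM this would be the optimizer built by the training loop; a toy
# AdamW optimizer stands in for it here so the snippet is self-contained.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-5)

for i, group in enumerate(optimizer.param_groups):
    print(f"param group {i}: lr = {group['lr']:.0e}")  # expect 3e-05 in stage 2
```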

Expected behavior In step 3 above, after training starts, the learning rate at the first step should be 3e-5.

Stack trace/logs None

Environment (please complete the following information):

Proposed fix Re-apply the command-line lr/min_lr settings after the checkpoint has been loaded, so the checkpointed scheduler state cannot overwrite them.
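A minimal sketch of that idea, assuming a scheduler that exposes its lr bounds as max_lr/min_lr attributes similar to Megatron's OptimizerParamScheduler (the names below are assumptions, not verified internals):

```python
# Sketch of the proposed fix (assumed names, not verified Megatron-LM internals):
# anything that re-applies the command-line lr must run AFTER the checkpoint load,
# otherwise the restored scheduler state wins.
def resume_with_new_lr(args, model, optimizer, opt_param_scheduler, load_checkpoint):
    iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
    # Re-assert the stage-2 settings on top of whatever the checkpoint restored.
    for group in optimizer.param_groups:
        group['lr'] = args.lr
    opt_param_scheduler.max_lr = args.lr      # assumed attribute name
    opt_param_scheduler.min_lr = args.min_lr  # assumed attribute name
    return iteration
```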

Additional context

TtCWH commented 1 month ago

I have solved this problem: the --no-load-optim flag should be set if you don't want the optimizer to load its state from the checkpoint.
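For context, a simplified illustration (not the actual Megatron-LM loading code) of why that flag helps: when the optimizer and scheduler state in the checkpoint is skipped, the values built from the new command-line args stay in effect:

```python
# Simplified illustration, not the real Megatron-LM code: with --no-load-optim the
# optimizer/scheduler state saved in stage 1 is never restored, so the stage-2
# lr and min_lr produced from the new args remain in effect.
def load_checkpoint_sketch(state_dict, optimizer, opt_param_scheduler, load_optim=True):
    if load_optim:
        # Restores Adam moments AND the stage-1 scheduler state, which is what
        # brings the old 3e-4 schedule back.
        optimizer.load_state_dict(state_dict['optimizer'])
        opt_param_scheduler.load_state_dict(state_dict['opt_param_scheduler'])
    # When load_optim is False (i.e. --no-load-optim), both loads are skipped.
```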

TissueC commented 2 weeks ago

I think this issue should be reconsidered. There should be a way to override the learning-rate scheduler while still loading the optimizer state.
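One possible shape for that, sketched under the assumption that the optimizer state can be loaded first and the lr-related fields overwritten afterwards from the new args (the helper and attribute names are illustrative, not an existing Megatron-LM API):

```python
# Hypothetical helper, not an existing Megatron-LM API: keep the optimizer state
# (Adam moments, step counts) from the checkpoint but force the new lr schedule.
def load_optim_but_override_lr(state_dict, args, optimizer, opt_param_scheduler):
    optimizer.load_state_dict(state_dict['optimizer'])  # keep stage-1 optimizer state
    for group in optimizer.param_groups:
        group['lr'] = args.lr                           # lr used on the next step
    opt_param_scheduler.max_lr = args.lr                # assumed attribute name
    opt_param_scheduler.min_lr = args.min_lr            # assumed attribute name
```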