Closed — QizhiPei closed this issue 1 year ago.
@QizhiPei Hi. See my notebook for more details. Yes, you can resume training https://github.com/iSevenDays/nanoT5/blob/main/nanoT5/train.ipynb
Thanks for your kind help!
Sorry for the late reply!
@QizhiPei I'm using the HF Accelerator to save the state (Code Pointer). It should be quite easy to load that state and resume the pre-training process; you can check out the HF tutorial here. It's basically a one-liner like accelerator.load_state(path_to_checkpoint) in main.py, before the train call.
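For reference, here is a minimal, self-contained sketch of how the save_state / load_state pair fits together. This is not nanoT5's exact code: the tiny model, optimizer, scheduler, and checkpoint path are made-up placeholders.

```python
# Hedged sketch: saving and restoring training state with HF Accelerate.
# The model, optimizer, scheduler and paths below are placeholders,
# not nanoT5's actual objects or directory layout.
import torch
from accelerate import Accelerator

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)

accelerator = Accelerator()
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)

# During training, the full state is periodically written out, e.g.:
accelerator.save_state("checkpoints/step-1000")  # writes model weights, optimizer.bin, scheduler.bin, RNG states

# Later, before entering the train loop again, restore everything in one call:
accelerator.load_state("checkpoints/step-1000")
```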
Continuing pre-training beyond the 2**16 steps set by default is slightly more complex, because you'd need to adjust the LR scheduler appropriately.
Let me know if it works!
Thanks for your suggestions!
I successfully loaded the saved checkpoints. However, it seems that accelerator.load_state also loads the scheduler state. Could you kindly explain in more detail what "adjust the LR scheduler appropriately" means?
Thanks again!
Sorry for the possible confusion. I see two scenarios for resuming the pre-training process. If you simply want to resume an interrupted run and finish the default 2^16 steps, then accelerator.load_state is enough: it loads the LR scheduler and all the other states correctly, and pre-training resumes until it finishes after 2^16 steps. If instead you want to continue pre-training beyond 2^16 steps, you'd need to adjust the LR scheduler appropriately, as mentioned above. Good luck, and ask if you have any further questions!
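If it helps, here is one hedged way the scheduler adjustment could look when extending training beyond 2^16 steps: rebuild the schedule with the new, longer horizon and fast-forward it to the step you are resuming from. The warmup value, the cosine schedule, and the step counts below are illustrative assumptions, not nanoT5's actual configuration.

```python
# Hedged sketch: rebuild the LR schedule for a longer run and fast-forward it.
# The schedule type and all numbers are illustrative, not nanoT5's real config.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-2)

resume_step = 2**16        # step at which the previous run finished
new_total_steps = 2**17    # new, longer training horizon (assumption)

lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,            # keep whatever warmup you originally used
    num_training_steps=new_total_steps,
)

# Fast-forward the fresh schedule so the LR matches the resume point
for _ in range(resume_step):
    lr_scheduler.step()
```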
Hi,
Thank you for your good work!
I want to know whether nanoT5 supports resuming the training process from a saved checkpoint, including the model (pytorch_model.bin), optimizer state (optimizer.bin), and LR scheduler (scheduler.bin). I would really appreciate it if you could give me a simple example of this. Thanks a lot!