PiotrNawrot / nanoT5

Fast & Simple repository for pre-training and fine-tuning T5-style models
Apache License 2.0

Resume the pre-training process #7

Closed. QizhiPei closed this issue 1 year ago.

QizhiPei commented 1 year ago

Hi,

Thank you for your great work!

I'd like to know whether nanoT5 supports resuming training from a saved checkpoint, including the model (pytorch_model.bin), optimizer state (optimizer.bin), and LR scheduler (scheduler.bin). I would really appreciate a simple example of how to do this.

Thanks a lot!

iSevenDays commented 1 year ago

@QizhiPei Hi. Yes, you can resume training. See my notebook for details: https://github.com/iSevenDays/nanoT5/blob/main/nanoT5/train.ipynb

QizhiPei commented 1 year ago

Thanks for your kind help!

PiotrNawrot commented 1 year ago

Sorry for the late reply!

@QizhiPei I'm using the HF Accelerator to save the state (see the code pointer in the repo). It should be quite easy to load that state to resume the pre-training process; you can check out the HF tutorial. It should basically be a one-liner like `accelerator.load_state(path_to_checkpoint)` in main.py, before the train call.
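
For illustration, a minimal sketch of what that could look like. The path and the surrounding setup are hypothetical, not the exact nanoT5 code; it assumes the checkpoint directory was written earlier with `accelerator.save_state(...)`:

```python
# Minimal sketch, assuming a checkpoint directory previously written with
# accelerator.save_state(...). The path below is hypothetical.
from accelerate import Accelerator

accelerator = Accelerator()

# Build model / optimizer / lr_scheduler / dataloaders as usual, then wrap them:
# model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
#     model, optimizer, train_dataloader, lr_scheduler
# )

path_to_checkpoint = "checkpoints/checkpoint-20000"  # hypothetical path
accelerator.load_state(path_to_checkpoint)           # restores model, optimizer, scheduler, RNG states

# ...then run the usual train(...) loop to continue from the restored step.
```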

Continuing pre-training beyond the 2**16 steps set by default is slightly more complex, because you'd need to adjust the LR scheduler appropriately.

Let me know if it works!

QizhiPei commented 1 year ago

Thanks for your suggestions!

I successfully loaded the saved checkpoint. However, it seems that accelerator.load_state also loads the scheduler state. Could you explain in more detail what you mean by "adjust the LR scheduler appropriately"?

Thanks again!

PiotrNawrot commented 1 year ago

Sorry for the possible confusion. I see two reasons to resume the pre-training process:

  1. You have a time limit for your experiment, can't fit the entire pre-training into it, and need to split it into more than one job. In that case, resuming is as easy as calling accelerator.load_state, which loads the LR scheduler and all other states correctly and resumes pre-training until it finishes after 2^16 steps.
  2. You have finished pre-training for the desired number of steps (2^16), but then want to take that checkpoint and continue pre-training it for another 2^16 steps. To do so, you need to update the LR scheduler with the new desired total number of steps (2^17) so that the LR decays properly during the second half of training; see the sketch below.
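
A rough sketch of what case 2 could look like. It assumes the schedule is rebuilt with transformers' get_cosine_schedule_with_warmup and then fast-forwarded past the steps already taken; nanoT5's actual scheduler setup may differ, and the warmup value is a placeholder:

```python
# Sketch only: rebuild the LR scheduler for the longer run, then fast-forward it
# past the steps already taken so the decay continues instead of restarting.
# The scheduler type and warmup below are assumptions, not nanoT5's exact config.
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder optimizer so the snippet stands alone; in practice reuse the
# optimizer restored by accelerator.load_state.
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)

steps_done = 2**16        # steps completed in the first pre-training run
new_total_steps = 2**17   # desired total number of steps after continuing

lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,            # placeholder: keep whatever warmup the first run used
    num_training_steps=new_total_steps,
)

for _ in range(steps_done):
    lr_scheduler.step()                 # advance the new schedule to the resume point
```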

Good luck, and ask if you have any further questions!