Closed javirandor closed 5 months ago
Are there any official guidelines for resuming the training from a specific checkpoint?
Taking a look at the gpt-neox repository, I guess we need to set the "load" parameter in the config.
But I assume there is no 1:1 mapping between data chunks and checkpoints since there are 133 data splits and 143000 steps.
Are there any existing resources to ensure our setup faithfully reproduces your training?
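For reference, resuming in gpt-neox is typically done by pointing `"load"` at the checkpoint directory in the training config. A minimal sketch (paths are hypothetical, and the rest of the config must match the original run):

```yaml
{
  # Directory containing the global_step* checkpoint folders
  "load": "/path/to/pythia-checkpoints",
  # ... all other parameters identical to the original training config
}
```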
I solved this by manually inspecting things. I will try to provide reproducible instructions soon!
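One piece of the mapping is just arithmetic: assuming the reported Pythia settings of 1024 sequences per step and 2048 tokens per sequence (hedged here as assumptions, not confirmed by this thread), the token offset a checkpoint corresponds to is:

```python
# Sketch: map a checkpoint step to its offset in the (shuffled) token stream.
# Assumed training settings -- verify against the actual config before relying on this.
SEQS_PER_STEP = 1024   # assumed global batch size in sequences
SEQ_LEN = 2048         # assumed sequence length in tokens

def tokens_seen(step: int) -> int:
    """Total tokens consumed before resuming at `step`."""
    return step * SEQS_PER_STEP * SEQ_LEN

print(tokens_seen(143_000))  # 299892736000, i.e. ~300B tokens for the full run
```

This offset is what you would line up against the concatenated data splits; the split boundaries themselves still need to be inspected, since 133 splits do not divide evenly into 143000 steps.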