Is there existing code to resume training from a specific checkpoint?

EleutherAI / pythia

Closed: javirandor closed this issue 5 months ago

javirandor commented 5 months ago

Are there any official guidelines for resuming the training from a specific checkpoint?

Taking a look at the gpt-neox repository, I guess we need to set the "load" parameter in the config.
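To make that guess concrete, here is a minimal sketch of what such an override could look like. Only the "load" key comes from the question itself; the checkpoint path, the helper name `make_resume_overrides`, and the idea of writing the override as JSON-style text (assuming the config file accepts that syntax) are illustrative assumptions, not confirmed gpt-neox behavior.

```python
import json


def make_resume_overrides(checkpoint_dir):
    """Hypothetical helper: build a config fragment for resuming training.

    Only the "load" key is taken from the question above. Any other keys
    that resuming might require are deliberately left out because they
    are not confirmed here.
    """
    return {"load": checkpoint_dir}


# "/path/to/checkpoints" is a placeholder, not a real location.
overrides = make_resume_overrides("/path/to/checkpoints")

# Serialize as JSON-style text (assumption: the config format accepts it).
config_text = json.dumps(overrides, indent=2)
print(config_text)
```

Whether this key alone is enough to resume from a particular step, rather than from the most recent checkpoint in that directory, is not something this sketch settles.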

But I assume there is no 1:1 mapping between data chunks and checkpoints since there are 133 data splits and 143000 steps.
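The arithmetic behind that assumption can be checked quickly. The sketch below uses only the two numbers above; the helper `split_for_step` is hypothetical and assumes every split covers an equal share of steps and that splits are consumed in order, which may not hold for the real data.

```python
TOTAL_STEPS = 143_000  # training steps, from the question above
NUM_SPLITS = 133       # data splits, from the question above

# If the mapping were 1:1 (or any whole number of steps per split),
# the remainder would be zero.
whole_steps, remainder = divmod(TOTAL_STEPS, NUM_SPLITS)
print(f"{whole_steps} whole steps per split, {remainder} steps left over")


def split_for_step(step, total_steps=TOTAL_STEPS, num_splits=NUM_SPLITS):
    """Hypothetical helper: index of the split that would contain `step`.

    ASSUMPTION: splits are equal-sized and read sequentially. This is an
    illustration, not a description of the actual training data layout.
    """
    if not 0 <= step < total_steps:
        raise ValueError(f"step must be in [0, {total_steps})")
    # Integer arithmetic avoids floating-point rounding at boundaries.
    return step * num_splits // total_steps


print(split_for_step(0), split_for_step(TOTAL_STEPS - 1))
```

Because the division leaves a remainder, split boundaries would fall partway through a step even under these idealized assumptions, so a checkpoint step alone does not identify a clean position in the data.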

Are there any existing resources to ensure our setup faithfully reproduces your training?

javirandor commented 5 months ago

I solved this by inspecting things manually. I will try to provide reproducible instructions soon!