Closed · byungdoh closed this 1 year ago
Hi, thanks very much for reporting this! I'll look into it and get back to you as soon as I'm able.
@haileyschoelkopf did you end up looking into this?
I have not yet unfortunately, I'll look at this tomorrow and report back!
Looking around the checkpointing code, it appears we do save the 0th checkpoint before any weight updates occur. Saving it after an update would have been the obvious failure mode that could cause this.
Continuing to investigate. Upon digging in, I find that DeepSpeed's checkpoint metadata for the NeoX-library checkpoints reports that all is well: for the EleutherAI/pythia-160m model, the step0 checkpoint reports `global_samples: 0` and `global_steps: 0`, while the step1 checkpoint reports `global_samples: 1024` and `global_steps: 1`.
I therefore suspect this is an artifact of LR warmup starting from 0: with a near-zero learning rate, the weights barely change on the first step. I'm looking into this further, but on a scan of a couple of parameters in one layer of the 160M model, many (but not all) of the individual floating-point parameters printed identically for step 1 and step 0. So during these very early warmup steps, some parameters will appear equal to the step 0 checkpoint even after multiple training steps.
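To illustrate the warmup effect hypothesized above, here is a minimal sketch of a linear LR warmup schedule that starts from 0; the function name and the specific `max_lr`/`warmup_steps` values are illustrative assumptions, not the actual Pythia training config:

```python
def warmup_lr(step, max_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to max_lr over warmup_steps."""
    if step >= warmup_steps:
        return max_lr
    return max_lr * step / warmup_steps

# At step 0 the learning rate is exactly 0, so an SGD-style update
# (param -= lr * grad) leaves every parameter unchanged; at step 1 the
# LR is tiny, so many parameters move by less than print precision.
print(warmup_lr(0, 6e-4, 1430))  # 0.0
print(warmup_lr(1, 6e-4, 1430))  # small but nonzero
```

Under such a schedule, a step1 checkpoint can look bitwise-close (or, at limited print precision, identical) to step0 even though a genuine optimizer step was taken.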
I'm therefore pretty confident I did in fact save and upload the correct early checkpoints.
Hope this answers your question @byungdoh !
(Aside:
Note that there was previously a bug in NeoX where, if a job was resumed, the "step0" checkpoint would be overwritten by the checkpoint of the step being resumed from; that bug was patched before these models were trained.)
Dear EleutherAI team,
I've noticed that the weights associated with the recently added "step0" and "step1" checkpoints are identical for all pythia models:
This yields something like the following for all eight pythia models:
Would it be possible for you to clarify whether these identical weights correspond to those from "step0" or "step1"? I've noticed that the conditional probabilities calculated using these weights aren't perfectly uniform, which leads me to believe they are actually the weights from "step1."
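The kind of weight comparison described above can be sketched as follows; the helper name is hypothetical, and plain Python dicts of floats stand in for real tensor state dicts (a real comparison would load both revisions and use `torch.equal` on each parameter):

```python
def checkpoints_identical(sd0, sd1):
    """Return True iff both (toy) state dicts have the same parameter
    names and exactly equal values for every parameter."""
    return sd0.keys() == sd1.keys() and all(sd0[k] == sd1[k] for k in sd0)

# Tiny stand-ins for the step0 and step1 checkpoint weights:
step0 = {"embed.weight": [0.01, -0.02], "lm_head.weight": [0.03, 0.04]}
step1_same = {"embed.weight": [0.01, -0.02], "lm_head.weight": [0.03, 0.04]}
step1_diff = {"embed.weight": [0.01, -0.02], "lm_head.weight": [0.03, 0.05]}

print(checkpoints_identical(step0, step1_same))  # True
print(checkpoints_identical(step0, step1_diff))  # False
```

If every parameter compares equal, the two uploaded checkpoints really do contain the same weights, which is what I observed across all eight models.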
Thanks! Byung-Doh