Hi! Thanks for your contribution, and great first issue!
Hey @schopra8,
Thanks for reporting the issue. Yes, we are aware of it :) and we are already looking into it. If we fix it next week, we will make a new release. cc @awaelchli
Unfortunately, this is coming as a series of fixes (https://github.com/Lightning-AI/litdata/pull/237) that aren't backward compatible (we won't be able to load old checkpoints, as the core logic has changed too much).
np! thanks for the heads up
Also wanted to flag --
I tried resuming on 1 node with N devices. Everything worked for the first couple hundred steps, but then I hit the same error. So it looks like there is a similar issue with DDP on 1 node as well.
Hey @schopra8, here are the release notes: https://github.com/Lightning-AI/litdata/releases/tag/v0.2.17.
Would you mind trying again with the latest version, 0.2.17?
Old checkpoints won't work unfortunately.
Awesome! I'll try it in the next 1-2 days and report back with my results.
Thanks @schopra8.
@tchaton Tested this and it works! Closing this issue, since the DDP problem is solved.
We're hitting another issue (https://github.com/Lightning-AI/litdata/issues/263) with resuming training on a new dataset: we want to preserve optimizer states, etc., when we continue training. Any guidance would be much appreciated!
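For context, the state we want to keep is already serialized inside the Lightning checkpoint; a quick inspection sketch (checkpoint path is a placeholder):

```python
import torch

# A Lightning checkpoint bundles optimizer state alongside the weights.
ckpt = torch.load("checkpoints/last.ckpt", map_location="cpu")
print(ckpt.keys())  # includes "state_dict", "optimizer_states", "lr_schedulers", ...

weights = ckpt["state_dict"]                  # model parameters
optimizer_states = ckpt["optimizer_states"]   # one state dict per optimizer
```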
🐛 Bug
We trained a model for several epochs on multiple nodes, and we wanted to continue training with PyTorch Lightning and LitData.
✅ When we resume training on a single device, resumption works as expected.
✅ When we resume training on a single node with N devices, resumption works as expected.
❌ When we resume training on multiple nodes with N devices, resumption fails.
To Reproduce
Run `trainer.fit` with an existing checkpoint with DDP on multiple devices.
Stack trace:
Code sample
I've scrubbed my code below --
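A minimal sketch of the setup (the dataset path, batch field names, and device counts are representative placeholders, not our real values):

```python
import lightning as L
import torch
from litdata import StreamingDataset, StreamingDataLoader


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Batch field names depend on how the dataset was optimized
        # (placeholders here).
        x, y = batch["x"], batch["y"]
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Streaming dataset produced by litdata (path is a placeholder).
dataset = StreamingDataset("s3://my-bucket/optimized-dataset")
dataloader = StreamingDataLoader(dataset, batch_size=64)

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=4,       # the failure only shows up in the multi-node case
    strategy="ddp",
    max_epochs=10,
)

# Resuming from the existing checkpoint is the step that fails.
trainer.fit(LitModel(), dataloader, ckpt_path="checkpoints/last.ckpt")
```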
Expected behavior
Training resumes on multiple nodes, just as it does on a single device or a single node.
Environment
- Installed via (`conda`, `pip`, source): poetry