Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0
365 stars 42 forks

Make `optimize` continue from last checkpoint after crash #137

Closed cgebbe closed 4 months ago

cgebbe commented 5 months ago

When running optimize, my process somehow crashed after 4h (was estimated to take 10h). Now I have to restart it from scratch. Could you add a checkpointing feature such that it automatically continues with the last chunk?

tchaton commented 5 months ago

Hey cgebbe,

Do you know why it crashed?

cgebbe commented 5 months ago

It was an out-of-memory error, but I don't have the logs anymore.

Found it a bit strange that it only happened after several hours. I didn't have other tasks running.

tchaton commented 5 months ago

Hey @cgebbe.

We can support this. The writer keeps track of the chunk info here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/writer.py#L253 and we already have some logic to merge the index JSON file: https://github.com/Lightning-AI/litdata/blob/26bf6b2553a0ec72ab77418988e33e4d639f6f85/src/litdata/streaming/writer.py#L395.
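The merge step can be illustrated with a small sketch. This is not the writer's actual merge code linked above; it assumes a simplified index.json layout where each part directory holds a `"chunks"` list plus a shared `"config"` dict, and `merge_index_files` is a hypothetical helper name:

```python
import json
from pathlib import Path


def merge_index_files(part_dirs, output_dir):
    """Combine per-part index.json files into a single index.json.

    Sketch only: assumes each part's index.json has a "chunks" list
    and a "config" dict, which is a simplification of the real format.
    """
    merged = {"chunks": [], "config": None}
    for part in part_dirs:
        index = json.loads((Path(part) / "index.json").read_text())
        # Concatenate the chunk entries from every part, in order.
        merged["chunks"].extend(index["chunks"])
        if merged["config"] is None:
            merged["config"] = index["config"]
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "index.json").write_text(json.dumps(merged))
    return merged
```

With that in place, restarting after a crash would amount to re-running only the missing parts and merging the surviving index files at the end.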

In reality, you could even process your dataset chunk by chunk and just combine them at the end.

Would you be interested in trying to contribute this feature?
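The resume-after-crash behavior requested above can be sketched as a generic checkpointed loop. This is a minimal pure-Python illustration, not litdata's `optimize` API; `optimize_with_resume` and the `checkpoint.json` file are hypothetical names. It records each completed chunk so a restarted run skips work already done:

```python
import json
from pathlib import Path


def optimize_with_resume(items, process_chunk, output_dir, chunk_size=1000):
    """Process `items` in fixed-size chunks, checkpointing progress.

    Hypothetical sketch: after every chunk, the set of finished chunk
    ids is persisted to checkpoint.json, so a crashed run can restart
    and continue from the last completed chunk instead of from scratch.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    ckpt = out / "checkpoint.json"
    # Load the ids of chunks finished by a previous (crashed) run, if any.
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()

    for start in range(0, len(items), chunk_size):
        chunk_id = start // chunk_size
        if chunk_id in done:
            continue  # already processed before the crash
        process_chunk(chunk_id, items[start:start + chunk_size])
        done.add(chunk_id)
        # Persist after each chunk so at most one chunk of work is lost.
        ckpt.write_text(json.dumps(sorted(done)))
    return sorted(done)
```

A restarted run with the same `output_dir` reads `checkpoint.json` first, so only the chunks that never completed are reprocessed.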