aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
200 stars 83 forks

FSDP checkpointing and auto-resume not working due to `FileNotFoundError [incomplete checkpoint]` #190

Closed cfregly closed 8 months ago

cfregly commented 8 months ago

FSDP checkpointing and auto-resume not working due to FileNotFoundError [incomplete checkpoint]

cfregly commented 8 months ago

This issue can happen when an EFA hardware failure occurs, or when the user runs `scancel` during checkpointing.

Consider adding some application-level code to delete the most recent incomplete checkpoint before trying to resume from the last known-good checkpoint.
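A minimal sketch of what such application-level cleanup might look like. It rests on assumptions that are not taken from the repository: each checkpoint is a directory named `step_<N>` under one root, and the training script writes an empty `.complete` marker file as the very last step of a successful save. The names `mark_complete`, `latest_good_checkpoint`, `CKPT_PATTERN` and `SENTINEL` are all hypothetical.

```python
import re
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical conventions (assumptions, not the repository's actual layout).
CKPT_PATTERN = re.compile(r"^step_(\d+)$")
SENTINEL = ".complete"


def mark_complete(ckpt_dir: Path) -> None:
    """Write the marker file; call this only after the save has fully finished."""
    (ckpt_dir / SENTINEL).touch()


def latest_good_checkpoint(root: Path) -> Optional[Path]:
    """Return the newest complete checkpoint directory, or None if there is none.

    Any checkpoint directory newer than the returned one that lacks the
    marker file is treated as incomplete and deleted.
    """
    if not root.is_dir():
        return None

    candidates = []
    for child in root.iterdir():
        match = CKPT_PATTERN.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))

    # Walk from the newest step to the oldest.
    for _, ckpt_dir in sorted(candidates, key=lambda c: c[0], reverse=True):
        if (ckpt_dir / SENTINEL).exists():
            return ckpt_dir
        # No marker: the save was interrupted, so remove the partial directory.
        shutil.rmtree(ckpt_dir)
    return None
```

In a multi-node job, the cleanup would presumably run on a single rank, with the other ranks waiting on a barrier before they load the returned checkpoint, so that no rank reads a directory while it is being deleted.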