cfregly closed 8 months ago
This issue can occur when an EFA hardware failure happens, or the user runs scancel,
while a checkpoint is being written. The checkpoint is left incomplete, and the next auto-resume fails on it.
Consider adding some application-level code that deletes the most recent incomplete checkpoint before resuming from the last known-good checkpoint.
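A minimal sketch of what that cleanup might look like, not the fix used in this repo. It assumes a hypothetical layout where each checkpoint is a directory named `step_<N>`, and where the training script writes a hypothetical `.complete` marker file as the very last step of a successful save. Neither the naming nor the marker comes from this issue, so adapt both to however your checkpoints are actually written.

```python
import re
import shutil
from pathlib import Path
from typing import Optional

# Assumption: the training script writes this marker file last, after all
# shards of a checkpoint are on disk. The name is hypothetical.
COMPLETE_MARKER = ".complete"

# Assumption: checkpoint directories are named step_<N>. Also hypothetical.
STEP_DIR = re.compile(r"^step_(\d+)$")


def find_resume_checkpoint(root: str) -> Optional[Path]:
    """Return the newest complete checkpoint directory under root.

    Any incomplete checkpoint that is newer than the returned one is deleted,
    so a later resume cannot pick it up. Returns None if no complete
    checkpoint exists.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return None

    candidates = []
    for child in root_path.iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))

    # Walk from the newest step to the oldest.
    for _, path in sorted(candidates, key=lambda c: c[0], reverse=True):
        if (path / COMPLETE_MARKER).exists():
            return path
        # No marker: the save was interrupted (e.g. by scancel), so remove it.
        shutil.rmtree(path, ignore_errors=True)

    return None
```

In a multi-rank job you would probably want only one rank to run this cleanup before the others try to load, since several processes deleting the same directory at once is likely to race.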
FSDP checkpointing and auto-resume not working due to
FileNotFoundError [incomplete checkpoint]