awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Use fault tolerant symlink during training to handle temporary file system inconsistencies. #1093

Closed by mjdenkowski 1 year ago

mjdenkowski commented 1 year ago

Distributed file systems are not always synchronous. For example, a deleted file may still appear to be present until the file system synchronizes. During training, this can raise a FileExistsError when the symlinks to the best/current parameter files are updated.

This pull request adds and applies a fault-tolerant symlink wrapper that retries several times before raising an OSError. By default, 6 attempts are made over the course of 63 seconds. This should prevent brief file system inconsistencies from causing errors when the parameter file symlinks are updated.
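
For readers who want a concrete picture of the retry behavior, here is a minimal sketch of such a wrapper. It is an illustration and not the code from this pull request: the function name, the `max_retries` and `sleep` parameters, and the exponential backoff schedule are all assumptions. The schedule (waits of 1, 2, 4, 8, 16, and 32 seconds) was chosen only because it sums to the 63 seconds mentioned above; the actual implementation may count attempts and waits differently.

```python
import logging
import os
import time
from typing import Callable

logger = logging.getLogger(__name__)


def fault_tolerant_symlink(src: str, dst: str, max_retries: int = 6,
                           sleep: Callable[[float], None] = time.sleep) -> None:
    """
    Create a symlink at dst pointing to src. If dst still appears to exist
    (FileExistsError), wait and retry with exponential backoff.

    Hypothetical sketch: names, defaults, and schedule are assumptions.
    With max_retries=6 the waits are 1+2+4+8+16+32 = 63 seconds in total.
    """
    for retry in range(max_retries + 1):
        try:
            os.symlink(src, dst)
            return
        except FileExistsError:
            if retry == max_retries:
                break  # out of retries, fall through to the final error
            wait_seconds = 2 ** retry
            logger.warning("Symlink target '%s' still exists; retrying in %d seconds.",
                           dst, wait_seconds)
            sleep(wait_seconds)
    raise OSError(f"Unable to create symlink '{dst}' -> '{src}' after {max_retries} retries.")
```

The `sleep` argument exists only so the backoff can be exercised without real waiting. A caller would typically remove the old link first and then call `fault_tolerant_symlink(new_params_path, link_path)`, where both paths are placeholders for the parameter file and its best/current link.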

Unrelated: The latest version of deepspeed is not compatible with our current usage. The requirement is updated to deepspeed==0.6.5.

Pull Request Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.