Distributed file systems are not always synchronous. For example, a deleted file may still appear to be present until the file system synchronizes. This can cause FileExistsErrors during training when updating symlinks to the best/current parameter files.
This pull request adds and applies a fault-tolerant symlink wrapper that makes multiple attempts before raising an OSError. By default, 6 attempts are made over the course of 63 seconds. This should prevent brief file system inconsistencies from causing errors during parameter file symlink updates.
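For readers who want a picture of the retry idea, here is a minimal sketch. It is not the wrapper added by this pull request: the name `symlink_with_retries`, its parameters, and the doubling wait between attempts are all assumptions made for illustration, and the description above does not state the actual retry schedule.

```python
import os
import time
from typing import Callable


def symlink_with_retries(src: str,
                         dst: str,
                         max_attempts: int = 6,
                         first_wait: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep) -> None:
    """Hypothetical sketch: retry os.symlink while dst still appears to exist.

    The doubling wait (first_wait, 2 * first_wait, ...) is an assumed
    schedule, not necessarily the one used by the real wrapper.
    """
    wait = first_wait
    for attempt in range(1, max_attempts + 1):
        try:
            os.symlink(src, dst)
            return
        except FileExistsError:
            # FileExistsError is a subclass of OSError. On the final
            # attempt, give up and let it propagate to the caller.
            if attempt == max_attempts:
                raise
            # Otherwise assume the file system is still synchronizing
            # a recent delete, wait, and try again.
            sleep(wait)
            wait *= 2
```

The `sleep` argument exists only so the waiting can be replaced in tests; a caller would normally delete the old link and then call `symlink_with_retries(src, dst)` in place of `os.symlink(src, dst)`.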
Unrelated: The latest version of deepspeed is not compatible with our current usage. The requirement is updated to deepspeed==0.6.5.
Pull Request Checklist
[x] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
[x] Unit tests pass (pytest)
[x] System tests pass (pytest test/system)
[x] Passed code style checking (./style-check.sh)
[x] You have considered writing a test
[x] Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
[x] Updated CHANGELOG.md
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.