When training with DeepSpeed, model checkpoints are saved in DeepSpeed's sharded format. Our convert_deepspeed module converts these checkpoints to regular FP32 Sockeye parameter files automatically at the end of training or manually via the sockeye-convert-deepspeed CLI.
Newer versions of DeepSpeed change the way they track checkpoint formats. As a side effect, they do not correctly handle the format we use for Sockeye (ZeRO stage 1). This PR updates the convert_deepspeed module to convert checkpoints correctly without relying on DeepSpeed's automatic format detection.
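For readers unfamiliar with the sharded layout, the following is a minimal sketch of converting with a fixed, known layout rather than a detected one. It is not the actual convert_deepspeed code. It assumes that each rank holds one contiguous slice of a flattened FP32 parameter buffer, that the buffer may be padded at the end, and that the parameter shapes are known in order. The function name `merge_flat_partitions`, its arguments, and the use of plain Python lists in place of tensors are all hypothetical.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple


def merge_flat_partitions(
    partitions: Sequence[Sequence[float]],
    param_shapes: Dict[str, Tuple[int, ...]],
) -> Dict[str, List[float]]:
    """Rebuild per-parameter FP32 values from per-rank flat partitions.

    Hypothetical helper: assumes a ZeRO stage 1 style layout in which
    concatenating the shards in rank order yields one flat buffer.
    """
    # Concatenate the shards in rank order to recover the flat buffer.
    flat = [value for shard in partitions for value in shard]

    # The buffer must hold at least as many values as the shapes require.
    needed = sum(prod(shape) for shape in param_shapes.values())
    if len(flat) < needed:
        raise ValueError(f"expected at least {needed} values, got {len(flat)}")

    # Slice the flat buffer into parameters, in declaration order.
    params: Dict[str, List[float]] = {}
    offset = 0
    for name, shape in param_shapes.items():
        size = prod(shape)
        params[name] = flat[offset:offset + size]
        offset += size

    # Anything left after the last parameter is treated as padding and dropped.
    return params


# Two ranks, five real values, one trailing padding value.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0]]
shapes = {"weight": (2, 2), "bias": (1,)}
merged = merge_flat_partitions(shards, shapes)
```

Because the layout is stated explicitly by the caller, nothing in this sketch depends on inspecting the checkpoint to guess which ZeRO stage produced it.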
Pull Request Checklist
[x] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
[x] Unit tests pass (pytest)
[x] System tests pass (pytest test/system)
[x] Passed code style checking (./style-check.sh)
[x] You have considered writing a test
[x] Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
[x] Updated CHANGELOG.md
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.