Update DeepSpeed checkpoint conversion to support newer DeepSpeed versions

When training with DeepSpeed, model checkpoints are saved in DeepSpeed's sharded format. Our convert_deepspeed module converts these checkpoints to regular FP32 Sockeye parameter files automatically at the end of training or manually via the sockeye-convert-deepspeed CLI.

Newer versions of DeepSpeed change the way they track checkpoint formats. As a side effect, they do not correctly handle the format we use for Sockeye (ZeRO stage 1). This PR updates the convert_deepspeed module to convert checkpoints correctly without relying on DeepSpeed's automatic format detection.
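
For readers unfamiliar with the sharded format, here is a minimal sketch of the general idea behind merging ZeRO stage 1 shards into full FP32 parameters. It is not the code in convert_deepspeed and does not call any DeepSpeed API. It assumes a simplified layout in which each rank stores one flat slice of the concatenated FP32 weights, with any alignment padding at the very end. The function name merge_flat_partitions and the parameter names in the example are hypothetical.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple


def merge_flat_partitions(
    partitions: Sequence[Sequence[float]],
    param_shapes: Dict[str, Tuple[int, ...]],
) -> Dict[str, List[float]]:
    """Rebuild per-parameter FP32 values from rank-ordered flat partitions.

    Assumption (simplified): partitions[i] is the flat slice held by rank i,
    and param_shapes lists parameters in the order they were flattened.
    """
    # Join the per-rank slices back into one flat buffer.
    flat = [value for part in partitions for value in part]

    params: Dict[str, List[float]] = {}
    offset = 0
    for name, shape in param_shapes.items():
        numel = prod(shape)
        if offset + numel > len(flat):
            raise ValueError(f"Not enough values to fill parameter '{name}'")
        # Values stay flat here; a real converter would reshape to `shape`.
        params[name] = flat[offset:offset + numel]
        offset += numel

    # Anything left after the last parameter is treated as alignment padding.
    return params


# Two hypothetical ranks, 9 real values plus 1 padding value on the last rank.
shapes = {"embed.weight": (2, 3), "out.bias": (3,)}
rank0 = [0.0, 1.0, 2.0, 3.0, 4.0]
rank1 = [5.0, 6.0, 7.0, 8.0, 0.0]
merged = merge_flat_partitions([rank0, rank1], shapes)
```

Because the shapes and partition order are supplied explicitly, a merge of this kind does not need to detect the checkpoint format automatically.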

Pull Request Checklist

[x] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
[x] Unit tests pass (pytest)
[x] System tests pass (pytest test/system)
[x] Passed code style checking (./style-check.sh)
[x] You have considered writing a test
[x] Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
[x] Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

awslabs / sockeye

Update DeepSpeed checkpoint conversion to support newer DeepSpeed versions #1071
