awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Update DeepSpeed checkpoint conversion to support newer DeepSpeed versions #1071

Closed (mjdenkowski closed 2 years ago)

mjdenkowski commented 2 years ago

When training with DeepSpeed, model checkpoints are saved in DeepSpeed's sharded format. Our convert_deepspeed module converts these checkpoints to regular FP32 Sockeye parameter files, either automatically at the end of training or manually via the sockeye-convert-deepspeed CLI.

Newer versions of DeepSpeed change how they track checkpoint formats. As a side effect, they no longer correctly handle the format Sockeye uses (ZeRO stage 1). This PR updates the convert_deepspeed module to convert checkpoints correctly without relying on DeepSpeed's automatic format detection.
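
For readers unfamiliar with the sharded format, here is a rough sketch of the general idea behind consolidating ZeRO stage 1 shards: each rank holds a slice of one flat FP32 buffer, so the slices are concatenated in rank order and then cut back into individual parameters by shape. This is an illustration only, not the code in this PR or DeepSpeed's API. The helper name `merge_stage1_partitions`, the plain-list inputs, and the assumption that trailing values are alignment padding are all hypothetical.

```python
from math import prod


def merge_stage1_partitions(rank_partitions, param_shapes):
    """Rebuild full FP32 parameters from per-rank flat partitions.

    Hypothetical helper for illustration; not part of Sockeye or DeepSpeed.
    rank_partitions: list of flat lists of floats, one per rank, in rank order.
    param_shapes: dict mapping parameter name -> shape tuple, in the order
        the parameters were flattened.
    """
    # Concatenate the shards in rank order to recover one flat buffer.
    flat = [value for partition in rank_partitions for value in partition]

    params = {}
    offset = 0
    for name, shape in param_shapes.items():
        numel = prod(shape)
        if offset + numel > len(flat):
            raise ValueError(f"Not enough values to fill parameter '{name}'")
        # Values are kept flat here; a real converter would reshape to `shape`.
        params[name] = flat[offset:offset + numel]
        offset += numel
    # Any values left past `offset` are assumed to be alignment padding.
    return params


# Two ranks, each holding half of a padded flat buffer (made-up values).
shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0]]
shapes = {"weight": (2, 2), "bias": (1,)}
merged = merge_stage1_partitions(shards, shapes)
print(merged)
```

Because the merge path is chosen explicitly by the caller in this sketch, nothing depends on inspecting the checkpoint to guess its ZeRO stage, which is the property this PR is after.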

Pull Request Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.