awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Add Support for DeepSpeed ZeRO Stage 1 #1059

Closed · mjdenkowski closed this 2 years ago

mjdenkowski commented 2 years ago

This PR adds training support for DeepSpeed with ZeRO stage 1.

In Sockeye's DeepSpeed mode, checkpointing works as follows: DeepSpeed saves checkpoints using its own format, and the new convert_deepspeed module recovers standard Sockeye parameter files from these checkpoints. Checkpoints are converted automatically at the end of training and can also be converted manually using the CLI.
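For illustration, here is a minimal sketch of what such a conversion can look like using DeepSpeed's own zero_to_fp32 utilities. The paths are hypothetical and the actual convert_deepspeed module may work differently in detail:

```python
# Minimal sketch (hypothetical paths): consolidate a sharded DeepSpeed ZeRO
# checkpoint into a single FP32 state dict and save it as a standard
# PyTorch parameter file. Sockeye's convert_deepspeed module may differ.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Gather the partitioned parameter/optimizer shards into one FP32 state dict
state_dict = get_fp32_state_dict_from_zero_checkpoint("model_dir/checkpoint")

# Save in the standard torch format used for parameter files
torch.save(state_dict, "model_dir/params.converted")
```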

DeepSpeed is an optional dependency (like Apex) and these changes are backward compatible. Training results with DeepSpeed ZeRO stage 1 are identical for this PR and the deepspeed branch. Training results for standard FP32 mode, PyTorch AMP, and Apex AMP are identical for this PR and the main branch.
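As a sketch of the optional-dependency pattern (the flag and helper names below are hypothetical, not Sockeye's exact code):

```python
# Sketch of guarding an optional dependency: import deepspeed only if it is
# installed, and fail with an actionable message when DeepSpeed mode is
# requested without the package. Names here are illustrative only.
try:
    import deepspeed  # noqa: F401
    DEEPSPEED_AVAILABLE = True
except ImportError:
    DEEPSPEED_AVAILABLE = False


def check_deepspeed_requested(use_deepspeed: bool) -> None:
    # Hypothetical helper: raise early instead of failing mid-training.
    if use_deepspeed and not DEEPSPEED_AVAILABLE:
        raise RuntimeError(
            "DeepSpeed mode requires the 'deepspeed' package: pip install deepspeed"
        )
```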

Pull Request Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mjdenkowski commented 2 years ago

> Great addition, thanks!
>
> I am wondering if there is a good way to test some of the added functionality in GitHub Actions. I guess it's difficult to install deepspeed in the workflows. Would it be possible to test the checkpoint conversion module somehow?

That's a good idea. It's now possible to pip install deepspeed on CPU-only machines. Training won't run (GPUs required), but some utility functions will.

I've added a test that runs checkpoint conversion with existing model files if the deepspeed package is installed. I've also updated our workflows to install deepspeed, which should be compatible with all of our environments. We can see if anything breaks.
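A test gated on the optional package typically looks like this minimal sketch (the test body and import are illustrative, not the PR's actual test):

```python
# Hypothetical test sketch: run checkpoint-conversion tests only when the
# optional deepspeed package is installed.
import pytest

# Skips this whole test module at collection time if deepspeed is missing
deepspeed = pytest.importorskip("deepspeed")


def test_convert_deepspeed_checkpoint():
    # The real test converts existing model files; this only shows the
    # gating pattern around the module added in this PR.
    from sockeye import convert_deepspeed

    assert convert_deepspeed is not None
```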

mjdenkowski commented 2 years ago

Our automatic tests now cover installing deepspeed and converting a DeepSpeed checkpoint.