Great addition, thanks!
I am wondering if there is a good way to test some of the added functionality in GitHub Actions. I guess it's difficult to install deepspeed in the workflows. Would it be possible to test the checkpoint conversion module somehow?
That's a good idea. It's now possible to `pip install deepspeed` on CPU-only machines. Training won't run (GPUs are required), but some utility functions will. I've added a test that runs checkpoint conversion with existing model files if the `deepspeed` package is installed. I've also updated our workflows to install `deepspeed`, which should be compatible with all of our environments. We can see if anything breaks.
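As an illustration of how such a test can be gated on the optional dependency, pytest's `importorskip` skips cleanly when `deepspeed` is missing; the test body below is a placeholder sketch, not the actual test added in this PR.

```python
import pytest

# Skip everything in this module if the optional deepspeed package is not
# installed (e.g. on a CI runner that never ran `pip install deepspeed`).
deepspeed = pytest.importorskip("deepspeed")


def test_deepspeed_available():
    # Reaching this point means the import above succeeded on this machine.
    assert hasattr(deepspeed, "__version__")
```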
Our automatic tests now cover installing `deepspeed` and converting a DeepSpeed checkpoint.
This PR adds training support for DeepSpeed with ZeRO stage 1:

- Install DeepSpeed: `pip install deepspeed`
- Launch training with the DeepSpeed launcher: `deepspeed --no_python ... sockeye-train ...`
- Optionally enable mixed precision with `--deepspeed-fp16` or `--deepspeed-bf16`

Launching with DeepSpeed adds the `--local_rank` option to Sockeye, which activates DeepSpeed mode.
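For illustration, the launcher/worker handshake mentioned above can be detected with ordinary argument parsing; this is a minimal sketch assuming plain `argparse`, not Sockeye's actual argument handling.

```python
import argparse

# The DeepSpeed launcher starts one worker process per GPU and appends
# --local_rank=<device index> to each worker's command line.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=None)
args, _unknown = parser.parse_known_args()

# Presence of --local_rank signals that this process was started by the
# DeepSpeed launcher, which is the cue to switch into DeepSpeed mode.
deepspeed_mode = args.local_rank is not None
```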
Sockeye's DeepSpeed mode does the following: DeepSpeed saves checkpoints using its own format, so the new `convert_deepspeed` module recovers standard Sockeye parameter files from these checkpoints. At the end of training, checkpoints are converted automatically; they can also be converted manually using the CLI.

DeepSpeed is an optional dependency (like Apex) and these changes are backward compatible. Training results with DeepSpeed ZeRO stage 1 are identical for this PR and the `deepspeed` branch. Training results for standard FP32 mode, PyTorch AMP, and Apex AMP are identical for this PR and the `main` branch.
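For background on what such a conversion involves, DeepSpeed ships a public `zero_to_fp32` utility that reassembles a full FP32 state dict from a ZeRO checkpoint directory. The sketch below uses that utility directly; it is not Sockeye's `convert_deepspeed` implementation, and the paths are placeholders.

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Placeholder paths: a DeepSpeed checkpoint directory written during training
# and the file where the recovered parameters should be stored.
checkpoint_dir = "model_output/checkpoint"
output_file = "model_output/params.recovered"

# Reassemble the partitioned ZeRO shards into a single FP32 state dict
# keyed by parameter name.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# Save in standard PyTorch format; Sockeye's own converter would write its
# usual parameter file layout instead.
torch.save(state_dict, output_file)
```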
Pull Request Checklist

- Unit tests pass (`pytest`)
- System tests pass (`pytest test/system`)
- Passed code style checking (`./style-check.sh`)
- Updated major/minor version in `sockeye/__init__.py`. Major version bump if this is a backwards incompatible change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.