I have a PR open on Microsoft's DeepSpeed page that parallelizes the task of writing per-layer checkpoint files across data parallel instances:
https://github.com/microsoft/DeepSpeed/pull/1419
On my system, I found that this reduces checkpoint cost -- much of the time seems to be spent processing data structures in `torch.save()` rather than actually writing the bytes to disk. I'm curious to know whether the bigscience training runs might benefit (or would have benefitted) from a change like this. I also have a follow-on PR that improves checkpointing further, but it requires this first PR.

I have updated the PR a few times over the past 9 months to keep up with changes on the main branch, but it has merge conflicts again. I've lost track of the precise version of DeepSpeed being used in the bigscience training runs.
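To illustrate the core idea of the PR, here is a minimal, hypothetical sketch: rather than one rank serializing every per-layer checkpoint file, the layers are partitioned round-robin across the data-parallel ranks so that the `torch.save()` calls run concurrently. The function name and scheme below are illustrative only, not DeepSpeed's actual API.

```python
def assign_layers_to_ranks(num_layers: int, dp_world_size: int) -> dict:
    """Map each data-parallel rank to the layer indices it should write.

    Round-robin assignment: layer i goes to rank (i % dp_world_size),
    so the serialization work is spread roughly evenly across ranks.
    """
    assignment = {rank: [] for rank in range(dp_world_size)}
    for layer in range(num_layers):
        assignment[layer % dp_world_size].append(layer)
    return assignment


# Each rank would then save only its own layers, e.g. (pseudocode):
# for layer in assign_layers_to_ranks(num_layers, dp_size)[my_rank]:
#     torch.save(layer_state[layer], f"layer_{layer:02d}-model_states.pt")
```

Since every data-parallel replica holds an identical copy of the layer weights, any rank can write any layer; the win is that the per-layer `torch.save()` processing overhead is parallelized instead of serialized on a single writer.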
Would someone be willing to try this out?
First of all, do the changes in this PR apply cleanly to the bigscience DeepSpeed version?
If not, would you please point me to the version that is being used?
Thanks.