
bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2


No-ZeRO reshaping #289

Open Muennighoff opened 2 years ago

Muennighoff commented 2 years ago

Should be merged first: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/239

This PR only adds tools/convert_checkpoint/deepspeed_to_deepspeed_nozero.py

Our small models are trained without ZeRO. This script enables reshaping their checkpoints to a different pipeline-parallel (PP) and tensor-parallel (TP) layout.

Tests:

Loss continues where it left off after reshaping from
- PP=4, TP=4 -> PP=2, TP=2 👍
- PP=4, TP=4 -> PP=1, TP=1 👍
- PP=2, TP=1 -> PP=1, TP=1 👍
Checkpoint size stays the same 👍
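
The checkpoint size check above could be reproduced with something like the sketch below. It is not part of this PR: the helper names `checkpoint_size_bytes` and `sizes_match` and the 1% tolerance are assumptions made for illustration. A small tolerance is used because the number of shard files changes with PP and TP, so per-file overhead may differ slightly.

```python
import os


def checkpoint_size_bytes(checkpoint_dir):
    """Return the total size in bytes of all regular files under checkpoint_dir."""
    total = 0
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks so a linked "latest" entry is not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def sizes_match(original_dir, reshaped_dir, rel_tol=0.01):
    """Check that two checkpoint directories have roughly the same total size.

    rel_tol is a hypothetical tolerance (1% by default), not a value from the PR.
    """
    original = checkpoint_size_bytes(original_dir)
    reshaped = checkpoint_size_bytes(reshaped_dir)
    if original == 0:
        return reshaped == 0
    return abs(original - reshaped) / original <= rel_tol
```

Usage would be along the lines of `sizes_match("checkpoints/pp4_tp4", "checkpoints/pp2_tp2")`, where both paths are placeholders for the original and reshaped checkpoint directories.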

Notes:

I have not applied black formatting etc., as this is not a production codebase. Let me know if that's not okay and the code should be cleaner!