NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] How can a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism) be loaded in a different parallel configuration? #1232

Closed polisettyvarma closed 1 month ago

polisettyvarma commented 1 month ago

How can a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism) be loaded in a different parallel configuration?


Based on this doc - https://github.com/NVIDIA/Megatron-LM/blob/main/docs/source/api-guide/dist_checkpointing.rst - there are some conflicting statements:

  1. zarr is not the default format, as mentioned in the doc.
  2. I tried an 8-GPU config, src: tp2pp2dp2, dst: tp2pp1dp4, and it didn't work.

Can you provide an end-to-end working example that showcases this feature? Thanks.
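
For reference, this is roughly the save/load pattern I am following, adapted from the Megatron-Core quick-start linked above. The function names and paths are my own placeholders, and I assume torch.distributed and the Megatron-Core parallel state are already initialized for the *target* TP/PP sizes before loading:

```python
# Minimal sketch based on the Megatron-Core quick-start.
# Assumes `gpt_model` is a Megatron-Core model (e.g. GPTModel) built
# under the currently initialized tensor/pipeline parallel sizes.
from megatron.core import dist_checkpointing


def save_distributed_checkpoint(checkpoint_path, gpt_model):
    # Each rank contributes only its own shards; the on-disk layout
    # is meant to be independent of the parallel configuration.
    sharded_state_dict = gpt_model.sharded_state_dict(prefix='')
    dist_checkpointing.save(sharded_state_dict=sharded_state_dict,
                            checkpoint_dir=checkpoint_path)


def load_distributed_checkpoint(checkpoint_path, gpt_model):
    # The sharded_state_dict built under the *new* parallel configuration
    # tells dist_checkpointing which slices each rank needs to read.
    sharded_state_dict = gpt_model.sharded_state_dict(prefix='')
    checkpoint = dist_checkpointing.load(sharded_state_dict=sharded_state_dict,
                                         checkpoint_dir=checkpoint_path)
    gpt_model.load_state_dict(checkpoint)
    return gpt_model
```

My expectation is that a checkpoint saved this way under tp2pp2dp2 can be loaded the same way after re-initializing the parallel state as tp2pp1dp4.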

wplf commented 1 month ago

You can check out the converter in the tools folder, tools/checkpoint/convert.py. This may help you.
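
For example, resharding from tp2pp2 to tp2pp1 would look roughly like this (flag names can differ between versions, so please check `python tools/checkpoint/convert.py --help`; the paths below are placeholders):

```bash
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader mcore \
    --saver mcore \
    --load-dir /path/to/ckpt_tp2_pp2 \
    --save-dir /path/to/ckpt_tp2_pp1 \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 1
```

The converter re-saves the checkpoint with the target tensor/pipeline parallel sizes; data parallelism does not need converting, since data-parallel ranks hold replicas of the same weights.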