How can a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism) be loaded in a different parallel configuration?
Based on this doc - https://github.com/NVIDIA/Megatron-LM/blob/main/docs/source/api-guide/dist_checkpointing.rst - there appear to be some conflicting statements.
Can you provide a working end-to-end example that showcases this feature? Thanks.
You can check out the conversion tool at tools/checkpoint/convert.py. This may help you.
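For illustration, an invocation along these lines re-partitions a GPT checkpoint from one parallel layout to another. The flag names follow tools/checkpoint/convert.py at the time of writing and may differ between Megatron-LM versions; the paths and target parallel sizes here are placeholders:

```bash
# Hypothetical paths; adjust --model-type/--loader/--saver to your setup.
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --load-dir /path/to/ckpt_tp4_pp1 \
    --save-dir /path/to/ckpt_tp2_pp2 \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2
```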
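For the dist_checkpointing format described in the linked doc, resharding happens at load time instead: each rank declares the shards it wants via ShardedTensor, and the library reassembles them from whatever fragmentation was used at save time. Below is a minimal sketch under stated assumptions (2 ranks launched via torchrun, CPU tensors with the gloo backend, a placeholder checkpoint path), not official Megatron-LM example code; depending on the version, the default checkpoint backend may require extra packages such as zarr/tensorstore:

```python
# Run with: torchrun --nproc-per-node=2 demo.py
# Assumes Megatron-LM is on PYTHONPATH; ckpt_dir is a placeholder path.
import os
import torch
import megatron.core.dist_checkpointing as dist_checkpointing

torch.distributed.init_process_group(backend='gloo')  # CPU-only for simplicity
rank = torch.distributed.get_rank()
world = torch.distributed.get_world_size()

ckpt_dir = '/tmp/dist_ckpt_demo'
if rank == 0:
    os.makedirs(ckpt_dir, exist_ok=True)
torch.distributed.barrier()

# Save: a global (4, 8) 'weight' split along dim 0, one fragment per rank.
local = torch.full((4 // world, 8), float(rank))
dist_checkpointing.save(
    {'weight': dist_checkpointing.ShardedTensor.from_rank_offsets(
        'weight', local, (0, rank, world))},
    ckpt_dir)

# Load: request a *different* sharding, split along dim 1 instead.
# dist_checkpointing maps the saved fragments onto the requested shards.
buf = torch.empty(4, 8 // world)
loaded = dist_checkpointing.load(
    {'weight': dist_checkpointing.ShardedTensor.from_rank_offsets(
        'weight', buf, (1, rank, world))},
    ckpt_dir)
print(rank, loaded['weight'].shape)  # torch.Size([4, 4]) on each of 2 ranks
```

The same idea underlies resharding across tensor/pipeline-parallel layouts: the sharded state dict a model produces under the new configuration simply describes different fragments, and load resolves them against the saved checkpoint.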