NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION]How to convert a huggingface checkpoint, and also use PP > 1 or TP > 1 #966

Closed sambar1729 closed 3 weeks ago

sambar1729 commented 1 month ago

Your question I want to ingest a checkpoint from HF into Megatron-LM and then continue training on it. For the training part I will need TP > 1 or PP > 1 (given the model size and the GPU memory I have). This means that when I convert the HF checkpoint for Megatron, the TP and PP values of the converted checkpoint need to match the values I will use for training.

However, the conversion scripts from HF to mcore currently seem to assume PP = 1 and TP = 1 (I am hoping I am mistaken here). How do I use the conversion scripts in tools/checkpoints/convert.py so that I can use TP > 1 and/or PP > 1?
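For reference, recent Megatron-LM checkouts expose target parallelism flags on the checkpoint converter's saver side. The sketch below is an assumption-laden example, not a verified recipe: the loader/saver names (`llama_mistral`, `mcore`), the script path, and the exact flag spellings vary across Megatron-LM versions, so check `python tools/checkpoint/convert.py --help` in your checkout before relying on any of them.

```shell
# Hypothetical sketch: convert an HF checkpoint to a Megatron (mcore)
# checkpoint sharded for TP=2, PP=2. All flag and loader/saver names
# below are assumptions; verify against your Megatron-LM version.
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver mcore \
    --load-dir /path/to/hf_checkpoint \
    --save-dir /path/to/megatron_checkpoint \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2
```

If your version's converter only emits TP=1/PP=1, an alternative worth checking is whether it can write a distributed (mcore `dist_ckpt`) checkpoint, which newer Megatron-core releases can reload under a different TP/PP layout at training time.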

Thanks.

Edit: I am guessing this is answered in the negative (i.e., there is currently no way to do this conversion) by https://github.com/NVIDIA/Megatron-LM/issues/296#issuecomment-1732407585 -- wondering if there are any updates here.