Your question
I want to ingest a checkpoint from HF into Megatron-LM and then continue training on it. For the training part I will need TP > 1 or PP > 1 (given the model size and the GPU memory I have). So when I convert the HF checkpoint to work with Megatron, I need the TP and PP values to match what I will use for training.
However, right now the conversion scripts from HF to mcore seem to assume PP = 1 and TP = 1 (I am hoping I am mistaken here). How do I use the conversion script in tools/checkpoints/convert.py so that I can use TP > 1 and/or PP > 1? Thanks.
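For context, this is roughly the invocation I'd like to be able to run. The loader/saver names and the --target-* flags here are my guesses at what the interface might look like, not something I've confirmed against the current codebase:

```shell
# Hypothetical conversion command -- flag names, loader/saver names,
# and paths below are assumptions, not verified against Megatron-LM.
python tools/checkpoints/convert.py \
    --model-type GPT \
    --load-dir /path/to/hf_checkpoint \
    --save-dir /path/to/megatron_checkpoint \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2
```

The key question is whether something like --target-tensor-parallel-size / --target-pipeline-parallel-size is supported, or whether the converted checkpoint is always written with TP = 1, PP = 1 and must be resharded separately before training.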
Edit: I am guessing this is answered (in the negative: no, there is currently no way to do this conversion) by https://github.com/NVIDIA/Megatron-LM/issues/296#issuecomment-1732407585 -- wondering if there are any updates here.