bigscience-workshop / xmtf

Crosslingual Generalization through Multitask Finetuning
https://arxiv.org/abs/2211.01786
Apache License 2.0

bloomz-mt universal checkpoint #20

Open LiuShixing opened 1 year ago

LiuShixing commented 1 year ago

Hello! Thanks a lot for your work! I want to finetune bloomz-mt with your Megatron-DeepSpeed, but I cannot find a universal checkpoint of bloomz-mt or bloomz. I only found the bloom universal checkpoint below: https://huggingface.co/bigscience/bloom-optimizer-states/tree/global_step95000_universal

With limited GPUs, I have to use TP=4, PP=12 to finetune, but the document below suggests not merging TP, so I am looking for a bloomz-mt universal checkpoint. https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/finetune.md
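For rough sizing, here is a back-of-the-envelope sketch of what a TP=4, PP=12 layout implies per GPU for a 176B-parameter model. The 16 bytes/param figure is an assumption (BF16 weights and gradients plus FP32 Adam master weights and moments); it ignores activations and any DP/ZeRO sharding:

```python
# Rough per-GPU memory estimate for finetuning a 176B-parameter model
# under tensor parallel 4 x pipeline parallel 12 (assumption: parameters
# split evenly across the 48 model-parallel ranks, no ZeRO sharding).
TOTAL_PARAMS = 176e9
TP, PP = 4, 12

params_per_gpu = TOTAL_PARAMS / (TP * PP)

# Mixed-precision Adam: bf16 weights (2) + bf16 grads (2)
# + fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

gb_per_gpu = params_per_gpu * BYTES_PER_PARAM / 1e9
print(f"{params_per_gpu / 1e9:.2f}B params/GPU, ~{gb_per_gpu:.0f} GB before activations")
```

Even before activations, each of the 48 GPUs holds roughly 59 GB of model state under these assumptions, which is why dropping the DP optimizer states or resharding via a universal checkpoint matters at this scale.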

Muennighoff commented 1 year ago

I didn't create universal ckpts for them unfortunately. The options I see are:

1. Try to get resources to train for 1 step so you can convert to a universal ckpt. Note that as PP=72 & TP=1 and you can drop the DP states, you need at least 72 GPUs (probably some DP on top to fit it, though); you can set LR=0 for that step.
2. Figure out how to reshape the checkpoints / convert them to universal ckpts (I would hope that DeepSpeed has some code for this by now, but maybe not).
3. Restart from bloom (the models were trained for 500 steps, so it's not too much effort).

LiuShixing commented 1 year ago

Thanks, it seems that option 3 is best for me.
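For restarting from bloom, the universal optimizer states linked earlier in the thread can be pulled from the Hub. A minimal sketch using `huggingface_hub.snapshot_download` (the repo and revision come from the link above; note the full optimizer states are extremely large, so consider filtering with `allow_patterns` before downloading everything):

```python
REPO_ID = "bigscience/bloom-optimizer-states"
REVISION = "global_step95000_universal"  # branch from the link above


def fetch_universal_ckpt(local_dir: str) -> str:
    """Download the bloom universal checkpoint; returns the local path."""
    # Lazy import so the module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        revision=REVISION,
        local_dir=local_dir,
    )


# Usage (downloads many files; run on a machine with enough disk):
# path = fetch_universal_ckpt("./bloom_universal")
```

From there, finetuning would resume from the universal checkpoint with whatever TP/PP topology fits the available GPUs, per the tr11 finetune document linked above.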