Open sxthunder opened 1 year ago
@StellaAthena
Hi, have you solved the problem? I'm running into the same issue.
If you're using Megatron, it seems you must load the ZeRO optimizer states. Keep the MP size the same; the number of GPUs can differ. Alternatively, convert the model to HF format, but then the training speed is low.
I kept the MP size the same while changing the number of GPUs, but the loss climbed to 10+ even though I set load_optimizer_states=False, load_lr_scheduler_states=False, and load_module_only=True.
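For reference, a minimal sketch of how those three flags map onto DeepSpeed's load_checkpoint API (this is not the actual gpt-neox call site; model_engine and checkpoint_dir are placeholders):

```python
# Minimal sketch, assuming a DeepSpeed engine has already been initialized.
# `model_engine` and `checkpoint_dir` are placeholders, not gpt-neox names.
load_path, client_state = model_engine.load_checkpoint(
    checkpoint_dir,                  # directory containing the saved checkpoint tags
    load_optimizer_states=False,     # skip the (sharded) optimizer states
    load_lr_scheduler_states=False,  # start the LR schedule from scratch
    load_module_only=True,           # restore the model weights only
)
```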
Can you explain why this isn’t the desired behavior?
I converted the parameters to Hugging Face format without the DeepSpeed ZeRO states and it works well. Why must gpt-neox load the ZeRO states?
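For context, the HF-side workaround looks roughly like this (a minimal sketch; the checkpoint path is a placeholder, and it assumes the weights have already been converted into a Hugging Face-format directory):

```python
# Minimal sketch of the HF-side workaround, assuming the gpt-neox weights
# were already converted into a Hugging Face-format directory.
# "/path/to/hf_checkpoint" is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("/path/to/hf_checkpoint")
tokenizer = AutoTokenizer.from_pretrained("/path/to/hf_checkpoint")
# From here the model can be finetuned with any HF-compatible training loop;
# no DeepSpeed ZeRO optimizer states are required.
```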
Does GPT-NeoX 2.0 not support finetuning a model with a different number of GPUs? I pretrained a 6B model with GPT-NeoX 2.0 on 256 GPUs, then finetuned it on 32 GPUs. The logs show the model states and ZeRO optimizer states were loaded successfully, but the loss explodes after the second step.
We currently have a PR working its way through review that will fix this problem. We hope to have it merged later this week. https://github.com/EleutherAI/gpt-neox/pull/836
@StellaAthena Thanks, we're looking forward to being the first users.
Describe the bug
I trained a 1.3B model on 64 A100 80G GPUs and exported the saved checkpoints without the DeepSpeed ZeRO optimizer states; the exported checkpoint structure is the same as your open-source 20B checkpoints. Then I want to finetune the model on 8 GPUs, adding only {"finetune": true} to the config YAML.
When I run the program, the model parameters are loaded successfully:
But after that, it tries to load the ZeRO optimizer states, which are obviously missing:
Then the model starts training, but the loss scale is abnormal; you can see the first 10 steps are skipped.
For steps 10-20 the loss is around 6; after step 20 the loss climbs to 10+.
After that the loss decreases as if pretraining from scratch. I tested the finetuned model and its behavior is clearly abnormal.
Then I re-ran finetuning on 64 GPUs with the DeepSpeed ZeRO optimizer states, and everything works well:
Is this a bug, or did I miss some processing step for the pretrained checkpoints?