ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

Deepspeed training overflows #63

Closed (drunkinlove closed this issue 1 year ago)

drunkinlove commented 3 years ago

Hi! Thanks for replying to my earlier issues :)

I'm currently trying to finetune a model with deepspeed using scripts/deepspeed_gpt3_medium.sh as an example. After a while (usually around 16k steps), training basically hangs, with the following message repeated:

1622513061366 localhost info [2021-06-01 05:04:21,527] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 0.0, reducing to 0.0

This means the weight updates are too large and training has failed to converge, right? I've also tried setting a lower LR (as in deepspeed_gpt3_xl_finetune.sh), but the behavior is the same.

Have you run into this problem at any point? I'd appreciate any advice.

king-menin commented 3 years ago

Try changing the DeepSpeed config. Also check your batch size and make sure it matches the DeepSpeed config. Loss-scale overflows usually happen at the start of training; when they show up this late, it may be overfitting. Try evaluating older checkpoints on some test tasks.
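
For example, the fp16 section of a DeepSpeed config looks roughly like this. This is only a sketch: the key names are standard DeepSpeed options, but the values and the output filename are illustrative, not the exact ones shipped with this repo.

    import json

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,   # must match the per-GPU batch size you launch with
        "gradient_accumulation_steps": 1,
        "zero_optimization": {"stage": 2},
        "fp16": {
            "enabled": True,
            "loss_scale": 0,             # 0 enables dynamic loss scaling
            "initial_scale_power": 16,   # start the dynamic scale at 2**16
            "loss_scale_window": 1000,   # overflow-free steps before the scale is raised again
            "hysteresis": 2,
            "min_loss_scale": 1,         # keeps the scale from collapsing towards 0
        },
    }

    with open("deepspeed_config_example.json", "w") as f:  # placeholder filename
        json.dump(ds_config, f, indent=2)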

king-menin commented 3 years ago

What perplexity does the model have at step 16k?

ollmer commented 3 years ago

Sometimes this issue arises during fp16 training. We recommend the following:

  1. Try decreasing the learning rate by a factor of 2-4 and resuming training from the last successfully saved step.
  2. If that doesn't help, you could try resuming training in fp32 mode for a few thousand steps (a sketch of this is below). It will be slower and may require decreasing the batch size to fit in memory. Hope it helps!
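
A rough sketch of what step 2 could look like, assuming the run uses a JSON DeepSpeed config; the filenames below are placeholders for whatever config you actually launch with, and the learning-rate change from step 1 would go into the launch script itself (the LR argument you already adjusted).

    import json

    # Sketch only: filenames are placeholders for the config you actually use.
    with open("deepspeed_config.json") as f:
        cfg = json.load(f)

    cfg["fp16"]["enabled"] = False               # temporarily train in fp32
    cfg["train_micro_batch_size_per_gpu"] = 2    # smaller micro-batch so fp32 fits in memory

    with open("deepspeed_config_fp32.json", "w") as f:
        json.dump(cfg, f, indent=2)

    # Point the launch script at deepspeed_config_fp32.json, resume from the last
    # good checkpoint, and switch back to the fp16 config after a few thousand stable steps.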

drunkinlove commented 3 years ago

@king-menin Perplexity at step 16k is around 160. I found train_micro_batch_size_per_gpu=4 in the deepspeed config; is it supposed to equal the batch-size arg for pretrain_gpt3.py? Also, isn't overfitting supposed to lead to loss (and weight) stability? My model, on the other hand, seems to be getting huge updates that overflow in FP16...
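
For reference, this is just my own sanity check of how I understand the batch sizes relate (the accumulation steps and GPU count below are examples, not my actual setup): the effective global batch is the per-GPU micro batch times gradient accumulation times the number of GPUs, and the per-GPU micro batch is what I assume the batch-size arg corresponds to.

    # Sanity-check sketch; all values are examples.
    micro_batch_per_gpu = 4    # train_micro_batch_size_per_gpu in the deepspeed config
    grad_accum_steps = 1       # gradient_accumulation_steps in the deepspeed config
    num_gpus = 8               # world size of the run

    # DeepSpeed expects train_batch_size (if set) to equal this product.
    print("effective global batch:", micro_batch_per_gpu * grad_accum_steps * num_gpus)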

@ollmer I'd like to try that, but when I load a deepspeed checkpoint I get the following error:

File "/home/user/miniconda3/envs/gpt_train/lib/python3.8/site-packages/torch/serialization.py", line 831, in load_tensor
    storage = zip_file.get_storage_from_record(name, size, dtype).storage()
OSError: [Errno 14] Bad address

Would you happen to know how to fix this?