drunkinlove closed this issue 1 year ago
Try changing the DeepSpeed config, and also check that your batch size matches what the config expects. That said, loss-scale issues usually show up at the start of training; this late in training it may be overfitting instead. Try evaluating older checkpoints on some test tasks.
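For reference, the batch-size and fp16 fields of a DeepSpeed config look roughly like the sketch below; the concrete values are illustrative assumptions, not the ones shipped with this repo's scripts.

```python
# Illustrative sketch of the batch-size / fp16 fields of a DeepSpeed config,
# written as a Python dict (it mirrors the JSON config file one-to-one).
# The values below are assumptions for illustration only.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 => dynamic loss scaling
        "initial_scale_power": 16,  # initial scale = 2**16
        "loss_scale_window": 1000,  # clean steps before the scale is raised
        "min_loss_scale": 1,
    },
}
```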
What perplexity does your model reach at step 16k?
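For context, perplexity is just the exponential of the mean cross-entropy loss, so it can be read straight off the reported loss; a minimal sketch:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    # Perplexity is exp of the mean (natural-log) cross-entropy loss.
    return math.exp(mean_ce_loss)

print(perplexity(5.08))  # ~160, roughly the value reported for step 16k below
```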
Sometimes this issue arises during fp16 training. We recommend:
@king-menin Perplexity at step 16k is around 160. I found train_micro_batch_size_per_gpu=4 in the DeepSpeed config; is it supposed to equal the batch-size arg for pretrain_gpt3.py?
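For what it's worth, DeepSpeed ties these numbers together: the effective train_batch_size is the per-GPU micro batch times the gradient accumulation steps times the data-parallel world size. A quick sanity check (the concrete numbers below are assumptions for illustration):

```python
# DeepSpeed batch-size invariant:
#   train_batch_size = train_micro_batch_size_per_gpu
#                      * gradient_accumulation_steps
#                      * data-parallel world size
micro_batch_per_gpu = 4   # train_micro_batch_size_per_gpu in the DS config
grad_accum_steps = 1      # gradient_accumulation_steps in the DS config
world_size = 8            # number of data-parallel GPUs (assumed)

effective_batch = micro_batch_per_gpu * grad_accum_steps * world_size
print(f"effective train_batch_size = {effective_batch}")
```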
Also, isn't overfitting supposed to lead to loss (and weight) stability? My model, on the other hand, seems to get huge updates that overflow in FP16...
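A quick illustration of the FP16 overflow point (float16 tops out around 65504, so a single oversized value turns into inf, which is what the loss scaler reacts to):

```python
import torch

# float16 can only represent magnitudes up to ~65504; anything larger becomes
# inf, and the dynamic loss scaler responds by skipping the step and shrinking
# the loss scale.
x = torch.tensor(60000.0, dtype=torch.float16)
print(x * 2)                            # tensor(inf, dtype=torch.float16)
print(torch.finfo(torch.float16).max)   # 65504.0
```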
@ollmer I'd like to try that, but when I load a deepspeed checkpoint I get the following error:
File "/home/user/miniconda3/envs/gpt_train/lib/python3.8/site-packages/torch/serialization.py", line 831, in load_tensor
storage = zip_file.get_storage_from_record(name, size, dtype).storage()
OSError: [Errno 14] Bad address
Would you happen to know how to fix this?
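Not a fix, but one way to narrow this down is to check whether the checkpoint shard itself is readable outside of DeepSpeed; a minimal sketch, assuming a hypothetical shard path:

```python
import torch

# Hypothetical path to one shard of a saved DeepSpeed checkpoint; adjust to
# your actual checkpoint directory and filename.
shard_path = "checkpoints/global_step16000/mp_rank_00_model_states.pt"

# Loading on CPU takes GPU memory out of the equation; if this also fails with
# "Bad address", the file (or the filesystem it lives on) is the likely
# culprit rather than the DeepSpeed loading code.
state = torch.load(shard_path, map_location="cpu")
print(type(state), list(state.keys())[:5] if isinstance(state, dict) else None)
```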
Hi! Thanks for replying to my earlier issues :)
I'm currently trying to fine-tune a model with DeepSpeed, using scripts/deepspeed_gpt3_medium.sh as an example. After a while (usually around 16k steps), training basically hangs, with the following message repeated:
Does that mean the weight updates are too large and training failed to converge? I've also tried setting a lower LR (as in deepspeed_gpt3_xl_finetune.sh), but the dynamics are the same.
Have you run into this problem at any point? I'd appreciate any advice.
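For anyone hitting the same wall: when every step overflows, a dynamic loss scaler keeps skipping updates and halving the scale, which can look like training has stalled. A simplified, illustrative sketch of the idea (not DeepSpeed's actual implementation):

```python
import random

# Simplified dynamic-loss-scaling loop (illustrative only). When the unscaled
# gradients contain inf/nan, the update is skipped and the scale is halved;
# after a window of clean steps the scale is doubled again.
scale = 2.0 ** 16
clean_steps = 0
SCALE_WINDOW = 1000
MIN_SCALE = 1.0

for step in range(5000):
    overflow = random.random() < 0.001  # stand-in for an inf/nan check on grads
    if overflow:
        scale = max(scale / 2.0, MIN_SCALE)  # skip the update, shrink the scale
        clean_steps = 0
    else:
        clean_steps += 1
        if clean_steps == SCALE_WINDOW:
            scale *= 2.0
            clean_steps = 0
        # a normal optimizer step would happen here

print(f"final loss scale: {scale}")
```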