Closed — ArturTanona closed this issue 3 years ago
You are loading from the DeepSpeed checkpoints (mp_rank...). I am not sure they work with Hugging Face transformers yet, and they are also quite huge: tens of GB. I would recommend you delete the global_step folder (/datadrive/model/checkpoint-800/global_step800/ in your case) and just start again from the model in the checkpoint folder (/datadrive/model/checkpoint-800/). That way DeepSpeed won't try to resume from the DeepSpeed checkpoint.
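Something like this minimal sketch should do it (the paths are the ones from your report; it assumes the usual Hugging Face Trainer checkpoint layout, i.e. pytorch_model.bin and config.json live directly in the checkpoint folder):

```python
import shutil
from transformers import AutoModelForCausalLM

ckpt_dir = "/datadrive/model/checkpoint-800"      # HF Trainer checkpoint folder
ds_state_dir = f"{ckpt_dir}/global_step800"       # DeepSpeed mp_rank_* shards live here

# Remove the DeepSpeed engine state so training restarts from the HF weights only.
shutil.rmtree(ds_state_dir, ignore_errors=True)

# The remaining files can then be loaded as a plain Hugging Face model,
# and a new training run can be started from this directory.
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
```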
In general, if you run into memory issues, you can always try reducing the batch size (and increasing gradient_accumulation accordingly), and you can lower allgather_bucket_size and reduce_bucket_size to 5e7 in the ds_config_gptneo_new.json file, as sketched below.
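A small sketch of that config change (it assumes the two bucket-size keys already sit under the standard "zero_optimization" section of your DeepSpeed config; adjust if your file is laid out differently):

```python
import json

cfg_path = "ds_config_gptneo_new.json"
with open(cfg_path) as f:
    cfg = json.load(f)

# Smaller communication buckets trade a bit of speed for a lower peak memory footprint.
cfg["zero_optimization"]["allgather_bucket_size"] = int(5e7)
cfg["zero_optimization"]["reduce_bucket_size"] = int(5e7)

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```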
Worked! Many thanks! And thanks for this awesome repo!
I have an RTX 3090 (24 GB), 64 GB of RAM, and 50 GB of swap. Training works pretty nicely, but unfortunately resuming training from a checkpoint results in an OOM: