Closed: VasylVaskivskyi closed this issue 3 years ago.
Try the DeepSpeed version of training. With big models, memory runs out because they have many parameters, so we suggest DeepSpeed-based training. For example: https://github.com/sberbank-ai/ru-gpts/blob/master/scripts/deepspeed_gpt3_large.sh
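A rough sketch of what DeepSpeed-based training looks like in Python, run under the `deepspeed` launcher; the ZeRO stage, optimizer settings and batch sizes below are illustrative assumptions, not the exact values from the linked script:

```python
# Minimal DeepSpeed training sketch (launch with the `deepspeed` launcher).
# ZeRO stage, learning rate and batch settings are illustrative only.
import deepspeed
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # partition optimizer state and gradients across GPUs
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
}

# Wrap the model; older DeepSpeed versions take the dict via `config_params=` instead of `config=`.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

def train_step(batch):
    # `batch` is assumed to be a dict of tensors already on model_engine.device
    outputs = model_engine(**batch, labels=batch["input_ids"])
    model_engine.backward(outputs.loss)  # backward goes through the engine
    model_engine.step()                  # optimizer step + ZeRO bookkeeping
```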
I tried to run the code and the pretrained network provided in this notebook:
https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_RuGPTs_with_HF.ipynb
but I can't run more than one batch per GPU with my data. The data is not very big (90 MB), but training takes forever with 1 batch per step. So, every time I run this command with --per_gpu_train_batch_size > 1, I get the error RuntimeError: CUDA out of memory. The error shows that more than 90% of the memory is allocated to torch and the rest is not enough to run multiple batches. This happens on GPUs with any amount of memory: 10, 15, 30 GB. Could you please fix the amount of memory preallocated to PyTorch?
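In the meantime, a common workaround is to keep the per-GPU batch at 1 and use gradient accumulation, fp16 and gradient checkpointing to get a larger effective batch in the same memory. Below is a sketch using the plain Hugging Face Trainer API rather than the notebook's script, so the argument names may differ; "train.txt" and the output directory are placeholders:

```python
# Sketch of standard memory workarounds: batch size 1 per device,
# gradient accumulation, fp16 and gradient checkpointing.
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "sberbank-ai/rugpt3large_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.gradient_checkpointing_enable()  # trade extra compute for activation memory

# "train.txt" is a placeholder for your own training file
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="finetuned",
    per_device_train_batch_size=1,   # the size that actually fits on the GPU
    gradient_accumulation_steps=16,  # effective batch size of 16
    fp16=True,                       # roughly halves activation and gradient memory
    num_train_epochs=1,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
).train()
```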