ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

Torch takes almost all the memory even on large GPU #57

Closed: VasylVaskivskyi closed this issue 3 years ago

VasylVaskivskyi commented 3 years ago

I tried to run the code and pretrained model provided in this notebook, https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_RuGPTs_with_HF.ipynb, but with my data I can't train with a per-GPU batch size larger than 1. The dataset is not very big (90 MB), yet training takes forever at batch size 1.

So, every time I run this command

!CUDA_VISIBLE_DEVICES=0 python ru-gpts/pretrain_transformers.py \
    --output_dir=models/essays \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt3small_based_on_gpt2 \
    --do_train \
    --train_data_file=train.txt \
    --do_eval \
    --eval_data_file=valid.txt \
    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 5 \
    --block_size 2048 \
    --overwrite_output_dir

with --per_gpu_train_batch_size > 1, I get RuntimeError: CUDA out of memory. The error message shows that more than 90% of the GPU memory is allocated to torch, and what remains is not enough for a larger batch. This happens on GPUs of any size: 10, 15, and 30 GB.
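(Editorial aside: the memory nvidia-smi attributes to "torch" is mostly PyTorch's caching allocator plus live tensors, not a fixed preallocation. A minimal sketch for checking the split from inside the training process, assuming a CUDA device is visible:

import torch

# PyTorch's caching allocator reserves memory in large blocks and keeps it
# for reuse; external tools report the reserved total, which is why it looks
# like torch has "preallocated" almost the whole GPU.
device = torch.device("cuda:0")
total = torch.cuda.get_device_properties(device).total_memory
allocated = torch.cuda.memory_allocated(device)  # held by live tensors
reserved = torch.cuda.memory_reserved(device)    # held by the allocator cache
print(f"total:     {total / 2**30:.2f} GiB")
print(f"allocated: {allocated / 2**30:.2f} GiB")
print(f"reserved:  {reserved / 2**30:.2f} GiB")

If allocated is close to total, the model, optimizer states, and activations genuinely fill the card; the allocator is not the problem.)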

Could you please fix the amount of memory preallocated to PyTorch?
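(Editorial aside: in the command above, --block_size 2048 is a likely culprit, since transformer activation memory grows roughly quadratically with sequence length. A sketch of a lower-memory variant of the same command, trading context length for accumulated steps; the 512/8 values are illustrative, not from the repo:

!CUDA_VISIBLE_DEVICES=0 python ru-gpts/pretrain_transformers.py \
    --output_dir=models/essays \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt3small_based_on_gpt2 \
    --do_train \
    --train_data_file=train.txt \
    --do_eval \
    --eval_data_file=valid.txt \
    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 5 \
    --block_size 512 \
    --overwrite_output_dir

Gradient accumulation keeps the effective batch at 8 sequences per optimizer step while only one sequence is resident in memory at a time.)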

king-menin commented 3 years ago

Try the DeepSpeed version of training. With big models the memory fills up because they have many parameters, so we suggest DeepSpeed-based training. For example: https://github.com/sberbank-ai/ru-gpts/blob/master/scripts/deepspeed_gpt3_large.sh
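(Editorial aside: DeepSpeed's memory savings come mainly from ZeRO partitioning and fp16. A minimal sketch of the kind of ds_config.json involved; these values are illustrative, and the repo's actual script may use different settings:

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true
  }
}

ZeRO stage 2 shards optimizer states and gradients across GPUs, and fp16 roughly halves parameter and activation memory, which is why the DeepSpeed path fits larger models or batch sizes on the same hardware.)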