ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

What GPU needed to finetune Large version? #27

Closed Rai220 closed 3 years ago

Rai220 commented 3 years ago

I have a 16 GB GPU and get a CUDA out-of-memory error (for batch size = 1!):

RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 14.76 GiB total capacity; 13.25 GiB already allocated; 21.44 MiB free; 13.84 GiB reserved in total by PyTorch)

Is this memory really not enough to train the large version? Maybe there are some tips to reduce memory usage during pretraining? I'm using the following parameters:

    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --overwrite_cache \
    --num_train_epochs 2 \
    --save_steps 1000 \
    --block_size 256 \
    --fp16
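
For context, a back-of-the-envelope budget makes the OOM unsurprising. Assuming ruGPT-3 Large is ~760M parameters (the figure given in the repo README; treat it as an assumption if your checkpoint differs), plain FP32 Adam already needs roughly 11 GiB for weights, gradients, and optimizer state before a single activation is stored:

    # Rough memory estimate for full Adam finetuning (illustrative only).
    params = 760e6                    # assumed parameter count of ruGPT-3 Large
    bytes_per_param = 4 + 4 + 4 + 4   # FP32 weights + grads + Adam m + Adam v
    print(f"{params * bytes_per_param / 2**30:.1f} GiB before activations")
    # ~11.3 GiB -- on a 16 GB card that leaves very little headroom for
    # activations, so even batch size 1 with block_size 256 can run out.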
OzoneReloaded commented 3 years ago

Hello! I've managed to run finetuning on an 11 GB GPU with:

    gpt_options="\
       --hidden-size 1024 \
       --seq-length 1024 \
       --cpu-optimizer \
       --cpu_torch_adam \
       "

Hope it helps. @Rai220
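
For anyone curious what those two flags buy: they keep the Adam state in host RAM instead of GPU memory. Below is a minimal sketch of the same idea in plain PyTorch, illustrative only; the toy layer and the explicit copy loops are stand-ins, not the repo's actual Megatron-based implementation:

    import torch

    # Keep FP32 master weights and the Adam state on the CPU, so only the
    # model, activations, and gradients occupy GPU memory.
    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the GPT model
    master = [p.detach().cpu().float() for p in model.parameters()]
    for m in master:
        m.requires_grad_()
    optimizer = torch.optim.Adam(master, lr=1e-5)  # Adam state lives on CPU

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()                # dummy loss for illustration
    loss.backward()

    # Move gradients to the CPU master copies and step there.
    for m, p in zip(master, model.parameters()):
        m.grad = p.grad.detach().cpu().float()
    optimizer.step()
    optimizer.zero_grad()
    model.zero_grad()

    # Copy the updated master weights back to the GPU.
    with torch.no_grad():
        for m, p in zip(master, model.parameters()):
            p.copy_(m)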

fen0s commented 3 years ago

> I have 16Gb GPU and get CUDA out of memory error (for batch size = 1!) [...] Is this memory really not enough to train the large version?

Apparently, optimization level O3 helps, but I haven't quite figured out how to make it generate samples; it just outputs negative probabilities for some reason. The above answer is for GPT-3 large, not GPT-2 large, so...
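
For reference, O3 here is Apex AMP's "pure FP16" mode, and its lack of FP32 master weights is a plausible cause of the broken sampling probabilities. A minimal sketch, assuming NVIDIA Apex is installed and using a toy layer in place of the GPT model:

    import torch
    from apex import amp  # NVIDIA Apex, installed separately

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the GPT model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

    # opt_level="O3" casts the whole model to FP16. It saves the most memory,
    # but without FP32 master weights it is numerically fragile; NaN or
    # degenerate log-probabilities at sampling time are a known symptom.
    # "O1"/"O2" keep FP32 masters and are the safer choices.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O3")

    x = torch.randn(8, 1024, device="cuda").half()  # O3 expects FP16 inputs
    loss = model(x).float().pow(2).mean()           # dummy loss, upcast to FP32
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()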

fen0s commented 3 years ago

Basically, what's needed is gradient checkpointing, which was added in one of the transformers library releases. Not sure if I can implement it, especially considering that an old version of the transformers library is used here...
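
For later readers: in recent transformers releases this is a one-liner. A minimal sketch, assuming a reasonably recent transformers version and the ruGPT-3 Large checkpoint id from the repo README:

    from transformers import GPT2LMHeadModel

    # Gradient checkpointing recomputes activations during the backward pass
    # instead of storing them, trading extra compute for a large memory saving.
    model = GPT2LMHeadModel.from_pretrained(
        "sberbank-ai/rugpt3large_based_on_gpt2"
    )
    model.gradient_checkpointing_enable()  # needs a recent transformers
    model.train()                          # only matters during training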

TatianaShavrina commented 3 years ago

Hey @Rai220 @fen0s

The organizers gave participants the opportunity to get access to Cristofari. To get access, please send a request with brief information about your project to AIJ_ruGPT-3@sberbank.ru. We will review your request and get back to you. Please note that the number of such accesses is limited, so if necessary, please submit your request as early as possible.