facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

XLM-R finetuning but OOM even with batch size of 1? #1749

Closed: TingtingLi0101 closed this issue 2 years ago

TingtingLi0101 commented 4 years ago

❓ Questions and Help

I'm trying to finetune the XLM-R large model for a sentence prediction task, and it runs out of memory right away, even after decreasing the batch size all the way down to 1. I tried two different GPU instance types, g4dn.16xlarge (single GPU) and g4dn.12xlarge (multi GPU), and the error is the same on both. The logs show that the checkpoint itself loads successfully before the failure:

`RuntimeError: CUDA out of memory. Tried to allocate 2.09 GiB (GPU 0; 14.73 GiB total capacity; 11.98 GiB already allocated; 1.33 GiB free; 577.00 MiB cached)`

#### What have you tried?

```
MAX_TOKENS=1000
MAX_SENTENCES=1
UPDATE_FREQ=1
XLMR_PATH="xlmr.large/model.pt"
```

#### What's your environment?

- fairseq Version: master
- PyTorch Version: 1.2.0
- OS (e.g., Linux): Linux
- How you installed fairseq (`pip`, source): pip
- Python version: 3.6
- CUDA/cuDNN version: 10.0
- GPU models and configuration:
  - g4dn.16xlarge: 1 GPU, 64 vCPU, 256G Mem, 16G GPU Memory
  - g4dn.12xlarge: 4 GPU, 48 vCPU, 192G Mem, 64G GPU Memory
myleott commented 4 years ago

Hmm, 16GB is a bit tight for the large model, but you should be able to train something.

Can you share the rest of your training command? Also be careful about sequence lengths -- the memory requirements grow a lot with longer sequences, so if your data has very long sequences and you're not filtering them then that could be a problem.
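As a quick sketch of what that filtering looks like in practice, these are the fairseq flags that bound sequence length for the `sentence_prediction` task (all three also show up in the training command later in this thread); self-attention memory grows roughly quadratically with sequence length, so they matter a lot here:

```bash
# Sequence-length controls for the sentence_prediction task (sketch, no-op as written):
#   --max-positions 512                      # XLM-R's maximum input length
#   --truncate-sequence                      # truncate over-length inputs to --max-positions
#   --skip-invalid-size-inputs-valid-test    # drop over-length lines in the valid/test sets
```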

TingtingLi0101 commented 4 years ago

Yeah, I only changed MAX_TOKENS, MAX_SENTENCES, and UPDATE_FREQ, and would like to set them as below, but even after decreasing to MAX_TOKENS=1000 and MAX_SENTENCES=1 it still hits OOM (the reduced settings are sketched after the full command below).

Also, the maximum memory/GPU that AWS provides is 16 GB. I'm wondering what kind of GPUs your team used to train the model?

```bash
TOTAL_NUM_UPDATES=140000
WARMUP_UPDATES=8400              # 6% of TOTAL_NUM_UPDATES
LR=1e-05
NUM_CLASSES=3
MAX_TOKENS=2000
MAX_SENTENCES=16
UPDATE_FREQ=4
XLMR_PATH="xlmr.large/model.pt"

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data-base-bin \
    --restore-file $XLMR_PATH \
    --max-sentences $MAX_SENTENCES \
    --max-tokens $MAX_TOKENS \
    --task sentence_prediction \
    --max-positions 512 \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --num-classes $NUM_CLASSES \
    --arch roberta_base \
    --criterion sentence_prediction \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --update-freq $UPDATE_FREQ \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 10 \
    --skip-invalid-size-inputs-valid-test \
    --truncate-sequence \
    --save-dir i18n_checkpoints_1e5 \
    --tensorboard-logdir "i18n_tf_board_1e5/" \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --find-unused-parameters;
```
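For reference, a minimal sketch of the reduced settings mentioned above, plus fairseq's `--memory-efficient-fp16` option as a possible further saving; whether that is enough to fit xlmr.large on a 16 GB GPU is not confirmed here:

```bash
# Reduced-memory settings (the values tried above); a higher UPDATE_FREQ keeps the
# effective batch size up while each individual step stays small:
MAX_TOKENS=1000
MAX_SENTENCES=1
UPDATE_FREQ=1        # could be raised, e.g. to 16, to compensate for the tiny per-step batch

# fairseq also has a memory-efficient fp16 mode that avoids keeping an extra fp32
# copy of the model, at some cost in speed; it can be used in place of --fp16 above:
#   --memory-efficient-fp16 --fp16-init-scale 4 --fp16-scale-window 128
```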
myleott commented 4 years ago

> the maximum memory/GPU that AWS provides is 16 GB

AWS does have p3dn.24xlarge instances with 32GB V100 GPUs, which is the same type of GPU that we used.

cc @ngoyal2707 about expected memory usage for the large model.

TingtingLi0101 commented 4 years ago

Ah, that makes sense. But AWS doesn't offer the p3dn.24xlarge instance for per-hour usage, so I could start with the xlmr.base model instead. Thanks for your reply!
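Roughly, the only changes needed to reuse the command above with the base model are the checkpoint path and the architecture (this sketch assumes the xlmr.base download unpacks to `xlmr.base/model.pt`):

```bash
# Sketch: switching the recipe above from xlmr.large to xlmr.base
# (assumes the xlmr.base download unpacks to ./xlmr.base/model.pt).
XLMR_PATH="xlmr.base/model.pt"   # was xlmr.large/model.pt

# In train.py's arguments, the architecture should match the base-sized checkpoint:
#   --arch roberta_base
```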

mohammedayub44 commented 4 years ago

@myleott are there any runtime stats you can share for the large or small model? Per-epoch runtime, total time taken, etc.

@ngoyal2707 are there any memory usage stats you can share for training the large or small XLM-R model?

Thanks !

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!