transformer_lm_gpt2_big OOM: Ran out of memory with exception: CUDA out of memory

yuan-commits commented 4 years ago

❓ Questions and Help

What is your question?

I am training a GPT2-big model on a large dataset (6600M words) on NVIDIA V100 32G * 8. It shows OOM error when training. Any methods for optimize GPU memory and fix OOM?

In addition, is there any instruction or demo training script for training gpt2-large besides the LM readme?

2020-06-26 19:25:17 | INFO | fairseq_cli.train | model transformer_lm_gpt2_big, criterion CrossEntropyCriterion
2020-06-26 19:25:17 | INFO | fairseq_cli.train | num. model params: 2174390400 (num. trained: 2174390400)
2020-06-26 19:25:21 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-06-26 19:25:21 | INFO | fairseq_cli.train | max tokens per GPU = 64 and max sentences per GPU = None
2020-06-26 19:25:21 | INFO | fairseq.trainer | no existing checkpoint found /mnt/output/yuanz/fairseq_exp/output/lm_adaptoptim_new_eos_gpt_big_maxtk64/checkpoint_last.pt
2020-06-26 19:25:21 | INFO | fairseq.trainer | loading train data for epoch 1
2020-06-26 19:27:26 | INFO | fairseq.data.data_utils | loaded 406129439 examples from: /mnt/default/yuanz/fairseq_exp/newdata_base/train
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:38:58.405576    97 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmp2878werq, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:38:58.627735   466 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:38:58.627791   466 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:39:00.788980    97 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmptiqqh4ze, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:01.014852   512 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:39:01.014906   512 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:39:03.540951    97 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmph8gdj_7v, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:03.768124   517 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:39:03.768196   517 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:39:05.129994    99 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpmkuzd25z, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:05.360617   523 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:39:05.360677   523 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:39:07.362105    99 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpmt5b0w1r, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:07.589992   568 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:39:07.590052   568 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:39:10.007517    99 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpgc170iu6, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:10.227804   573 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:39:10.227859   573 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:39:16.043857    96 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpjl1kpiys, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:16.269462   583 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:39:16.269541   583 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:39:18.382289    96 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpwpp_n_cu, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:18.617144   628 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:39:18.617205   628 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:39:20.828233    96 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpp2nsl0i5, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:39:21.062494   637 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:39:21.062613   637 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:41:50.126955    98 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmp8y9_f5f_, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:41:50.355329   687 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:41:50.355381   687 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:41:53.898403    98 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpvhc988wk, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:41:54.121002   732 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:41:54.121058   732 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:41:59.600651    98 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpiqw2ruxy, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:41:59.821144   737 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:41:59.821197   737 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:42:17.773564    95 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpfxgwh_vu, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:42:18.024216   744 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:42:18.024271   744 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:42:21.342677    95 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpk7cirnav, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:42:21.570621   789 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:42:21.570680   789 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:42:27.059877    95 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpet3gikrp, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:42:27.285737   794 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:42:27.285796   794 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:43:05.306855    93 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmppk7cavw9, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:43:05.541484   822 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:43:05.541538   822 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:43:07.924999    93 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmp8sfw9ihi, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:43:08.153175   868 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:43:08.153226   868 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:43:11.817020    93 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmp83b0uz6p, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:43:12.038414   873 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:43:12.038467   873 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:43:40.714576    94 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpzbym4unw, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:43:40.933988   892 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:43:40.934039   892 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:43:45.289515    94 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpgnvbgwfb, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:43:45.518719   941 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:43:45.518774   941 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:43:49.297312    94 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmp11ii628l, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:43:49.522616   946 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:43:49.522667   946 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0626 19:45:06.083225   100 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmpvtda_un7, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:45:06.327651   983 store.cc:1149] Allowing the Plasma store to use up to 3.41149GB of memory.
I0626 19:45:06.327710   983 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:45:09.106070   100 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmp91ovxaxu, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:45:09.327500  1028 store.cc:1149] Allowing the Plasma store to use up to 6.82297GB of memory.
I0626 19:45:09.327555  1028 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
E0626 19:45:14.426862   100 io.cc:168] Connection to IPC socket failed for pathname /tmp/tmp4sx9mouv, retrying 20 more times
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 19:45:14.643324  1033 store.cc:1149] Allowing the Plasma store to use up to 10.2345GB of memory.
I0626 19:45:14.643376  1033 store.cc:1176] Starting object store with directory /dev/shm and huge page support disabled
2020-06-26 19:46:54 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-06-26 19:47:11 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 8.10 GiB (GPU 7; 31.75 GiB total capacity; 24.94 GiB already allocated; 5.44 GiB free; 25.24 GiB reserved in total by PyTorch)
...
Traceback (most recent call last):
  File "/opt/conda/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/code/fairseq/fairseq_cli/train.py", line 370, in cli_main
    nprocs=args.distributed_world_size,
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/code/fairseq/fairseq_cli/train.py", line 338, in distributed_main
    main(args, init_distributed=True)
  File "/code/fairseq/fairseq_cli/train.py", line 121, in main
    valid_losses = train(args, trainer, task, epoch_itr, max_update)
  File "/opt/conda/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/code/fairseq/fairseq_cli/train.py", line 206, in train
    log_output = trainer.train_step(samples)
  File "/opt/conda/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/code/fairseq/fairseq/trainer.py", line 504, in train_step
    raise e
  File "/code/fairseq/fairseq/trainer.py", line 486, in train_step
    self.optimizer.step()
  File "/code/fairseq/fairseq/optim/fp16_optimizer.py", line 186, in step
    self.fp32_optimizer.step(closure)
  File "/code/fairseq/fairseq/optim/fairseq_optimizer.py", line 95, in step
    self.optimizer.step(closure)
  File "/code/fairseq/fairseq/optim/nag.py", line 85, in step
    param_state['momentum_buffer'] = torch.zeros_like(d_p)
RuntimeError: CUDA out of memory. Tried to allocate 8.10 GiB (GPU 3; 31.75 GiB total capacity; 24.94 GiB already allocated; 5.46 GiB free; 25.23 GiB reserved in total by PyTorch)

Code

Here is my training script, following the [adaptive LM readme](https://github.com/pytorch/fairseq/blob/master/examples/language_model/README.adaptive_inputs.md). Note that I enable fp16 option. ``` bash # max-tokens 2k fairseq-train --task language_modeling \ "$DATA_DIR" \ --save-dir "$SAVE_DIR" \ --arch transformer_lm_gpt2_big \ --optimizer nag --clip-norm 0.1 \ --lr 0.0001 --lr-scheduler cosine --max-lr 1.0 \ --t-mult 2 --lr-period-updates 270000 --lr-shrink 0.75 \ --warmup-updates 16000 --warmup-init-lr 1e-07 \ --max-tokens 2048 --update-freq 3 \ --tokens-per-sample 3072 --sample-break-mode eos --seed 1 \ --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d \ --min-lr 1e-09 \ --fp16 \ --save-interval-updates 2000 \ --keep-interval-updates 1 # max-tokens 64 fairseq-train --task language_modeling \ "$DATA_DIR" \ --save-dir "$SAVE_DIR" \ --arch transformer_lm_gpt2_big \ --optimizer nag --clip-norm 0.1 \ --lr 0.0001 --lr-scheduler cosine --max-lr 1.0 \ --t-mult 2 --lr-period-updates 270000 --lr-shrink 0.75 \ --warmup-updates 16000 --warmup-init-lr 1e-07 \ --max-tokens 64 --update-freq 96 \ --tokens-per-sample 3072 --sample-break-mode eos --seed 1 \ --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d \ --min-lr 1e-09 \ --fp16 \ --save-interval-updates 2000 \ --keep-interval-updates 1 ``` #### What have you tried? - I have train a transformer_lm_gpt2_small model with the same optimizing strategy, it works well. (not using adam with rvs_sqrt because it always show overflow under fp16) - Reduce the max-token even to 64 as well as increasing update_freq also results in OOM error. #### What's your environment? - fairseq Version (e.g., 1.0 or master): former master (commit 775122950d145382146e9120308432a9faf9a9b8) - PyTorch Version (e.g., 1.0) 1.4.0 - OS (e.g., Linux): Ubuntu 16.04.6 LTS - How you installed fairseq (`pip`, source): editable mode by using: ``` bash git clone https://github.com/pytorch/fairseq cd fairseq pip install --editable . ``` - Build command you used (if compiling from source): None - Python version: 3.7 - CUDA/cuDNN version: CUDA runtime version: 10.0.130 cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5 - GPU models and configuration: V100 32G * 8 - Any other relevant information:

myleott commented 4 years ago

Several things:

Is your vocabulary size quite large? The model has >2B parameters, which I suspect is partly caused by a large embedding table.
You have --tokens-per-sample 3072, which is too large of a context size for a model with this many parameters. Try 1024 or even smaller (512 could be fine).
You can use --memory-efficient-fp16 instead of --fp16. This is a slightly more aggressive version of mixed precision training which will save memory, but typically requires a large batch size and may produce slightly worse perplexities in the end.

Another idea is to use model parallel training. We support this here: https://github.com/pytorch/fairseq/tree/master/examples/megatron_11b#example-training-command-model-parallel

yuan-commits commented 4 years ago

Thank you for your reply @myleott.

Yes, the vocab size is 200k. Because I train a Chinese language model, and take a word as the input unit rather than a character. We use the word as input because it shows better performance in former experiments.
Sorry for the unclear description. As --sample-break-mode eos is set, --tokens-per-sample 3072 takes no effects and thus can be ignored. I set --sample-break-mode eos because the training corpus is sentence-level and does not contains any document or paragraph information, setting --sample-break-mode none also gives a worse performance in my former experiments (use the model for sentence-level reranking).
Thank you. Does the --memory-efficient-fp16 deal well with overflow? BTW, I also note that in my --fp16 experiments, using adam with ivs_sqrt skips 1/5 epoch samples in the first epoch because of frequent NOTE: overflow detected, setting loss scale to: 64.0 (I also print the L2 norm of the gradient before clipping, it shows inf when overflow detected). When change to nag, it works well (seems that momentum smooths the gradient).

I also note that the oom occurs on nag. Does nag have more memory consumption than adam?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

facebookresearch / fairseq