codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0

CUDA OOM error when training on a large corpus of Wikipedia text files; how to manage big files for training? #64

Open MohamedLotfyElrefai opened 5 years ago

MohamedLotfyElrefai commented 5 years ago

I used these parameters:

bert -c /home/ai/LM_fit/bert/bert_pytorch/dataset/wiki_arabic.txt -v /home/ai/LM_fit/bert/bert_pytorch/dataset/wiki_vocab.small -o /home/ai/LM_fit/bert/bert_pytorch/dataset/wiki_model_cpu -hs 240 -l 3 -a 3 -s 30 -b 8 --on_memory False --with_cuda True -w 4

Loading Vocab /home/ai/LM_fit/bert/bert_pytorch/dataset/wiki_vocab.small
Vocab Size: 2135556
Loading Train Dataset /home/ai/LM_fit/bert/bert_pytorch/dataset/wiki_arabic.txt
Loading Dataset: 1760404it [00:03, 497901.29it/s]
Loading Test Dataset None
Creating Dataloader
Building BERT model
Creating BERT Trainer
Total Parameters: 1029286598
Training Start
EP_train:0:   0%|| 0/220051 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ai/py3.6/bin/bert", line 11, in <module>
    sys.exit(train())
  File "/home/ai/py3.6/lib/python3.6/site-packages/bert_pytorch/__main__.py", line 67, in train
    trainer.train(epoch)
  File "/home/ai/py3.6/lib/python3.6/site-packages/bert_pytorch/trainer/pretrain.py", line 81, in train
    self.iteration(epoch, self.train_data)
  File "/home/ai/py3.6/lib/python3.6/site-packages/bert_pytorch/trainer/pretrain.py", line 132, in iteration
    loss.backward()
  File "/home/ai/py3.6/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ai/py3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.91 GiB (GPU 0; 10.92 GiB total capacity; 9.59 GiB already allocated; 230.62 MiB free; 13.23 MiB cached)
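For scale, note where the reported 1,029,286,598 parameters come from: with a 2,135,556-token vocabulary and -hs 240, the input token embedding and the masked-LM output projection alone are each a vocab_size x hidden_size matrix. A back-of-the-envelope sketch (assuming the embedding and output matrices are untied, which is consistent with the reported total; the small remainder is the three transformer layers):

```python
# Rough parameter estimate for the run above, using the numbers from the log.
# Assumption: the input embedding and the MLM output projection are separate
# (untied) vocab_size x hidden_size matrices; transformer-layer weights and
# biases are ignored here.
vocab_size = 2_135_556   # "Vocab Size" printed in the log
hidden_size = 240        # from the -hs 240 flag

embedding_params = vocab_size * hidden_size   # input token embedding
output_params = vocab_size * hidden_size      # MLM output projection

total_vocab_params = embedding_params + output_params
print(f"vocab-tied parameters: {total_vocab_params:,}")
# -> vocab-tied parameters: 1,025,066,880 (of 1,029,286,598 total reported)
```

So over 99% of the model lives in the vocabulary matrices. Under that reading, lowering the batch size will not help much; shrinking the vocabulary (for example with subword tokenization) would reduce GPU memory far more.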