mirrorQAQ opened this issue 3 years ago
Thanks a lot, man! I used your code, and sentences in the MSRA dataset are now truncated by length, but training still hits an OOM. Have you run into this?
train:7026
train max_seq_len:387
train max_lex_num:233
train max_seq_lex:601
test max_seq_len:351
test max_lex_num:231
test max_seq_lex:572
loading vocabulary file /home/llq/.fastNLP/embedding/bert-chinese-wwm/vocab.txt
Load pre-trained BERT parameters from file /home/llq/.fastNLP/embedding/bert-chinese-wwm/chinese_wwm_pytorch.bin.
Start to generate word pieces for word.
Found(Or segment into word pieces) 116198 words out of 116559.
training epochs started 2020-12-15-13-55-24
Traceback (most recent call last):
File "flat_main.py", line 588, in <module>
trainer.train()
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 613, in train
self.callback_manager.on_exception(e)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/callback.py", line 309, in wrapper
returns.append(getattr(callback, func.__name__)(*arg))
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/callback.py", line 505, in on_exception
raise exception # 抛出陌生Error
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 609, in train
self._train()
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 697, in _train
eval_res = self._do_validation(epoch=epoch, step=self.step)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 714, in _do_validation
res = self.tester.test()
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/tester.py", line 165, in test
pred_dict = self._data_forward(self._predict_func, batch_x)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/fastNLP/core/tester.py", line 213, in _data_forward
y = self._predict_func_wrapper(**x)
File "/home/llq/homework/Flat-Lattice-Transformer/models.py", line 511, in forward
embedding, seq_len, lex_num=lex_num, pos_s=pos_s, pos_e=pos_e)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/llq/homework/Flat-Lattice-Transformer/modules.py", line 1277, in forward
rel_pos_embedding = self.four_pos_fusion_embedding(pos_s,pos_e)
File "/home/llq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/llq/homework/Flat-Lattice-Transformer/modules.py", line 110, in forward
pe_2 = torch.cat([pe_ss,pe_ee],dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 3.43 GiB (GPU 0; 10.91 GiB total capacity; 7.58 GiB already allocated; 2.49 GiB free; 7.87 GiB reserved in total by PyTorch)
Try turning the batch size down; give 16 a shot.
I set the batch size to 2 with gradient accumulation over 5 steps (sketched below), but it still blew up midway...
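For reference, a generic PyTorch sketch of "batch size 2, accumulate gradients over 5 steps" (effective batch 10). Illustrative only: FLAT trains through fastNLP's Trainer, not a hand-written loop, and the dummy model/data below just stand in.

```python
import torch
from torch import nn

# Dummy model and data standing in for FLAT; batches of size 2.
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(2, 8), torch.randint(0, 2, (2,))) for _ in range(10)]

accum_steps = 5
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Average the loss over the accumulation window so the update
    # matches one large batch of size 2 * accum_steps.
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                    # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()               # one update per 5 micro-batches
        optimizer.zero_grad()
```

Note this only shrinks the training-step footprint; it does nothing for evaluation memory.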
My guess is that the test batch is too large (test_batch was set to 10); debugging shows the OOM happens during evaluation.
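A back-of-the-envelope check supports that. The tensor that OOMs at modules.py:110 (`pe_2 = torch.cat([pe_ss, pe_ee], dim=-1)`) has shape [batch, L, L, 2 * hidden], so memory grows quadratically in the padded lattice length L. Assuming hidden size 160 (FLAT's usual 8 heads × 20 dims) and fp32 — plug in your own config:

```python
# Rough memory estimate for the fused relative-position tensor
# pe_2 of shape [batch, L, L, 2 * hidden].
def fusion_mem_gib(batch, max_seq_lex, hidden=160, bytes_per_el=4):
    return batch * max_seq_lex ** 2 * 2 * hidden * bytes_per_el / 2 ** 30

print(fusion_mem_gib(10, 572))  # eval: test_batch=10, L=572 -> ~3.9 GiB
print(fusion_mem_gib(2, 601))   # train: batch=2, L=601    -> ~0.9 GiB
```

The ~3.9 GiB estimate is the same order as the 3.43 GiB allocation in the traceback, and the train-time tensor is only ~0.9 GiB, which matches the OOM surfacing inside _do_validation rather than during training.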
Ah, did you forget to clear the cache? Judging from your error message, if you were using sentences processed by my code, no sample longer than 300 should show up; I skip anything over 300.
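The point being: the preprocessed datasets are cached to disk (in this repo via fastNLP's cache_results), so a stale cache silently overrides any new clipping logic. A minimal sketch of clearing it, assuming the pickles land under a cache/ directory in the repo root — adjust the glob if your _cache_fp paths differ:

```python
import glob
import os

# Drop stale preprocessed-dataset caches so the clipped data is rebuilt
# on the next run instead of being loaded from the old pickles.
for fp in glob.glob('cache/*'):
    if os.path.isfile(fp):
        os.remove(fp)
        print('removed stale cache file:', fp)
```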
I adjusted the length and cleared the cache; the error did turn out to be the test-time batch being too large. Anyway, thanks for the code 😄
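For anyone else: evaluation runs through fastNLP's Tester, and its batch size is set independently of the training batch. A minimal sketch with illustrative variable names (whether your flat_main.py wires this through a test_batch flag, as in this thread, depends on your copy):

```python
from fastNLP import Tester

# Evaluate with a much smaller batch: peak memory of the position-fusion
# tensor scales linearly with batch size (and quadratically with lattice
# length), so the eval batch is the knob that matters here.
# datasets['test'], model, and metrics are the objects the training
# script already builds.
tester = Tester(data=datasets['test'],
                model=model,
                metrics=metrics,
                batch_size=1,   # was 10 in this thread
                device=0)
tester.test()
```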
Your code fixed the OOM problem for me, thanks!
Hi, does "cache" here mean the cached dataset? (As I understand it, that shouldn't need clearing; it lives on disk and is only loaded at run time.) Or is there something to do during training to keep GPU memory from blowing up? Thanks.
Running on a 32 GB card with clip left at the default 200, it overflows even with batch-size=1. Unbelievable. Then I found that the clip in preprocess doesn't actually clip anything: the longest sentence comes out to 1146 chars, so overflowing is hardly surprising... I modified the code and am posting it for anyone who hits the same problem.
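A minimal sketch of that kind of pre-lattice clip (illustrative names, not the exact patch from this thread). It hard-truncates each (chars, labels) pair to max_len characters before lattice construction; note that a truncation this blunt can cut an entity span at the boundary, so splitting at sentence punctuation is gentler when labels allow it.

```python
def clip_instance(chars, labels, max_len=200):
    """Truncate one (chars, labels) pair to at most max_len characters,
    keeping the character and label sequences aligned."""
    if len(chars) > max_len:
        chars, labels = chars[:max_len], labels[:max_len]
    return chars, labels

# Usage on an illustrative dataset of (chars, labels) pairs,
# including a 1146-char outlier like the one described above:
samples = [(list('一' * 1146), ['O'] * 1146)]
samples = [clip_instance(c, l, max_len=200) for c, l in samples]
assert all(len(c) <= 200 for c, _ in samples)
```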