649453932 / Bert-Chinese-Text-Classification-Pytorch

Chinese text classification using BERT and ERNIE
MIT License
3.97k stars 896 forks

Has anyone run into a GPU memory leak when training BERT? #106

Open Wanjun0511 opened 3 years ago

Wanjun0511 commented 3 years ago

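The first few thousand steps are stable: about half of the GPU memory is in use, training runs normally, and the loss goes down. After that, memory usage suddenly starts growing continuously until the process runs out of memory. The error is almost always raised at the location below. Any help would be appreciated.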

  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/embed.py", line 84, in forward
    text_encoded_layer, _ = self.bert_model(text_var, text_segments_ids, output_all_encoded_layers=False)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/pytorch_pretrained_bert/modeling.py", line 733, in forward
    output_all_encoded_layers=output_all_encoded_layers)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/pytorch_pretrained_bert/modeling.py", line 406, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/pytorch_pretrained_bert/modeling.py", line 392, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/pytorch_pretrained_bert/modeling.py", line 365, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/anaconda3/envs/python2.7/lib/python2.7/site-packages/pytorch_pretrained_bert/modeling.py", line 124, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 3; 10.92 GiB total capacity; 9.54 GiB already allocated; 15.38 MiB free; 10.35 GiB reserved in total by PyTorch)
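The numbers in the final error line can be read more easily after converting them to one unit. The snippet below is a minimal sketch that parses the message quoted in the traceback; the helper name `parse_oom_message` and the regular expressions are illustrative assumptions, not part of PyTorch, and they only target this particular message wording.

```python
import re

# Message text copied from the traceback in this issue.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 54.00 MiB (GPU 3; 10.92 GiB total "
    "capacity; 9.54 GiB already allocated; 15.38 MiB free; 10.35 GiB reserved "
    "in total by PyTorch)"
)

_UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the sizes named in the OOM message, converted to MiB.

    Hypothetical helper: the patterns match the wording shown above and may
    not match messages produced by other PyTorch versions.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free": r"([\d.]+) (KiB|MiB|GiB) free",
        "reserved": r"([\d.]+) (KiB|MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * _UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom_message(OOM_MESSAGE)
# Memory reserved by PyTorch but not currently occupied by tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]
live_fraction = sizes["allocated"] / sizes["total"]
print(f"allocated to tensors: {live_fraction:.0%} of total capacity")
print(f"reserved but unused:  {cached_unused:.0f} MiB")
```

Here 9.54 GiB of the 10.92 GiB card is already allocated, while only about 0.8 GiB is reserved but unused. Compared with the "about half" usage reported for the early steps, this suggests that allocated memory really did grow during training, although the message alone cannot say whether the cause is a leak or a few unusually large batches.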

igeng commented 2 years ago

Isn't this just the GPU running out of memory? The author used a 2080 Ti. Try reducing batch_size, or try training on less data.
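One way to try the batch_size suggestion without restarting training is to split a batch in half whenever a step fails with an out-of-memory error. The following is a minimal sketch, not code from this repository: `run_with_smaller_batches` and `train_step` are hypothetical names, and the only assumption about PyTorch is that the failure surfaces as a `RuntimeError` whose message contains "out of memory", as in the traceback above.

```python
def run_with_smaller_batches(train_step, batch, min_batch_size=1):
    """Run train_step on batch, halving any chunk that hits an OOM error.

    train_step is a hypothetical callable that takes a list of examples and
    raises RuntimeError containing "out of memory" when the GPU is exhausted.
    Returns the sizes of the chunks that were processed successfully.
    """
    pending = [list(batch)]
    processed = []
    while pending:
        chunk = pending.pop(0)
        try:
            train_step(chunk)
            processed.append(len(chunk))
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or an OOM that cannot
            # be fixed by splitting further.
            if "out of memory" not in str(err) or len(chunk) <= min_batch_size:
                raise
            half = len(chunk) // 2
            # Put both halves at the front so the original order is kept.
            pending[:0] = [chunk[:half], chunk[half:]]
    return processed


def fake_train_step(chunk):
    """Stand-in for a real training step: pretends 4 examples is the limit."""
    if len(chunk) > 4:
        raise RuntimeError("CUDA out of memory.")


chunk_sizes = run_with_smaller_batches(fake_train_step, list(range(16)))
print(chunk_sizes)
```

In a real training loop, splitting changes the effective batch size per optimizer step unless gradients are accumulated across the chunks, and calling `torch.cuda.empty_cache()` before retrying may help. If memory keeps growing even with small batches, this workaround will not address the underlying cause.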

liuxiaobei6667 commented 2 years ago

How can this be solved?

liuxiaobei6667 commented 2 years ago

Did you manage to solve this in the end?