Oneflow-Inc / DLPerf

DeepLearning Framework Performance Profiling Toolkit
Apache License 2.0

Out of memory error when running BERT for PyTorch #118

Open adloph1234 opened 3 years ago

adloph1234 commented 3 years ago

When running BERT in the PyTorch docker image provided by NVIDIA with fp32 precision, any batch size of 32 or above fails with an out of memory error. The parameters and hardware configuration are the same as in https://github.com/Oneflow-Inc/DLPerf/tree/master/NVIDIADeepLearningExamples/PyTorch/BERT — what could be causing this?

Flowingsun007 commented 3 years ago

Hi, first please confirm that the GPU environment is: Tesla V100-SXM2-16GB x 8. Another possible cause is that the docker container was started without enough shared memory, e.g. without --shm-size=16g.
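
You can double-check the shared-memory size from inside the container with a couple of lines of Python (a minimal sketch; with --shm-size=16g this should report about 16 GiB):

import shutil

# /dev/shm is the shared-memory mount whose size the docker --shm-size flag controls.
total, used, free = shutil.disk_usage("/dev/shm")
print("/dev/shm total: %.1f GiB, free: %.1f GiB" % (total / 2**30, free / 2**30))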

adloph1234 commented 3 years ago

Thanks. df shows that the container's shm size is 16g (copied as text, since images can't be uploaded):

tmpfs                          131862444          0  131862444   0% /sys/fs/cgroup
shm                             16777216          0   16777216   0% /dev/shm
/dev/mapper/node105--vg-root  1920488384 1688205160  134705036  93% /etc/hosts
tmpfs                          131862444         12  131862432   1% /proc/driver/nvidia

Other training parameter info:

Error message:

Iteration:   0%|          | 0/12776 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/examples/bert/run_pretraining.py", line 654, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/workspace/examples/bert/run_pretraining.py", line 571, in main
    prediction_scores, seq_relationship_score = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 889, in forward
    encoded_layers, pooled_output = self.bert(input_ids, token_type_ids, attention_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 824, in forward
    encoded_layers = self.encoder(embedding_output, extended_attention_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 508, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 470, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 443, in forward
    hidden_states = self.dense_act(hidden_states)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 174, in forward
    return self.biased_act_fn(self.bias, F.linear(input, self.weight, None))
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.78 GiB total capacity; 14.78 GiB already allocated; 9.44 MiB free; 14.83 GiB reserved in total by PyTorch)
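
For reference, the allocator statistics at the moment of failure can be dumped by wrapping the failing forward pass (a sketch, not part of the original script; torch.cuda.memory_summary() requires PyTorch >= 1.4):

import torch

# Hypothetical wrapper around the forward pass that fails in the traceback above:
# on a CUDA OOM, print the caching-allocator statistics before re-raising.
try:
    prediction_scores, seq_relationship_score = model(
        input_ids=input_ids,
        token_type_ids=segment_ids,
        attention_mask=input_mask,
    )
except RuntimeError:
    print(torch.cuda.memory_summary(device=0))
    raise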

adloph1234 commented 3 years ago

Hardware info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:2D:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:32:00.0 Off |                    0 |
| N/A   42C    P0    58W / 300W |   4076MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:5B:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:5F:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   37C    P0    42W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:BE:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:DF:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:E7:00.0 Off |                    0 |
| N/A   38C    P0    56W / 300W |   4240MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|

nlqq commented 3 years ago

In cases like this another possibility is a path error: when a path is wrong, the code in the NVIDIA repository surfaces the failure as a GPU out-of-memory error. Please check that every path you filled in exists and that the dataset path is valid. Answers to some common questions can be found in this article: https://zhuanlan.zhihu.com/p/276154597
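
A quick way to rule that out is to verify every path before launching. A minimal sketch (the flag names and paths below are placeholders; substitute the values you actually pass to run_pretraining.py):

import os

# Placeholder flags/paths -- substitute whatever your launch command passes in.
paths = {
    "--input_dir":   "/workspace/data/hdf5/training",
    "--output_dir":  "/workspace/results",
    "--config_file": "/workspace/examples/bert/bert_config.json",
}
for flag, path in paths.items():
    status = "OK" if os.path.exists(path) else "MISSING"
    print("%-8s %s = %s" % (status, flag, path))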

adloph1234 commented 3 years ago

@nlqq Thanks. It runs fine with batch size=16, so the path-error case can be ruled out. I also checked NVIDIA's own test results for the PyTorch docker image, https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-on-multiple-nvidia-dgx-1-with-16g, and NVIDIA's report likewise uses batch size 16. So my question is: did this benchmark use any special settings?

nlqq commented 3 years ago

You also need to modify the /workspace/examples/bert_config.json file in the container as follows:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

To be able to run BERT on a single machine, some of the parameters were modified as above.
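
For context: the config above is BERT-base (hidden_size=768, 12 layers). If I remember correctly, the default bert_config.json shipped with the NVIDIA example describes BERT-large (hidden_size=1024, 24 layers), roughly 3x the parameters, which would explain why fp32 pretraining at batch size 32 overflows a 16 GB card while batch size 16 still fits. A back-of-the-envelope parameter count (my own sketch, ignoring the pretraining heads):

def bert_params(hidden, layers, intermediate, vocab=30522, max_pos=512, type_vocab=2):
    # Token/position/segment embeddings plus their LayerNorm.
    emb = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Per encoder layer: Q/K/V/output projections, two LayerNorms, FFN up/down.
    layer = (4 * hidden * hidden + 4 * hidden          # attention projections
             + 2 * 2 * hidden                          # two LayerNorms
             + hidden * intermediate + intermediate    # FFN up-projection
             + intermediate * hidden + hidden)         # FFN down-projection
    pooler = hidden * hidden + hidden
    return emb + layers * layer + pooler

print("BERT-base:  %.0fM params" % (bert_params(768, 12, 3072) / 1e6))   # ~109M
print("BERT-large: %.0fM params" % (bert_params(1024, 24, 4096) / 1e6))  # ~335M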