Oneflow-Inc / DLPerf

DeepLearning Framework Performance Profiling Toolkit
Apache License 2.0

Out of memory error when running BERT for PyTorch #118

Open adloph1234 opened 3 years ago

adloph1234 commented 3 years ago

When running BERT in the PyTorch docker image provided by NVIDIA with fp32 precision, any batch size of 32 or above fails with an out of memory error. The parameters and hardware configuration are the same as in https://github.com/Oneflow-Inc/DLPerf/tree/master/NVIDIADeepLearningExamples/PyTorch/BERT — what could be causing this?

Flowingsun007 commented 3 years ago

Hi, first please confirm that the GPU environment is: Tesla V100-SXM2-16GB x 8. Another possible cause is that the docker container was started without enough shared memory, e.g. without --shm-size=16g.
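
You can double-check the shared-memory size from inside the container with a couple of lines of Python (a minimal sketch; with --shm-size=16g this should report about 16 GiB):

import shutil

# /dev/shm is the shared-memory mount whose size the docker --shm-size flag controls.
total, used, free = shutil.disk_usage("/dev/shm")
print("/dev/shm total: %.1f GiB, free: %.1f GiB" % (total / 2**30, free / 2**30))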

adloph1234 commented 3 years ago

Thanks. df shows that the container's shm size is 16g (copied as text, since images can't be uploaded):

tmpfs                          131862444          0  131862444   0% /sys/fs/cgroup
shm                             16777216          0   16777216   0% /dev/shm
/dev/mapper/node105--vg-root  1920488384 1688205160  134705036  93% /etc/hosts
tmpfs                          131862444         12  131862432   1% /proc/driver/nvidia

Other training parameter info:

Error message:

Iteration:   0%|          | 0/12776 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/examples/bert/run_pretraining.py", line 654, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/workspace/examples/bert/run_pretraining.py", line 571, in main
    prediction_scores, seq_relationship_score = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 889, in forward
    encoded_layers, pooled_output = self.bert(input_ids, token_type_ids, attention_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 824, in forward
    encoded_layers = self.encoder(embedding_output, extended_attention_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 508, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 470, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 443, in forward
    hidden_states = self.dense_act(hidden_states)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/examples/bert/modeling.py", line 174, in forward
    return self.biased_act_fn(self.bias, F.linear(input, self.weight, None))
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.78 GiB total capacity; 14.78 GiB already allocated; 9.44 MiB free; 14.83 GiB reserved in total by PyTorch)
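
For reference, the allocator statistics at the moment of failure can be dumped by wrapping the failing forward pass (a sketch, not part of the original script; torch.cuda.memory_summary() requires PyTorch >= 1.4):

import torch

# Hypothetical wrapper around the forward pass that fails in the traceback above:
# on a CUDA OOM, print the caching-allocator statistics before re-raising.
try:
    prediction_scores, seq_relationship_score = model(
        input_ids=input_ids,
        token_type_ids=segment_ids,
        attention_mask=input_mask,
    )
except RuntimeError:
    print(torch.cuda.memory_summary(device=0))
    raise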

adloph1234 commented 3 years ago

Hardware info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:2D:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:32:00.0 Off |                    0 |
| N/A   42C    P0    58W / 300W |   4076MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:5B:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:5F:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   37C    P0    42W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:BE:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:DF:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:E7:00.0 Off |                    0 |
| N/A   38C    P0    56W / 300W |   4240MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|

nlqq commented 3 years ago

In cases like this another possibility is a path error: when a path is wrong, the code in the NVIDIA repository surfaces the failure as a GPU out-of-memory error. Please check that every path you filled in exists and that the dataset path is valid. Answers to some common questions can be found in this article: https://zhuanlan.zhihu.com/p/276154597
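
A quick way to rule that out is to verify every path before launching. A minimal sketch (the flag names and paths below are placeholders; substitute the values you actually pass to run_pretraining.py):

import os

# Placeholder flags/paths -- substitute whatever your launch command passes in.
paths = {
    "--input_dir":   "/workspace/data/hdf5/training",
    "--output_dir":  "/workspace/results",
    "--config_file": "/workspace/examples/bert/bert_config.json",
}
for flag, path in paths.items():
    status = "OK" if os.path.exists(path) else "MISSING"
    print("%-8s %s = %s" % (status, flag, path))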

adloph1234 commented 3 years ago

@nlqq Thanks. It runs fine with batch size=16, so the path-error case can be ruled out. I also checked NVIDIA's own test results for the PyTorch docker image, https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-on-multiple-nvidia-dgx-1-with-16g, and NVIDIA's report likewise uses batch size 16. So my question is: did this benchmark use any special settings?

nlqq commented 3 years ago

You also need to modify the /workspace/examples/bert_config.json file in the container as follows:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

To be able to run BERT on a single machine, some of the parameters were modified as above.
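
For context: the config above is BERT-base (hidden_size=768, 12 layers). If I remember correctly, the default bert_config.json shipped with the NVIDIA example describes BERT-large (hidden_size=1024, 24 layers), roughly 3x the parameters, which would explain why fp32 pretraining at batch size 32 overflows a 16 GB card while batch size 16 still fits. A back-of-the-envelope parameter count (my own sketch, ignoring the pretraining heads):

def bert_params(hidden, layers, intermediate, vocab=30522, max_pos=512, type_vocab=2):
    # Token/position/segment embeddings plus their LayerNorm.
    emb = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Per encoder layer: Q/K/V/output projections, two LayerNorms, FFN up/down.
    layer = (4 * hidden * hidden + 4 * hidden          # attention projections
             + 2 * 2 * hidden                          # two LayerNorms
             + hidden * intermediate + intermediate    # FFN up-projection
             + intermediate * hidden + hidden)         # FFN down-projection
    pooler = hidden * hidden + hidden
    return emb + layers * layer + pooler

print("BERT-base:  %.0fM params" % (bert_params(768, 12, 3072) / 1e6))   # ~109M
print("BERT-large: %.0fM params" % (bert_params(1024, 24, 4096) / 1e6))  # ~335M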