dbiir / UER-py

Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
https://github.com/dbiir/UER-py/wiki
Apache License 2.0

Error when training BERT with DeepSpeed on multiple nodes and multiple GPUs #198

Closed RyanHuangNLP closed 2 years ago

RyanHuangNLP commented 3 years ago

This is my shell script:

export NCCL_IB_CUDA_SUPPORT=1 && export NCCL_DEBUG=INFO && export NCCL_IB_DISABLE=0 && export NCCL_IB_GID_INDEX=3 && deepspeed \
          --hostfile=./hostfile \
         pretrain.py \
          --deepspeed --deepspeed_config models/deepspeed_config.json \
          --dataset_path dataset_wwm.pt \
          --vocab_path models/google_zh_vocab.txt \
          --config_path models/bert/large_config.json \
          --output_model_path models/roberta_large_seq128_model.bin \
          --world_size 16 --total_steps 4000000 --save_checkpoint_steps 128 --report_steps 128 \
          --learning_rate 1e-4 --batch_size 16 \
          --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights 

The DeepSpeed config file:

{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 128,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00001,
      "weight_decay": 0.01
    }
  },
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
  },
  "wall_clock_breakdown": false
}

The following error occurs:

b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:   File "/data/ceph_11015/ssd/ramseyhuang/UER/uer/utils/config.py", line 21, in <dictcomp>\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b"11.213.17.148: KeyError: 'local_rank=4'\n"
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b"11.213.17.148: KeyError: 'local_rank=1'\n"
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b"11.213.17.148: KeyError: 'local_rank=7'\n"
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b"11.213.17.148: KeyError: 'local_rank=6'\n"
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b"11.213.17.148: KeyError: 'local_rank=2'\n"
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b"11.213.17.148: KeyError: 'local_rank=3'\n"
b"11.213.17.148: KeyError: 'local_rank=0'\n"
b'11.213.17.148:     input_args = {k: args_dict[k] for k in [a[2:] for a in sys.argv if a[:2] == "--"]}\n'
b"11.213.17.148: KeyError: 'local_rank=5'\n"
b'11.213.17.148: Killing subprocess 2355\n'
b'11.213.17.148: Killing subprocess 2356\n'
b'11.213.17.148: Killing subprocess 2357\n'
b'11.213.17.148: Killing subprocess 2358\n'
b'11.213.17.148: Killing subprocess 2359\n'
b'11.213.17.148: Killing subprocess 2360\n'
b'11.213.17.148: Killing subprocess 2361\n'
b'11.213.17.148: Killing subprocess 2362\n'
b'11.213.17.148: Traceback (most recent call last):\n'
b'11.213.17.148:   File "/data/miniconda3/envs/env-3.6.8/lib/python3.6/runpy.py", line 193, in _run_module_as_main\n'
b'11.213.17.148:     "__main__", mod_spec)\n'
b'11.213.17.148:   File "/data/miniconda3/envs/env-3.6.8/lib/python3.6/runpy.py", line 85, in _run_code\n'
b'11.213.17.148:     exec(code, run_globals)\n'
b'11.213.17.148:   File "/data/miniconda3/envs/env-3.6.8/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 171, in <module>\n'
b'11.213.17.148:     main()\n'
b'11.213.17.148:   File "/data/miniconda3/envs/env-3.6.8/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 161, in main\n'
b'11.213.17.148:     sigkill_handler(signal.SIGTERM, None)  # not coming back\n'
b'11.213.17.148:   File "/data/miniconda3/envs/env-3.6.8/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler\n'
b'11.213.17.148:     raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)\n'

Is the main problem the KeyError: 'local_rank=4'?

hhou435 commented 3 years ago

Hello, this is a bug caused by the argument format. It has been fixed. Thank you for your interest in the project.
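
For readers hitting the same traceback: the DeepSpeed launcher passes the rank to each worker as a single token such as --local_rank=4, so the dictionary comprehension in uer/utils/config.py strips the leading dashes and then looks up the key 'local_rank=4', which does not exist in args_dict. The sketch below reproduces the failure and shows one possible way to tolerate the --key=value form. The helper name explicit_cli_args and the sample argv and args_dict values are hypothetical, and this is not necessarily the fix that was applied in UER-py.

```python
def explicit_cli_args(argv, args_dict):
    """Return the entries of args_dict that were named explicitly in argv.

    Hypothetical helper: accepts both "--key value" and "--key=value" tokens.
    """
    keys = []
    for token in argv:
        if token.startswith("--"):
            # "--local_rank=4" -> "local_rank"; "--batch_size" -> "batch_size"
            keys.append(token[2:].split("=", 1)[0])
    # Skip names the parser does not know instead of raising KeyError.
    return {k: args_dict[k] for k in keys if k in args_dict}


# Illustrative values only (assumptions, not taken from the real run).
argv = ["pretrain.py", "--local_rank=4", "--batch_size", "16"]
args_dict = {"local_rank": 4, "batch_size": 16, "learning_rate": 1e-4}

# The comprehension from the traceback fails on the "--local_rank=4" token.
try:
    {k: args_dict[k] for k in [a[2:] for a in argv if a[:2] == "--"]}
    original_error = None
except KeyError as err:
    original_error = err.args[0]  # the missing key, "local_rank=4"

# The tolerant version recovers both explicitly passed arguments.
recovered = explicit_cli_args(argv, args_dict)
print(original_error, recovered)
```

Splitting on the first "=" keeps the rest of the logic unchanged; simply ignoring any token that contains local_rank would be another reasonable choice.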