PaddlePaddle / Paddle

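PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)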
http://www.paddlepaddle.org/
Apache License 2.0

Multi-node parameter-server distributed training fails #47326

Open Eddie-zhg opened 1 year ago

Eddie-zhg commented 1 year ago

Please ask your question

I checked communication between the machines, and it appears to be normal:

server: [Screenshot 2022-10-25 15-39-51] [Screenshot 2022-10-25 15-40-07]
worker: [Screenshot 2022-10-25 15-40-44] [Screenshot 2022-10-25 15-41-03]

Running the job fails:

$ python -m paddle.distributed.launch --master=169.254.60.61:60437 --nnodes=2 train.py
LAUNCH WARNING 2022-10-25 15:17:05,740 Host ip reset to 169.254.60.61
LAUNCH INFO 2022-10-25 15:17:05,740 ----------- Configuration ----------------------
LAUNCH INFO 2022-10-25 15:17:05,740 devices: None
LAUNCH INFO 2022-10-25 15:17:05,740 elastic_level: -1
LAUNCH INFO 2022-10-25 15:17:05,740 elastic_timeout: 30
LAUNCH INFO 2022-10-25 15:17:05,740 gloo_port: 6767
LAUNCH INFO 2022-10-25 15:17:05,740 host: 169.254.60.61
LAUNCH INFO 2022-10-25 15:17:05,741 job_id: default
LAUNCH INFO 2022-10-25 15:17:05,741 legacy: False
LAUNCH INFO 2022-10-25 15:17:05,741 log_dir: log
LAUNCH INFO 2022-10-25 15:17:05,741 log_level: INFO
LAUNCH INFO 2022-10-25 15:17:05,741 master: 169.254.60.61:60437
LAUNCH INFO 2022-10-25 15:17:05,741 max_restart: 3
LAUNCH INFO 2022-10-25 15:17:05,741 nnodes: 2
LAUNCH INFO 2022-10-25 15:17:05,741 nproc_per_node: None
LAUNCH INFO 2022-10-25 15:17:05,741 rank: -1
LAUNCH INFO 2022-10-25 15:17:05,741 run_mode: collective
LAUNCH INFO 2022-10-25 15:17:05,741 server_num: None
LAUNCH INFO 2022-10-25 15:17:05,741 servers:
LAUNCH INFO 2022-10-25 15:17:05,741 trainer_num: None
LAUNCH INFO 2022-10-25 15:17:05,741 trainers:
LAUNCH INFO 2022-10-25 15:17:05,741 training_script: train.py
LAUNCH INFO 2022-10-25 15:17:05,741 training_script_args: []
LAUNCH INFO 2022-10-25 15:17:05,741 with_gloo: 0
LAUNCH INFO 2022-10-25 15:17:05,741 --------------------------------------------------
LAUNCH INFO 2022-10-25 15:17:05,745 Job: default, mode collective, replicas 2[2:2], elastic False
LAUNCH INFO 2022-10-25 15:17:05,745 Waiting peer start...
LAUNCH INFO 2022-10-25 15:17:09,023 Run Pod: fpwuxo, replicas 1, status ready
LAUNCH INFO 2022-10-25 15:17:09,035 Watching Pod: fpwuxo, replicas 1, status running
/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/executor.py:400: UserWarning: do not use standalone executor in fleet by default
  warnings.warn("do not use standalone executor in fleet by default")
/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/base/fleet_base.py:125: UserWarning: init_worker() function doesn't work when use non_distributed fleet.
  warnings.warn(
device worker program id: 139891026931088
I1025 15:17:09.854562 159406 multi_trainer.cc:164] MultiTrainer::InitOtherEnv Communicator is null!
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what(): In user code:

File "train.py", line 10, in <module>
  model.net(is_train=True)
File "/home/ubuntu/桌面/wide_and_deep_dataset/model.py", line 177, in net
  pred = wide_deep_model.forward(sparse_inputs, dense_input)
File "/home/ubuntu/桌面/wide_and_deep_dataset/model.py", line 58, in forward
  emb = paddle.static.nn.sparse_embedding(s_input, size = [1024, self.sparse_feature_dim], param_attr=paddle.ParamAttr(name="embedding"))
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/contrib/layers/nn.py", line 1188, in sparse_embedding
  helper.append_op(
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/layer_helper.py", line 44, in append_op
  return self.main_program.current_block().append_op(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/framework.py", line 3615, in append_op
  op = Operator(
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/framework.py", line 2635, in __init__
  for frame in traceback.extract_stack():

NotFoundError: Input id (227854) is not in current rows table. (at /paddle/paddle/phi/core/selected_rows_impl.h:84)
  [operator < lookup_table > error]

C++ Traceback (most recent call last):

0 paddle::framework::HogwildWorker::TrainFiles()


Error Message Summary:

FatalError: Process abort signal is detected by the operating system. [TimeInfo: Aborted at 1666682230 (unix time) try "date -d @1666682230" if you are using GNU date ] [SignalInfo: SIGABRT (@0x3e800026eae) received by PID 159406 (TID 0x7f3ae8ec4700) from PID 159406 ]

LAUNCH INFO 2022-10-25 15:17:31,067 Pod failed LAUNCH ERROR 2022-10-25 15:17:31,067 Container failed !!! Container rank 0 status failed cmd ['/usr/bin/python3.8', '-u', 'train.py'] code -6 log log/default.fpwuxo.0.log env {'SHELL': '/bin/bash', 'SESSION_MANAGER': 'local/ubuntu-Precision-5820-Tower-X-Series:@/tmp/.ICE-unix/1785,unix/ubuntu-Precision-5820-Tower-X-Series:/tmp/.ICE-unix/1785', 'QT_ACCESSIBILITY': '1', 'XDG_CONFIG_DIRS': '/etc/xdg/xdg-ubuntu:/etc/xdg', 'XDG_MENU_PREFIX': 'gnome-', 'GNOME_DESKTOP_SESSION_ID': 'this-is-deprecated', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', '_CE_M': '', 'TERMINAL_EMULATOR': 'JetBrains-JediTerm', 'LANGUAGE': 'zh_CN:en_US:en', 'LC_ADDRESS': 'zh_CN.UTF-8', 'GNOME_SHELL_SESSION_MODE': 'ubuntu', 'LC_NAME': 'zh_CN.UTF-8', 'SSH_AUTH_SOCK': '/run/user/1000/keyring/ssh', 'TERM_SESSION_ID': 'c2edaa9a-8094-4523-8d8f-2bb4e93050cf', 'XMODIFIERS': '@im=ibus', 'DESKTOP_SESSION': 'ubuntu', 'LC_MONETARY': 'zh_CN.UTF-8', 'SSH_AGENT_PID': '1750', 'GTK_MODULES': 'gail:atk-bridge', 'PWD': '/home/ubuntu/桌面/wide_and_deep_dataset', 'XDG_SESSION_DESKTOP': 'ubuntu 'LOGNAME': 'ubuntu', 'XDG_SESSION_TYPE': 'x11', 'CONDA_PREFIX': '/home/ubuntu/anaconda3', 'GPG_AGENT_INFO': '/run/user/1000/gnupg/S.gpg-agent:0:1', 'XAUTHORITY': '/run/user/1000/gdm/Xauthority', 'DESKTOP_STARTUP_ID': 'gnome-shell/PyCharm Professional Edition/1799-63-ubuntu-Precision-5820-Tower-X-Series_TIME884648996', 'GJS_DEBUG_TOPICS': 'JS ERROR;JS LOG', 'WINDOWPATH': '2', 'HOME': '/home/ubuntu', 'USERNAME': 'ubuntu', 'IM_CONFIG_PHASE': '1', 'LANG': 'zh_CN.UTF-8', 'LC_PAPER': 'zh_CN.UTF-8', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:', 'XDG_CURRENT_DESKTOP': 'ubuntu:GNOME', 'VIRTUAL_ENV': '/home/ubuntu/venv', 'CONDA_PROMPT_MODIFIER': '(base) ', 'INVOCATION_ID': 'e657d89aa279488b9eef4a9e2db9b1b7', 'MANAGERPID': '1563', 'GJS_DEBUG_OUTPUT': 'stderr', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'XDG_SESSION_CLASS': 'user', 'TERM': 'xterm-256color', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', '_CE_CONDA': '', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'USER': 'ubuntu', 'CONDA_SHLVL': '1', 'DISPLAY': ':1', 'SHLVL': '1', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'QT_IM_MODULE': 'ibus', 'LC_MEASUREMENT': 'zh_CN.UTF-8', 'PAPERSIZE': 'a4', 'POD_IP': '169.254.60.61', 
'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'LD_LIBRARY_PATH': '/usr/local/cuda/lib64:/usr/local/lib:/home/ubuntu/nccl_2.8.4-1+cuda11.2_x86_64/include/:/~/nccl_2.8.4-1+cuda11.2_x86_64/lib', 'XDG_RUNTIME_DIR': '/run/user/1000', 'PS1': '(venv) (base) \[\e]0;\u@\h: \w\a\]${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ ', 'CONDA_DEFAULT_ENV': 'base', 'LC_TIME': 'zh_CN.UTF-8', 'JOURNAL_STREAM': '8:54528', 'XDG_DATA_DIRS': '/usr/share/ubuntu:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop', 'PATH': '/home/ubuntu/venv/bin:/home/ubuntu/anaconda3/bin:/home/ubuntu/anaconda3/condabin:/usr/local/cuda/bin:/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ubuntu/anaconda3/bin', 'GDMSESSION': 'ubuntu', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus', 'GIO_LAUNCHED_DESKTOP_FILE_PID': '151672', 'GIO_LAUNCHED_DESKTOP_FILE': '/usr/share/applications/jetbrains-pycharm.desktop', 'LC_NUMERIC': 'zhCN.UTF-8', '': '/usr/bin/python3.8', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'PADDLE_MASTER': '169.254.60.61:46249', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '1', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_TRAINER_ENDPOINTS': '169.254.60.61:35173,127.0.1.1:48943', 'PADDLE_CURRENT_ENDPOINT': '169.254.60.61:35173', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_cpus': ''} /home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/executor.py:400: UserWarning: do not use standalone executor in fleet by default warnings.warn("do not use standalone executor in fleet by default") /home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/base/fleet_base.py:125: UserWarning: init_worker() function doesn't work when use non_distributed fleet. 

LAUNCH INFO 2022-10-25 15:17:31,561 Exit code -6
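
The environment dump above reports PADDLE_TRAINER_ENDPOINTS as '169.254.60.61:35173,127.0.1.1:48943', so the second endpoint is a loopback address. Whether that is related to the failure is not established in this log. A minimal sketch, using only the Python standard library, for flagging loopback entries in such a value; the helper name `loopback_endpoints` is hypothetical and not part of Paddle:

```python
import ipaddress
from typing import List


def loopback_endpoints(endpoints: str) -> List[str]:
    """Return the "ip:port" entries whose IP is a loopback address."""
    flagged = []
    for entry in endpoints.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last colon so the port is separated from the host.
        host, _, _port = entry.rpartition(":")
        try:
            if ipaddress.ip_address(host).is_loopback:
                flagged.append(entry)
        except ValueError:
            # Not a literal IP (for example a hostname); skipped in this sketch.
            continue
    return flagged


# Value copied from the environment dump in the log above.
value = "169.254.60.61:35173,127.0.1.1:48943"
print(loopback_endpoints(value))  # prints ['127.0.1.1:48943']
```

The check only inspects the address literal; it does not test whether the peer is actually reachable.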

I am running the CPU build of Paddle 2.3.2. What is going wrong here?
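
The issue title refers to parameter-server training, while the launcher's configuration block in the log reports run_mode: collective, with server_num and trainer_num both None; the command shown passes only --master and --nnodes. A minimal sketch for pulling those key/value pairs out of LAUNCH INFO lines so the effective mode can be checked; `parse_launch_config` is a hypothetical helper, and the regular expression assumes the line format seen in this log:

```python
import re

# Matches lines such as:
# "LAUNCH INFO 2022-10-25 15:17:05,741 run_mode: collective"
_CONFIG_LINE = re.compile(
    r"^LAUNCH INFO \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+): ?(.*)$"
)


def parse_launch_config(log_text):
    """Collect "key: value" pairs from LAUNCH INFO lines into a dict."""
    config = {}
    for line in log_text.splitlines():
        match = _CONFIG_LINE.match(line.strip())
        if match:
            key, value = match.groups()
            config[key] = value.strip()
    return config


# A few lines copied from the configuration block in the log above.
sample = "\n".join([
    "LAUNCH INFO 2022-10-25 15:17:05,741 nnodes: 2",
    "LAUNCH INFO 2022-10-25 15:17:05,741 run_mode: collective",
    "LAUNCH INFO 2022-10-25 15:17:05,741 server_num: None",
    "LAUNCH INFO 2022-10-25 15:17:05,741 servers:",
    "LAUNCH INFO 2022-10-25 15:17:05,741 trainer_num: None",
])

config = parse_launch_config(sample)
print(config["run_mode"])  # prints collective
```

Values are kept as strings exactly as logged, so an unset option appears as the text "None" and an empty one as an empty string.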

paddle-bot[bot] commented 1 year ago

Hi! We have received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also look for an answer in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!

Eddie-zhg commented 1 year ago

Is anyone going to reply?