Server core dumps in PS asynchronous training mode

Li-Jiajie commented 2 years ago

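When DistributedStrategy has a_sync=True, the server core dumps; with a_sync=False everything works. Setup: 2 servers, 3 workers, CPU mode. Both Adam and SGD core dump at the location shown in the screenshot. What could be causing this?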

yinhaofeng commented 2 years ago

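Could you describe in detail which script you launched with, the launch command and its arguments, and which Paddle and PaddleRec versions you are using? This part of the code has changed a lot recently and we could not reproduce your problem, so we need a more detailed description of the steps to reproduce.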

Li-Jiajie commented 2 years ago

I also asked about this in the Paddle project; see https://github.com/PaddlePaddle/Paddle/issues/37346. I'm using v2.2.0, which was released just last Friday.

Launch command: fleetrun --worker_num=4 --server_num=3 tools/static_ps_trainer.py -m models/rank/dlrm/config_bigdata.yaml
Paddle version: 2.2.0 (CPU)
Python: 3.6.8
OS: CentOS Linux release 7.2 (Final)

Li-Jiajie commented 2 years ago

I ran the same command in the official Paddle image, and it still core dumped after training a few batches, with the same error. @yinhaofeng

yinhaofeng commented 2 years ago

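I still cannot reproduce your error on my side. Could you explain in more detail how you got this error, and whether you changed any code or configuration?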

Li-Jiajie commented 2 years ago

1. Use the official registry.baidubce.com/paddlepaddle/paddle:2.2.0 image directly (CPU version).
2. After cloning PaddleRec, go to the datasets directory and download the full Criteo dataset.
3. Set use_gpu to false in models/rank/dlrm/config_bigdata.yaml.
4. From the PaddleRec root directory, run fleetrun --worker_num=4 --server_num=3 tools/static_ps_trainer.py -m models/rank/dlrm/config_bigdata.yaml

yinhaofeng commented 2 years ago

Which PaddleRec version are you using, and have you pulled the latest code?

Li-Jiajie commented 2 years ago

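v2.2.0, with the latest code.

Single-machine runs are all fine, but it core dumps in PS mode. The machine has 96 cores and 256 GB of RAM, so memory is sufficient.

I previously wrote my own DLRM implementation, and it core dumped as well.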

yinhaofeng commented 2 years ago

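A log directory is created when you run the job. Please send us screenshots of workerlog.0 and serverlog.0, along with the console output. If the full dataset takes too long, please check whether the demo data triggers the same bug.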
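Before sharing the logs, it can help to find out which of them actually contain a crash. The following is only a minimal sketch: the directory name `log` in the usage note and the marker strings are assumptions for illustration, and the glob simply matches file names of the form `serverlog.N` and `workerlog.N` mentioned above.

```python
from pathlib import Path

# Assumed crash markers; adjust them to whatever your logs actually contain.
CRASH_MARKERS = ("Segmentation fault", "FatalError")


def find_crashed_logs(log_dir, markers=CRASH_MARKERS):
    """Return {file name: [matching lines]} for logs containing a crash marker."""
    hits = {}
    for path in sorted(Path(log_dir).glob("*log.*")):
        if not path.is_file():
            continue
        lines = path.read_text(errors="replace").splitlines()
        matched = [ln.strip() for ln in lines if any(m in ln for m in markers)]
        if matched:
            hits[path.name] = matched
    return hits
```

Calling `find_crashed_logs("log")` from the directory where the job was launched would list only the files worth attaching first; a missing directory simply yields an empty result.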

Li-Jiajie commented 2 years ago

registry.baidubce.com/paddlepaddle/paddle:2.2.0

server.0:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

+=======================================================================================+
|                PaddleRec Benchmark Envs                      Value                    |
|        hyper_parameters.bot_layer_sizes               [512, 256, 64, 16]              |
|                           runner.epochs                        1                      |
|                 runner.infer_batch_size                      2048                     |
|                runner.infer_start_epoch                        0                      |
|                  runner.model_save_path                output_model_dlrm              |
|                   runner.print_interval                       100                     |
|                  runner.split_file_list                      False                    |
|                        runner.sync_mode                      async                    |
|                    runner.test_data_dir  ../../../datasets/criteo/slot_test_data_full |
|                       runner.thread_num                        1                      |
|                          runner.use_gpu                      False                    |

"When training, we now always track global mean and variance.")
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[methodname]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: PSERVER --
INFO:main:Run Server Begin
I1122 08:57:55.800235 10444 brpc_ps_server.cc:65] running server with rank id: 0, endpoint: 127.0.0.1:36520

C++ Traceback (most recent call last):

0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
1 std::future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::future_base::_Result_base::_Deleter> ()>, bool)
2 paddle::distributed::SAdam::update(unsigned long const, float const, unsigned long, std::vector<unsigned long, std::allocator > const&, paddle::distributed::ValueBlock*)

Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1637571537 (unix time) try "date -d @1637571537" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x0) received by PID 10444 (TID 0x7f345eff5700) from PID 0 ]

worker.0:

INFO:utils.static_ps.reader_helper:File: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/../../../datasets/criteo/slot_train_data_full/part-130 has 200000 examples
INFO:utils.static_ps.reader_helper:File: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/../../../datasets/criteo/slot_train_data_full/part-160 has 200000 examples
INFO:utils.static_ps.reader_helper:Total example: 44000000
/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:341: UserWarning: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/net.py:103
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[methodname]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: TRAINER --
INFO:main:Run Worker Begin
INFO:main:Epoch: 0, Running RecDatast Begin.
INFO:main:Epoch: 0, Batch_id: 0, cost: [0.8848212], auc: [0.50078756], avg_reader_cost: 0.00240 sec, avg_batch_cost: 0.00396 sec, avg_samples: 20.48000, ips: 5166.07934 example/sec
INFO:main:Epoch: 0, Batch_id: 100, cost: [0.5674137], auc: [0.64664944], avg_reader_cost: 0.17707 sec, avg_batch_cost: 0.24327 sec, avg_samples: 2048.00000, ips: 8418.67772 example/sec
W1122 08:58:57.628669 11033 input_messenger.cpp:222] Fail to read from fd=19 SocketId=212@127.0.0.1:51669@46133: Connection reset by peer [104]
W1122 08:58:57.628697 11045 input_messenger.cpp:222] Fail to read from fd=21 SocketId=320@127.0.0.1:51669@46134: Connection reset by peer [104]
W1122 08:58:57.628742 11044 input_messenger.cpp:222] Fail to read from fd=8 SocketId=108@127.0.0.1:51669@46073: Connection reset by peer [104]
I1122 08:58:57.729051 11016 socket.cpp:2370] Checking SocketId=1@127.0.0.1:51669
W1122 08:58:57.759953 11007 input_messenger.cpp:222] Fail to read from fd=13 SocketId=210@127.0.0.1:36520@61550: Connection reset by peer [104]
W1122 08:58:57.759985 11033 input_messenger.cpp:222] Fail to read from fd=17 SocketId=318@127.0.0.1:36520@61551: Connection reset by peer [104]
E1122 08:58:57.760069 11010 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E104]Fail to read from fd=13 SocketId=210@127.0.0.1:36520@61550: Connection reset by peer [R1][E111]Fail to connect SocketId=17179869191@127.0.0.1:36520: Connection refused [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760084 11029 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E1014]Got EOF of fd=12 SocketId=102@127.0.0.1:36520@61423 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760102 11044 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E1014]Got EOF of fd=10 SocketId=312@127.0.0.1:36520@61519 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760108 11074 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760123 11029 brpc_ps_client.cc:194] resquest cmd_id:3 failed, err:[E1014]Got EOF of fd=14 SocketId=306@127.0.0.1:36520@61425 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760124 11073 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760130 10453 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760159 11045 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E104]Fail to read from fd=17 SocketId=318@127.0.0.1:36520@61551: Connection reset by peer [R1][E111]Fail to connect SocketId=8589934804@127.0.0.1:36520@46133: Connection refused [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet

It stopped after the second batch.
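One way to check the order of events is to convert the abort time from the server's `[TimeInfo: Aborted at 1637571537 (unix time)]` line and compare it with the timestamps of the worker's `Connection reset by peer` warnings. The sketch below does the conversion in UTC; whether the log timestamps are also in UTC depends on the container's clock, which is an assumption here.

```python
from datetime import datetime, timezone


def crash_time_utc(unix_seconds):
    """Convert the unix time reported in the crash summary to a UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# Value copied from the server log's TimeInfo line.
print(crash_time_utc(1637571537).strftime("%Y-%m-%d %H:%M:%S"))
# prints "2021-11-22 08:58:57"
```

If the container clock is UTC, this is the same second as the first `W1122 08:58:57` warnings in the worker log, which would be consistent with the workers losing their connections at the moment the server process aborted.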

yinhaofeng commented 2 years ago

Judging from the errors, the worker failed to connect to the server. Please check whether the other servers' logs contain a line like "running server with rank id: 0, endpoint: 127.0.0.1:36520".

Li-Jiajie commented 2 years ago

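The worker's error is that it cannot connect to the server, and the server's error is the core dump, so the root cause should still be the server core dump, right?

Here is the log from another server: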

cat serverlog.1

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

+=======================================================================================+
|                PaddleRec Benchmark Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                          config_abs_dir  ... /docker/docker/PaddleRec/models/rank/dlrm|
|        hyper_parameters.bot_layer_sizes               [512, 256, 64, 16]              |
|        hyper_parameters.dense_input_dim                       13                      |
|              hyper_parameters.num_field                       26                      |
|        hyper_parameters.optimizer.class                       SGD                     |
|hyper_parameters.optimizer.learning_rate                       0.1                     |
|     hyper_parameters.optimizer.strategy                      async                    |
|     hyper_parameters.sparse_feature_dim                       16                      |
|  hyper_parameters.sparse_feature_number                     1000001                   |
|    hyper_parameters.sparse_inputs_slots                       27                      |
|        hyper_parameters.top_layer_sizes                  [512, 256, 2]                |
|                           runner.epochs                        1                      |
|                 runner.infer_batch_size                      2048                     |
|                  runner.infer_end_epoch                        1                      |
|                  runner.infer_load_path                output_model_dlrm              |
|                runner.infer_reader_path                  criteo_reader                |
|                runner.infer_start_epoch                        0                      |
|                  runner.model_save_path                output_model_dlrm              |
|                   runner.print_interval                       100                     |
|                  runner.split_file_list                      False                    |
|                        runner.sync_mode                      async                    |
|                    runner.test_data_dir  ../../../datasets/criteo/slot_test_data_full |
|                       runner.thread_num                        1                      |
|                 runner.train_batch_size                      2048                     |
|                   runner.train_data_dir  ... ./../datasets/criteo/slot_train_data_full|
|                runner.train_reader_path                  criteo_reader                |
|                          runner.use_auc                      True                     |
|                          runner.use_gpu                      False                    |
|                               yaml_path      models/rank/dlrm/config_bigdata.yaml     |
+=======================================================================================+

/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:341: UserWarning: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/net.py:103
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[methodname]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: PSERVER --
INFO:main:Run Server Begin
I1122 08:57:55.766316 10447 brpc_ps_server.cc:65] running server with rank id: 1, endpoint: 127.0.0.1:51669

C++ Traceback (most recent call last):

0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
1 std::future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::future_base::_Result_base::_Deleter> ()>, bool)
2 paddle::distributed::SAdam::update(unsigned long const, float const, unsigned long, std::vector<unsigned long, std::allocator > const&, paddle::distributed::ValueBlock*)

Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1637571537 (unix time) try "date -d @1637571537" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x0) received by PID 10447 (TID 0x7efd44630700) from PID 0 ]
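To tie the worker's connection errors back to specific servers, the endpoints in the two kinds of log lines can be matched up. This is a rough sketch, not part of PaddleRec: the function names and regular expressions are hypothetical and only cover the line formats that appear in the logs pasted in this thread.

```python
import re

# Startup line printed by each server, e.g.
# "... running server with rank id: 0, endpoint: 127.0.0.1:36520"
SERVER_RE = re.compile(r"running server with rank id: (\d+), endpoint: (\S+)")
# Endpoint named in a worker-side "Connection reset by peer" / "refused" message.
FAILED_RE = re.compile(
    r"SocketId=\d+@(\d+\.\d+\.\d+\.\d+:\d+)\S*: Connection (?:reset by peer|refused)"
)


def server_endpoints(server_log_text):
    """Map endpoint -> rank id using the servers' startup lines."""
    return {ep: int(rank) for rank, ep in SERVER_RE.findall(server_log_text)}


def unreachable_ranks(worker_log_text, endpoints):
    """Return the sorted rank ids of servers the worker lost its connection to."""
    failed = set(FAILED_RE.findall(worker_log_text))
    return sorted(endpoints[ep] for ep in failed if ep in endpoints)


# Excerpts copied from the server and worker logs in this thread.
SERVER_LINES = """\
I1122 08:57:55.800235 10444 brpc_ps_server.cc:65] running server with rank id: 0, endpoint: 127.0.0.1:36520
I1122 08:57:55.766316 10447 brpc_ps_server.cc:65] running server with rank id: 1, endpoint: 127.0.0.1:51669
"""
WORKER_LINES = """\
W1122 08:58:57.628669 11033 input_messenger.cpp:222] Fail to read from fd=19 SocketId=212@127.0.0.1:51669@46133: Connection reset by peer [104]
W1122 08:58:57.759953 11007 input_messenger.cpp:222] Fail to read from fd=13 SocketId=210@127.0.0.1:36520@61550: Connection reset by peer [104]
"""

print(unreachable_ranks(WORKER_LINES, server_endpoints(SERVER_LINES)))
# prints [0, 1]
```

Both servers whose logs are shown here appear among the endpoints the worker lost, which fits the reading that the worker errors are a consequence of the server segmentation faults rather than a separate connectivity problem.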

esythan commented 2 years ago

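To confirm once more: did you modify any PaddleRec code, for example by adding the padding_idx parameter to the embedding in the network definition, or by changing the padding value in the data processing script?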

Li-Jiajie commented 2 years ago

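I re-downloaded the latest image and the problem is resolved. The image I was using before had been downloaded a month earlier, so some versions may have been incompatible. Thanks.

One more question: I'd like to use fleetrun to start multiple nodes on a single machine, with those nodes running in separate containers. Is that supported? As far as I can tell, fleetrun starts all the nodes required for the current IP at once. Is there a way to make the command start only one node?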

yinhaofeng commented 2 years ago

Not supported at the moment.

PaddlePaddle / PaddleRec