Open · Li-Jiajie opened this issue 2 years ago
Could you describe in detail which script you used to launch, along with the launch command and arguments? Also, which Paddle version and which PaddleRec version are you using? We have made quite a few changes to this part of the code recently and could not reproduce your problem, so we need a more detailed description of the reproduction steps.
I also asked in the Paddle repo; see https://github.com/PaddlePaddle/Paddle/issues/37346. I am using v2.2.0, released last Friday.
Launch command: fleetrun --worker_num=4 --server_num=3 tools/static_ps_trainer.py -m models/rank/dlrm/config_bigdata.yaml
Paddle version: 2.2.0 (CPU)
Python: 3.6.8
OS: CentOS Linux release 7.2 (Final)
I tried the same command with the official Paddle image; it still core dumps after a few training batches, with the same error. @yinhaofeng
I still cannot reproduce your error on my side. Could you describe in more detail how you ran into it, and whether you changed any code or configuration?
1. Use the official registry.baidubce.com/paddlepaddle/paddle:2.2.0 image (CPU version).
2. Clone PaddleRec and download the full Criteo dataset under datasets.
3. Change use_gpu to False in models/rank/dlrm/config_bigdata.yaml.
4. From the PaddleRec root directory, run: fleetrun --worker_num=4 --server_num=3 tools/static_ps_trainer.py -m models/rank/dlrm/config_bigdata.yaml
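If it helps narrow this down, below is a minimal standalone sketch (my own illustration, not part of PaddleRec) of an async static-PS program with a single sparse embedding table. It only relies on the public paddle.distributed.fleet APIs from Paddle 2.2; the file name, model, and hyperparameters are made up. Launching it the same way (fleetrun --worker_num=4 --server_num=3 minimal_ps.py) would show whether the crash needs the full DLRM net or already appears with any sparse table.

```python
# Minimal async static-PS sketch: one sparse embedding + fc, trained on random data.
# All names here are illustrative; only public paddle.distributed.fleet APIs are used.
import numpy as np
import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init()  # role/endpoints come from the environment set up by fleetrun

ids = paddle.static.data(name="ids", shape=[None, 1], dtype="int64")
label = paddle.static.data(name="label", shape=[None, 1], dtype="float32")
emb = paddle.static.nn.sparse_embedding(input=ids, size=[1000001, 16])  # distributed sparse table
pred = paddle.static.nn.fc(x=paddle.sum(emb, axis=1), size=1)
loss = paddle.mean(paddle.nn.functional.square_error_cost(pred, label))

strategy = fleet.DistributedStrategy()
strategy.a_sync = True  # async PS mode, i.e. runner.sync_mode: async
optimizer = fleet.distributed_optimizer(paddle.optimizer.SGD(learning_rate=0.1), strategy)
optimizer.minimize(loss)

if fleet.is_server():
    fleet.init_server()
    fleet.run_server()
elif fleet.is_worker():
    exe = paddle.static.Executor(paddle.CPUPlace())
    exe.run(paddle.static.default_startup_program())
    fleet.init_worker()
    for step in range(1000):  # enough steps to get past the point where the crash appears
        exe.run(
            paddle.static.default_main_program(),
            feed={
                "ids": np.random.randint(0, 1000001, (2048, 1), dtype="int64"),
                "label": np.random.rand(2048, 1).astype("float32"),
            },
        )
    fleet.stop_worker()
```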
Which PaddleRec version are you using, and have you pulled the latest code?
v2.2.0, with the latest code.
Single-machine training works fine; it only core dumps in PS mode. The machine has 96 cores and 256 GB of RAM, so memory is not the issue.
I implemented a DLRM myself earlier, and it core dumped as well.
Running it produces a log directory. Please send screenshots of workerlog.0 and serverlog.0 as well as the console output. If the full dataset takes too long, please also check whether the demo data triggers the same bug.
registry.baidubce.com/paddlepaddle/paddle:2.2.0
server.0:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
+=======================================================================================+
| PaddleRec Benchmark Envs Value |
| hyper_parameters.bot_layer_sizes [512, 256, 64, 16] |
| runner.epochs 1 |
| runner.infer_batch_size 2048 |
| runner.infer_start_epoch 0 |
| runner.model_save_path output_model_dlrm |
| runner.print_interval 100 |
| runner.split_file_list False |
| runner.sync_mode async |
| runner.test_data_dir ../../../datasets/criteo/slot_test_data_full |
| runner.thread_num 1 |
| runner.use_gpu False |
"When training, we now always track global mean and variance.")
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[methodname]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: PSERVER --
INFO:main:Run Server Begin
I1122 08:57:55.800235 10444 brpc_ps_server.cc:65] running server with rank id: 0, endpoint: 127.0.0.1:36520
0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
2 paddle::distributed::SAdam::update(unsigned long const*, float const*, unsigned long, std::vector<unsigned long, std::allocator
FatalError: `Segmentation fault` is detected by the operating system.
[TimeInfo: *** Aborted at 1637571537 (unix time) try "date -d @1637571537" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x0) received by PID 10444 (TID 0x7f345eff5700) from PID 0 ***]
worker.0
INFO:utils.static_ps.reader_helper:File: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/../../../datasets/criteo/slot_train_data_full/part-130 has 200000 examples
INFO:utils.static_ps.reader_helper:File: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/../../../datasets/criteo/slot_train_data_full/part-160 has 200000 examples
INFO:utils.static_ps.reader_helper:Total example: 44000000
/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:341: UserWarning: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/net.py:103
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[methodname]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: TRAINER --
INFO:main:Run Worker Begin
INFO:main:Epoch: 0, Running RecDatast Begin.
INFO:main:Epoch: 0, Batch_id: 0, cost: [0.8848212], auc: [0.50078756], avg_reader_cost: 0.00240 sec, avg_batch_cost: 0.00396 sec, avg_samples: 20.48000, ips: 5166.07934 example/sec
INFO:main:Epoch: 0, Batch_id: 100, cost: [0.5674137], auc: [0.64664944], avg_reader_cost: 0.17707 sec, avg_batch_cost: 0.24327 sec, avg_samples: 2048.00000, ips: 8418.67772 example/sec
W1122 08:58:57.628669 11033 input_messenger.cpp:222] Fail to read from fd=19 SocketId=212@127.0.0.1:51669@46133: Connection reset by peer [104]
W1122 08:58:57.628697 11045 input_messenger.cpp:222] Fail to read from fd=21 SocketId=320@127.0.0.1:51669@46134: Connection reset by peer [104]
W1122 08:58:57.628742 11044 input_messenger.cpp:222] Fail to read from fd=8 SocketId=108@127.0.0.1:51669@46073: Connection reset by peer [104]
I1122 08:58:57.729051 11016 socket.cpp:2370] Checking SocketId=1@127.0.0.1:51669
W1122 08:58:57.759953 11007 input_messenger.cpp:222] Fail to read from fd=13 SocketId=210@127.0.0.1:36520@61550: Connection reset by peer [104]
W1122 08:58:57.759985 11033 input_messenger.cpp:222] Fail to read from fd=17 SocketId=318@127.0.0.1:36520@61551: Connection reset by peer [104]
E1122 08:58:57.760069 11010 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E104]Fail to read from fd=13 SocketId=210@127.0.0.1:36520@61550: Connection reset by peer [R1][E111]Fail to connect SocketId=17179869191@127.0.0.1:36520: Connection refused [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760084 11029 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E1014]Got EOF of fd=12 SocketId=102@127.0.0.1:36520@61423 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760102 11044 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E1014]Got EOF of fd=10 SocketId=312@127.0.0.1:36520@61519 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760108 11074 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760123 11029 brpc_ps_client.cc:194] resquest cmd_id:3 failed, err:[E1014]Got EOF of fd=14 SocketId=306@127.0.0.1:36520@61425 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760124 11073 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760130 10453 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760159 11045 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E104]Fail to read from fd=17 SocketId=318@127.0.0.1:36520@61551: Connection reset by peer [R1][E111]Fail to connect SocketId=8589934804@127.0.0.1:36520@46133: Connection refused [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
It stopped after the second batch.
From the error, it looks like the worker failed to connect to the server. Please check whether the other servers also print a line like "running server with rank id: 0, endpoint: 127.0.0.1:36520".
The worker's error is that it cannot connect to the server, and the server's error is a core dump, so the root cause should still be the server crash.
Here is the log of another server:
cat serverlog.1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
+=======================================================================================+
| PaddleRec Benchmark Envs Value |
+---------------------------------------------------------------------------------------+
| config_abs_dir ... /docker/docker/PaddleRec/models/rank/dlrm|
| hyper_parameters.bot_layer_sizes [512, 256, 64, 16] |
| hyper_parameters.dense_input_dim 13 |
| hyper_parameters.num_field 26 |
| hyper_parameters.optimizer.class SGD |
|hyper_parameters.optimizer.learning_rate 0.1 |
| hyper_parameters.optimizer.strategy async |
| hyper_parameters.sparse_feature_dim 16 |
| hyper_parameters.sparse_feature_number 1000001 |
| hyper_parameters.sparse_inputs_slots 27 |
| hyper_parameters.top_layer_sizes [512, 256, 2] |
| runner.epochs 1 |
| runner.infer_batch_size 2048 |
| runner.infer_end_epoch 1 |
| runner.infer_load_path output_model_dlrm |
| runner.infer_reader_path criteo_reader |
| runner.infer_start_epoch 0 |
| runner.model_save_path output_model_dlrm |
| runner.print_interval 100 |
| runner.split_file_list False |
| runner.sync_mode async |
| runner.test_data_dir ../../../datasets/criteo/slot_test_data_full |
| runner.thread_num 1 |
| runner.train_batch_size 2048 |
| runner.train_data_dir ... ./../datasets/criteo/slot_train_data_full|
| runner.train_reader_path criteo_reader |
| runner.use_auc True |
| runner.use_gpu False |
| yaml_path models/rank/dlrm/config_bigdata.yaml |
+=======================================================================================+
/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:341: UserWarning: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/net.py:103
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[methodname]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: PSERVER --
INFO:main:Run Server Begin
I1122 08:57:55.766316 10447 brpc_ps_server.cc:65] running server with rank id: 1, endpoint: 127.0.0.1:51669
0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
2 paddle::distributed::SAdam::update(unsigned long const*, float const*, unsigned long, std::vector<unsigned long, std::allocator
FatalError: `Segmentation fault` is detected by the operating system.
[TimeInfo: *** Aborted at 1637571537 (unix time) try "date -d @1637571537" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x0) received by PID 10447 (TID 0x7efd44630700) from PID 0 ***]
Please confirm once more whether you modified the PaddleRec code, for example by adding a padding_idx argument to the embedding in the network, or by changing the padding value in the data-processing script.
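For context, this is the kind of change being asked about (an illustration only; the stock dlrm net does not pass padding_idx):

```python
import paddle

# Hypothetical modification of the sparse embedding in net.py: adding padding_idx here,
# or changing the padding value in the data-processing script, is the kind of edit the
# question above refers to.
emb = paddle.nn.Embedding(
    num_embeddings=1000001,  # hyper_parameters.sparse_feature_number
    embedding_dim=16,        # hyper_parameters.sparse_feature_dim
    sparse=True,
    padding_idx=0,           # not present in the stock config
)
```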
I re-downloaded the latest image and the problem is solved. The image I used before was downloaded a month ago; some versions were probably incompatible. Thanks.
One more question: I want to use fleetrun to launch multiple nodes on one machine, with each node running in its own container. Is that supported? Right now fleetrun launches all the nodes required for the current IP in one go; is there a way to make a single invocation launch only one node?
That is not supported at the moment.
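For what it's worth, a manual pattern outside fleetrun (treat this purely as a hypothetical sketch and verify every variable name against your Paddle version) is to export the role-describing environment variables in each container and run the trainer script once per container:

```python
# Hypothetical per-container setup: instead of letting fleetrun spawn all local nodes,
# each container sets the PS-mode role variables itself before fleet.init().
# These are the variables fleetrun normally exports; double-check them for your version.
import os

os.environ["TRAINING_ROLE"] = "PSERVER"   # "TRAINER" in worker containers
os.environ["PADDLE_PSERVERS_IP_PORT_LIST"] = "10.0.0.1:36000,10.0.0.2:36000,10.0.0.3:36000"  # illustrative
os.environ["PADDLE_TRAINERS_NUM"] = "4"
os.environ["POD_IP"] = "10.0.0.1"         # this container's IP (server side)
os.environ["PADDLE_PORT"] = "36000"       # this server's port (server side)
# os.environ["PADDLE_TRAINER_ID"] = "0"   # workers additionally need their own index

import paddle.distributed.fleet as fleet
fleet.init()  # picks the role up from the environment instead of from fleetrun
```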
When DistributedStrategy's a_sync=True, the server core dumps; with a_sync=False it runs normally. Setup: 2 servers, 3 workers, CPU mode. Both Adam and SGD crash at the position shown in the figure. What could be causing this?
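For reference, the setting in question is the fleet DistributedStrategy flag (in PaddleRec's static PS trainer this appears to be driven by runner.sync_mode in the yaml):

```python
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.a_sync = True     # async PS mode -- the configuration reported to core dump
# strategy.a_sync = False  # sync mode -- reported to run normally

# The strategy is then handed to the optimizer wrapper, e.g.:
# optimizer = fleet.distributed_optimizer(paddle.optimizer.SGD(learning_rate=0.1), strategy)
# optimizer.minimize(loss)
```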