alibaba / FastNN

FastNN provides distributed training examples that use EPL.

resnet example: nccl_communicator error #15

Closed wind818 closed 11 months ago

wind818 commented 1 year ago

Environment:

A container built from nvcr.io/nvidia/tensorflow:21.12-tf1-py3

Script:

The resnet script from FastNN

Launch commands:

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

Error:

2023-08-31 01:40:46.786721: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.397497: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.403631: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.433142: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[{{node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater}}]]

Traceback (most recent call last):
  File "resnet_dp.py", line 92, in <module>
    run_model()
  File "resnet_dp.py", line 67, in run_model
    with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 581, in MonitoredTrainingSession
    return MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1010, in __init__
    super(MonitoredSession, self).__init__(
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 319, in init
    res = fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 639, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 453, in run
    assign_ops = _init_local_resources(self, fn)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 423, in _init_local_resources
    fn(self, local_resources_init_op)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
SeaOfOcean commented 1 year ago

The second worker's index should be 1.

wind818 commented 1 year ago

> The second worker's index should be 1.

That was a typo on my part; in the actual run, the second worker's index was 1. The NCCL error still occurs with every image I have tested.

wind818 commented 1 year ago

> The second worker's index should be 1.

The image was built following the installation instructions on the official site, and the code and launch commands were tested exactly as described in the documentation. Could you provide an image that can run this code end to end?

SueeH commented 11 months ago

> The second worker's index should be 1.
>
> The image was built following the installation instructions on the official site, and the code and launch commands were tested exactly as described in the documentation. Could you provide an image that can run this code end to end?

Has this been resolved? I am hitting the same problem in my tests.

adoda commented 11 months ago

If you launch two workers on the same machine, assign each worker its own GPU list by setting the CUDA_VISIBLE_DEVICES variable. For example, give the first worker GPUs 0 and 1 and the second worker GPUs 2 and 3 by prefixing the bash command with CUDA_VISIBLE_DEVICES=0,1 or CUDA_VISIBLE_DEVICES=2,3.
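
Concretely, reusing the launch commands from the issue description (the GPU indices are only an example; adjust them to your machine), the two workers could be started like this:

CUDA_VISIBLE_DEVICES=0,1 TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

CUDA_VISIBLE_DEVICES=2,3 TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":1}}' bash scripts/train_dp.sh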

SeaOfOcean commented 11 months ago

@SueeH The feedback is that exporting NCCL_SOCKET_IFNAME to the correct network interface makes it run.
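
For example (eth0 is only a placeholder; pick the interface that actually carries the 192.168.83.228 address, e.g. as shown by ip addr), set the variable before launching each worker:

export NCCL_SOCKET_IFNAME=eth0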

SueeH commented 11 months ago

> @SueeH The feedback is that exporting NCCL_SOCKET_IFNAME to the correct network interface makes it run.

Exactly right, the problem is solved.