alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

Distributed GCN run error #213

Closed Sanzo00 closed 2 years ago

Sanzo00 commented 2 years ago

I tried to run examples/pytorch/gcn/train.py and ran into the following problem:

# server 1
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.30.115.32 --master_port=10086 train.py

# server 2
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=172.30.115.32 --master_port=10086 train.py

The error message is as follows:

/root/miniconda3/envs/graphlearn/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
world_size: 2, rank: 0
[2022-07-31 31:29:34.700907] Invalid file path: root://graphlearn
E20220801 07:29:34.700927  2256 env.cc:112] File system not implemented: root://graphlearn
F20220801 07:29:34.701110  2256 fs_coordinator.cc:42] Invalid tracker path: root://graphlearn/
*** Check failure stack trace: ***
    @     0x7f50dfba322e  google::LogMessage::Fail()
    @     0x7f50dfba318b  google::LogMessage::SendToLog()
    @     0x7f50dfba2ad1  google::LogMessage::Flush()
    @     0x7f50dfba6048  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f50dfb5fb15  graphlearn::FSCoordinator::FSCoordinator()
    @     0x7f50dfb58f85  graphlearn::GetCoordinator()
    @     0x7f50dfb2b65e  graphlearn::ServerImpl::RegisterBasicService()
    @     0x7f50dfb2ba00  graphlearn::DefaultServerImpl::Start()
    @     0x7f50e0790380  _ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEJEJNS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_E_vJSI_EJS5_S6_S7_EEEvOS9_PFS8_SB_ESH_ENUlRNS_6detail13function_callEE1_4_FUNESP_
    @     0x7f50e0790e7c  pybind11::cpp_function::dispatcher()
    @     0x557ad73d200e  cfunction_call_varargs
    @     0x557ad73c713f  _PyObject_MakeTpCall
    @     0x557ad73fcca0  method_vectorcall
    @     0x557ad7471923  _PyEval_EvalFrameDefault
    @     0x557ad74637e7  _PyFunction_Vectorcall
    @     0x557ad746ce60  _PyEval_EvalFrameDefault
    @     0x557ad74637e7  _PyFunction_Vectorcall
    @     0x557ad746ce60  _PyEval_EvalFrameDefault
    @     0x557ad7462600  _PyEval_EvalCodeWithName
    @     0x557ad7463bc4  _PyFunction_Vectorcall
    @     0x557ad73fcb2e  method_vectorcall
    @     0x557ad746deb0  _PyEval_EvalFrameDefault
    @     0x557ad74637e7  _PyFunction_Vectorcall
    @     0x557ad746d0bb  _PyEval_EvalFrameDefault
    @     0x557ad7462600  _PyEval_EvalCodeWithName
    @     0x557ad7463eb3  PyEval_EvalCode
    @     0x557ad74d8622  run_eval_code_obj
    @     0x557ad74e91d2  run_mod
    @     0x557ad74ec36b  pyrun_file
    @     0x557ad74ec54f  PyRun_SimpleFileExFlags
    @     0x557ad74eca29  Py_RunMain
    @     0x557ad74ecc29  Py_BytesMain
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2256) of binary: /root/miniconda3/envs/graphlearn/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphlearn/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
train.py.bak FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-01_07:29:38
  host      : iZ0jl0791scft6ultff8n0Z
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 2256)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2256
=====================================================

I don't know why, but it runs once I add the client_num argument. From the comment on client_num, it is used to specify the number of PyTorch dataloader workers?

# server 1
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.30.115.32 --master_port=10086 train.py --client_num=1

# server 2
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=172.30.115.32 --master_port=10086 train.py --client_num=1

 argparser.add_argument('--client_num', type=int, default=0,
                         help="The number of graphlearn clients on each pytorch worker, "
                              "which is used as `num_workers` of the pytorch dataloader.")

Looking forward to your reply, thank you!

baoleai commented 2 years ago

In v1.0.1, if client_num < 1, the file-system-based synchronization mode is used, and the error is "[2022-07-31 31:29:34.700907] Invalid file path: root://graphlearn E20220801 07:29:34.700927 2256 env.cc:112] File system not implemented: root://graphlearn F20220801 07:29:34.701110 2256 fs_coordinator.cc:42] Invalid tracker path: root://graphlearn/". This error is strange, because the default build of the wheel package supports the local file system; it may be caused by an old address-sync directory that was not cleared. You can try running the gcn example on the master branch, which uses RPC-based address synchronization instead.
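
For anyone reading along, a minimal sketch of the RPC-based workaround, assuming the gl.set_tracker_mode(0) call referenced in the follow-up comment below; the mode value and its placement are taken from that comment and should be treated as assumptions rather than the example's confirmed code:

import graphlearn as gl

gl.set_tracker_mode(0)   # per the follow-up comment, 0 selects RPC-based address sync
# ... then build gl.Graph() and call its init() with this worker's rank, as train.py does ...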

Sanzo00 commented 2 years ago

Thanks for the reply. So I can work around this by setting client_num, because that makes it call gl.set_tracker_mode(0) to switch to RPC mode, right? (screenshot)

I also have another question: when training with two machines, after each epoch I see the message `Epoch 59 out of range.` Is this a bug in the program? I tried it, and with client_num = 1 this message does not appear.

# server 1
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=172.30.115.32 --master_port=10086 train.py.bak --client_num=2

# server 2
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=172.30.115.32 --master_port=10086 train.py.bak --client_num=2

(screenshot)

client_num = 1:

(screenshot)

baoleai commented 2 years ago

When client_num > 1, multiple clients connect to one server and consume data from that server asynchronously. So a client's request may reach the server only to find that the server's state is already out of range (the data has been consumed by other clients). This log message is not an error or a bug and does not affect normal execution.
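
A conceptual sketch of why the message is benign, assuming graphlearn signals exhaustion with gl.OutOfRangeError as in its bundled examples; the iterator API and training callback here are hypothetical, not the example's actual code:

import graphlearn as gl

def consume_epoch(dataset, train_step):
    # Several clients drain the same server; whichever client arrives after the
    # epoch's data has been consumed simply treats "out of range" as end-of-epoch.
    while True:
        try:
            batch = dataset.next()      # hypothetical iterator API
            train_step(batch)
        except gl.OutOfRangeError:
            break                       # expected once the server is drained; not an error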

Sanzo00 commented 2 years ago

Got it. I have one more thing to confirm: I noticed that the per-epoch loss and the final Test Accuracy differ between the two servers. Is that because each server is responsible for different vertices and only computes the loss and accuracy for the vertices it owns, while the parameters W that each server ends up with should still be identical (synchronized each epoch by torch DDP)?

server 1: (screenshot)

server 2: (screenshot)
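
One hedged way to check that the DDP-synchronized weights really do match across the two servers, using only standard torch.distributed calls (this is a verification add-on, not part of the example): broadcast rank 0's parameters and compare them locally after an epoch.

import torch
import torch.distributed as dist

def check_params_in_sync(model):
    # With DDP the assertion should hold on every rank after each optimizer step.
    for name, p in model.named_parameters():
        reference = p.detach().clone()
        dist.broadcast(reference, src=0)   # rank 0's copy of this parameter
        assert torch.allclose(p.detach(), reference), f"{name} differs from rank 0"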

baoleai commented 2 years ago

Yes. In the current pytorch example we sample with graphlearn, then use the model part of PyG directly and train it with pytorch DDP.
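
For context, a minimal sketch of that shape (a PyG GCN wrapped in DDP); the layer sizes and process-group backend are assumptions, not the example's actual code:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, out_dim)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

dist.init_process_group('gloo')    # assumed backend; 'nccl' when training on GPUs
model = DDP(GCN(1433, 16, 7))      # illustrative Cora-like dimensions
# DDP averages gradients on every backward pass, so the weights stay identical
# across the two servers even though each one trains on its own vertices.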

Sanzo00 commented 2 years ago

Got it, thank you for your reply.