Open jzp1025 opened 6 years ago
@mxnet-label-bot could you please add [Backend] here?
I think this is actually an issue with ZeroMQ (zmq), which ps-lite uses. zmq can only take an IP address to find the network interface, so using a hostname will fail. Please take a look at the discussion here: https://stackoverflow.com/questions/6024003/why-doesnt-zeromq-work-on-localhost
thanks!
@mxnet-label-bot add[Distributed]
Description
DMLC_PS_ROOT_URI should accept either an IP address or a hostname, but when I use a hostname instead of an IP address, it reports "bind failed" in src/van.cc.
Environment info (Required)
Package used (Python/R/Scala/Julia): python
Build info (Required if built from source)
gcc
MXNet commit hash: 3df9bf802021d5aa67c609c6736acee94aaf3a48
Build config: the same as doc https://mxnet.apache.org/install/index.html?platform=Linux&language=Python&processor=CPU
Error Message:
[17:46:11] /home/tusimple/incubator-mxnet/dmlc-core/include/dmlc/./logging.h:308: [17:46:11] src/van.cc:76: Check failed: (mynode.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f1a283f624c]
[bt] (1) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN2ps3Van5StartEv+0x91f) [0x7f1a2af45b8f]
[bt] (2) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN2ps6ZMQVan5StartEv+0x4a) [0x7f1a2af504fa]
[bt] (3) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN2ps10Postoffice5StartEPKcb+0x1e9) [0x7f1a2af42119]
[bt] (4) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN5mxnet7kvstore11KVStoreDist9RunServerERKSt8functionIFviRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEE+0x1c5) [0x7f1a2aee1c35]
[bt] (5) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(MXKVStoreRunServer+0x4b) [0x7f1a2ae629db]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f1a41450e40]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f1a414508ab]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f1a416603df]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f1a41664d82]
Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/tusimple/incubator-mxnet/example/image-classification/common/find_mxnet.py", line 20, in <module>
    import mxnet as mx
  File "/home/tusimple/incubator-mxnet/example/image-classification/mxnet/__init__.py", line 56, in <module>
    from . import kvstore_server
  File "/home/tusimple/incubator-mxnet/example/image-classification/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/home/tusimple/incubator-mxnet/example/image-classification/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/home/tusimple/incubator-mxnet/example/image-classification/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/home/tusimple/incubator-mxnet/example/image-classification/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:46:11] src/van.cc:76: Check failed: (mynode.port) != (-1) bind failed
Stack trace returned 10 entries:
[bt] (0) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f1a283f624c]
[bt] (1) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN2ps3Van5StartEv+0x91f) [0x7f1a2af45b8f]
[bt] (2) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN2ps6ZMQVan5StartEv+0x4a) [0x7f1a2af504fa]
[bt] (3) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN2ps10Postoffice5StartEPKcb+0x1e9) [0x7f1a2af42119]
[bt] (4) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(_ZN5mxnet7kvstore11KVStoreDist9RunServerERKSt8functionIFviRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEE+0x1c5) [0x7f1a2aee1c35]
[bt] (5) /home/tusimple/incubator-mxnet/example/image-classification/mxnet/libmxnet.so(MXKVStoreRunServer+0x4b) [0x7f1a2ae629db]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f1a41450e40]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f1a414508ab]
[bt] (8) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f1a416603df]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f1a41664d82]
Minimum reproducible example
1 scheduler, 1 server, 1 worker
Steps to reproduce
1. export DMLC_PS_ROOT_URI=tusimple-System-Product-Name; export DMLC_ROLE=scheduler; export DMLC_PS_ROOT_PORT=9001; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1;
2. python train_mnist.py
What have you tried to solve it?
1. I replaced the hostname in DMLC_PS_ROOT_URI with the machine's IP address, and it works well.
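For anyone who wants to keep hostnames in their launch scripts, the workaround can be automated by resolving the hostname to an IP address before the DMLC_* variables are exported. The following is only a sketch under assumptions: it targets Python 3 (the trace above is from Python 2.7), and `resolve_root_uri` and `scheduler_env` are hypothetical helper names, not part of MXNet or ps-lite. The variable names and default values are the ones from the reproduction steps.

```python
import ipaddress
import os
import socket


def resolve_root_uri(host):
    """Return an IP address string for `host` (hypothetical helper, not an MXNet API)."""
    try:
        # Already a literal IP address: return it unchanged.
        return str(ipaddress.ip_address(host))
    except ValueError:
        pass
    # Otherwise ask the system resolver for an IPv4 address.
    # Raises socket.gaierror if the name cannot be resolved.
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return infos[0][4][0]


def scheduler_env(host, port=9001, num_worker=1, num_server=1):
    """Build the scheduler environment from the reproduction steps, using an IP address."""
    env = dict(os.environ)
    env.update({
        "DMLC_PS_ROOT_URI": resolve_root_uri(host),
        "DMLC_ROLE": "scheduler",
        "DMLC_PS_ROOT_PORT": str(port),
        "DMLC_NUM_WORKER": str(num_worker),
        "DMLC_NUM_SERVER": str(num_server),
    })
    return env
```

The returned dictionary could then be passed as `env=` to `subprocess.run(["python", "train_mnist.py"], env=...)`. One caveat: on some Linux distributions the machine's own hostname maps to a loopback address such as 127.0.1.1 in /etc/hosts, in which case the resolved address would not be reachable from other machines, so it is worth checking the result before using it for a multi-host job.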