dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

KVServer and KVClient on different machines fail to initialize data #2288

Closed chhzh123 closed 2 years ago

chhzh123 commented 3 years ago

It seems the distributed KVStore module has not been thoroughly tested. I am able to start a KVServer and a KVClient on the same machine and exchange data between them. However, if I start the KVServer and the KVClient on different machines, the data cannot be initialized correctly.

To Reproduce

Steps to reproduce the behavior:

Use the example function (test_kv_store) in test_new_kvstore.py, and configure the server and client on different machines.

The traceback is shown below. As you can see, the client connects to the server successfully. However, the client cannot create the data_0 tensor, because it looks up the shared memory on its own machine while the tensor was initialized on the other machine.

Machine (0) client (0) connect to server successfuly!
Machine (0) client (1) connect to server successfuly!
Process SpawnProcess-2:
Process SpawnProcess-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/tiger/gnn_example/ps/test_kvstore.py", line 159, in start_client
    kvclient.map_shared_data(partition_book=gpb)
  File "/opt/tiger/gnn_example/ps/test_kvstore.py", line 159, in start_client
    kvclient.map_shared_data(partition_book=gpb)
  File "/usr/local/lib/python3.8/site-packages/dgl/distributed/kvstore.py", line 1069, in map_shared_data
    shared_data = empty_shared_mem(name+'-kvdata-', False, shape, dtype)
  File "/usr/local/lib/python3.8/site-packages/dgl/distributed/kvstore.py", line 1069, in map_shared_data
    shared_data = empty_shared_mem(name+'-kvdata-', False, shape, dtype)
  File "/usr/local/lib/python3.8/site-packages/dgl/_ffi/ndarray.py", line 143, in empty_shared_mem
    check_call(_LIB.DGLArrayAllocSharedMem(
  File "/usr/local/lib/python3.8/site-packages/dgl/_ffi/ndarray.py", line 143, in empty_shared_mem
    check_call(_LIB.DGLArrayAllocSharedMem(
  File "/usr/local/lib/python3.8/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
  File "/usr/local/lib/python3.8/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [17:33:31] /opt/dgl/src/runtime/shared_mem.cc:67: Check failed: fd != -1 (-1 vs. -1) : fail to open data_0-kvdata-: No such file or directory
Stack trace:
  [bt] (0) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f1e12702a4f]
  [bt] (1) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::SharedMemory::Open(unsigned long)+0x182) [0x7f1e12dbbc42]
  [bt] (2) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::EmptyShared(std::string const&, std::vector<long, std::allocator<long> >, DLDataType, DLContext, bool)+0x1ed) [0x7f1e12db3c3d]
  [bt] (3) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(DGLArrayAllocSharedMem+0x125) [0x7f1e12db43c5]
  [bt] (4) /usr/local/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1e8bf2b9dd]
  [bt] (5) /usr/local/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1e8bf2b067]
  [bt] (6) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x1097a) [0x7f1e8bbf997a]
  [bt] (7) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x110db) [0x7f1e8bbfa0db]
  [bt] (8) /usr/local/bin/python3(_PyObject_MakeTpCall+0x22f) [0x55f6e70e450f]

dgl._ffi.base.DGLError: [17:33:31] /opt/dgl/src/runtime/shared_mem.cc:67: Check failed: fd != -1 (-1 vs. -1) : fail to open data_0-kvdata-: No such file or directory
Stack trace:
  [bt] (0) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f8774665a4f]
  [bt] (1) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::SharedMemory::Open(unsigned long)+0x182) [0x7f8774d1ec42]
  [bt] (2) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::EmptyShared(std::string const&, std::vector<long, std::allocator<long> >, DLDataType, DLContext, bool)+0x1ed) [0x7f8774d16c3d]
  [bt] (3) /usr/local/lib/python3.8/site-packages/dgl/libdgl.so(DGLArrayAllocSharedMem+0x125) [0x7f8774d173c5]
  [bt] (4) /usr/local/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f87ede8e9dd]
  [bt] (5) /usr/local/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f87ede8e067]
  [bt] (6) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x1097a) [0x7f87edb5c97a]
  [bt] (7) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x110db) [0x7f87edb5d0db]
  [bt] (8) /usr/local/bin/python3(_PyObject_MakeTpCall+0x22f) [0x555f8016950f]
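The failure is consistent with the client looking for the segment in local shared memory: on Linux, POSIX shared-memory segments show up as files under /dev/shm on the machine that created them, which appears to be how DGL's SharedMemory is backed. The sketch below, run on both machines, is a minimal way to confirm that the segment only exists on the server's machine; the /dev/shm assumption is mine, and the segment name is taken from the error above.

```python
# Minimal sketch: check whether the shared-memory segment from the DGLError
# exists on this machine. Assumes Linux with POSIX shm exposed under /dev/shm;
# the segment name is taken from the error message above.
import os

SEGMENT = "data_0-kvdata-"
path = os.path.join("/dev/shm", SEGMENT)

if os.path.exists(path):
    print(f"{path} exists ({os.path.getsize(path)} bytes): the server created it locally")
else:
    print(f"{path} not found: the segment lives on another machine")
```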

Expected behavior

If the server and the client are on the same machine, shared memory can be used for communication. Otherwise, only a placeholder for the corresponding tensor needs to be created on the client side, and there is no need to check the shared memory.
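To illustrate the proposed behavior, here is a minimal sketch of that fallback. This is not DGL's actual map_shared_data code; open_shared_mem and is_local are hypothetical stand-ins for the corresponding DGL internals, and numpy is used only to make the placeholder concrete.

```python
# Illustrative sketch only -- not DGL's actual map_shared_data implementation.
# `open_shared_mem` and `is_local` stand in for DGL internals (hypothetical).
import numpy as np

def map_partition_data(name, shape, dtype, is_local, open_shared_mem):
    """Return the tensor that backs `name` on this client."""
    if is_local:
        # The server runs on this machine: attach to its shared-memory segment.
        return open_shared_mem(name + "-kvdata-", shape, dtype)
    # The server runs on another machine: a local placeholder is enough;
    # real values are fetched over the network on demand.
    return np.zeros(shape, dtype=dtype)
```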

Environment

jermainewang commented 3 years ago

@aksnzhy

aksnzhy commented 3 years ago

@chhzh123 Thanks for reporting this issue. You are right: for now we assume that each machine has a local server process. We should support the case where they are on different machines.

chhzh123 commented 3 years ago

> @chhzh123 Thanks for reporting this issue. You are right: for now we assume that each machine has a local server process. We should support the case where they are on different machines.

So are there any workarounds that I could try to avoid this issue?

VoVAllen commented 3 years ago

I think we have a standalone mode for KVServer and KVClient, but it hasn't been tested yet. Why would you prefer to deploy them separately?

chhzh123 commented 3 years ago

> I think we have a standalone mode for KVServer and KVClient, but it hasn't been tested yet. Why would you prefer to deploy them separately?

We have large graph embeddings which need to be stored on several machines. The client needs to fetch embeddings from these machines to do computation.

aksnzhy commented 3 years ago

> I think we have a standalone mode for KVServer and KVClient, but it hasn't been tested yet. Why would you prefer to deploy them separately?
>
> We have large graph embeddings which need to be stored on several machines. The client needs to fetch embeddings from these machines to do computation.

Can we use these machines for both storing the embeddings and training? You can tune the server count and trainer count for the best performance.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] commented 2 years ago

This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.