apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Test failure and possible bug on GPU topology algorithm (test_device.test_device_pushpull) #12994

Open larroy opened 5 years ago

larroy commented 5 years ago

Description

NVIDIA reported a failure in test_device.test_device_pushpull on a DGX-1V.

I suspect there is a bug in the binary tree creation. I'm looking into this issue.

ERROR: test_device.test_device_pushpull
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
   self.test(*self.arg)
File "/opt/mxnet/tests/python/gpu/test_device.py", line 74, in test_device_pushpull
   check_dense_pushpull('device')
File "/opt/mxnet/tests/python/gpu/test_device.py", line 61, in check_dense_pushpull
   kv_device.push(cur_key, arr_list)
File "/opt/mxnet/python/mxnet/kvstore.py", line 234, in push
   self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
   raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:44:02] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking

Error Message:

[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[17:47:41] src/kvstore/././comm_tree.h:392: Using Tree
[17:47:41] src/kvstore/././comm_tree.h:489: Size 10 occurs 1 times
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[17:47:41] src/kvstore/././comm_tree.h:392: Using Tree
[17:47:41] src/kvstore/././comm_tree.h:489: Size 10 occurs 1 times
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
Traceback (most recent call last):
  File "test_device.py", line 82, in <module>
    test_device_pushpull()
  File "test_device.py", line 74, in test_device_pushpull
    check_dense_pushpull('device')
  File "test_device.py", line 61, in check_dense_pushpull
    kv_device.push(cur_key, arr_list)
  File "/opt/mxnet/python/mxnet/kvstore.py", line 234, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:47:41] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7ffa6698659c]
[bt] (1) /usr/local/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7ffa66987918]
[bt] (2) /usr/local/lib/libmxnet.so(void mxnet::kvstore::ComputeTreesFromRoot<float>(std::vector<float, std::allocator<float> >*, int, int, float, bool, std::vector<unsigned long, std::allocator<unsigned long> >*, std::vector<unsigned long, std::allocator<unsigned long> >*)+0x1a65) [0x7ffa69a59ff5]
[bt] (3) /usr/local/lib/libmxnet.so(void mxnet::kvstore::ComputeTrees<float>(std::vector<float, std::allocator<float> > const&, int, float, bool, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*)+0x5b5) [0x7ffa69a5a815]
[bt] (4) /usr/local/lib/libmxnet.so(mxnet::kvstore::CommDeviceTree::QueryTopology()+0x1609) [0x7ffa69a5d409]
[bt] (5) /usr/local/lib/libmxnet.so(mxnet::kvstore::CommDeviceTree::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x137c) [0x7ffa69a5f0cc]
[bt] (6) /usr/local/lib/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x1b9) [0x7ffa69a60ec9]
[bt] (7) /usr/local/lib/libmxnet.so(mxnet::kvstore::KVStoreLocal::Push(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0xc6) [0x7ffa69a02ee6]
[bt] (8) /usr/local/lib/libmxnet.so(MXKVStorePushEx+0x205) [0x7ffa6993d1d5]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7ffab18e9e20]
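
The P2P matrices printed by comm.h:752 can be cross-checked against the counts on the comm.h:743 lines. The sketch below uses a hypothetical helper (`count_p2p_pairs` is not part of MXNet) and assumes that `v` marks an ordered GPU pair with direct access enabled and `.` marks either the diagonal or a pair without direct access. That reading is an assumption, but it matches the counts in the log.

```python
# Hypothetical helper, not MXNet code: count enabled P2P pairs in a
# matrix copied from the log. Assumes 'v' = direct access enabled.

def count_p2p_pairs(rows):
    """Return (enabled, total) ordered GPU pairs, ignoring the diagonal."""
    n = len(rows)
    enabled = sum(
        1
        for i, row in enumerate(rows)
        for j, cell in enumerate(row)
        if i != j and cell == "v"
    )
    return enabled, n * (n - 1)


# 5-GPU and 8-GPU matrices as they appear in the log above.
FIVE_GPUS = [".vvvv", "v.vv.", "vv.v.", "vvv..", "v...."]
EIGHT_GPUS = [
    ".vvvv...",
    "v.vv.v..",
    "vv.v..v.",
    "vvv....v",
    "v....vvv",
    ".v..v.vv",
    "..v.vv.v",
    "...vvvv.",
]

if __name__ == "__main__":
    print(count_p2p_pairs(FIVE_GPUS))   # (14, 20)
    print(count_p2p_pairs(EIGHT_GPUS))  # (32, 56)
```

In the 8-GPU matrix every GPU has direct access to exactly four peers, which is the topology the tree construction has to work with.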

frankfliu commented 5 years ago

@mxnet-label-bot [Test]

larroy commented 5 years ago

I tried to reproduce this on a p3.16xlarge, but the test did not fail.

Executed:


nosetests -s --with-timer --with-xunit --xunit-file nosetests_unittest_testdevice.xml --verbose tests/python/gpu/test_device.py 2>&1 | tee unittest_testdevice.log

----------------------------------------------------------------------
XML: /home/piotr/mxnet/nosetests_unittest_testdevice.xml
[success] 100.00% test_device.test_device_pushpull: 0.0010s
----------------------------------------------------------------------
Ran 1 test in 0.002s

Compiled with:

#!/bin/bash
set -e
set -x

renice -n 19 -p $$

mkdir -p build && cd build
#cmake -DUSE_CPP_PACKAGE=ON -DUSE_CUDA=OFF -DUSE_OPENMP=OFF -DUSE_OPENCV=ON -DCMAKE_BUILD_TYPE=Debug ..
cmake \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DUSE_CPP_PACKAGE=ON \
    -DUSE_CUDA=ON \
    -DUSE_OPENMP=ON \
    -DUSE_OPENCV=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -GNinja ..
ninja -v
#cmake -DUSE_CPP_PACKAGE=ON -DUSE_CUDA=OFF -DUSE_OPENMP=OFF -DUSE_OPENCV=ON ..
#VERBOSE=1 make -j5
cd ..
if [ ! -d mxnet_py3 ]; then
    virtualenv -p "$(which python3)" mxnet_py3
fi
source mxnet_py3/bin/activate
cd python
pip install -e .
cd ..
pip install opencv-python
pip install ipython
pip install matplotlib
pip install nose
pip install nose-timer

########

commit 722ad7a7de8390372a27cc52725bdcf29b242ea9

larroy commented 5 years ago

I was able to reproduce this on a p3.16xlarge when compiling in release mode. I will try to fix it and keep this issue updated.

cmake \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DUSE_CPP_PACKAGE=ON \
    -DUSE_CUDA=ON \
    -DUSE_OPENMP=ON \
    -DUSE_OPENCV=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -GNinja ..
ninja -v
======================================================================
ERROR: test_device.test_device_pushpull
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/piotr/mxnet_other/mxnet_py3/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/piotr/mxnet_other/tests/python/gpu/test_device.py", line 74, in test_device_pushpull
    check_dense_pushpull('device')
  File "/home/piotr/mxnet_other/tests/python/gpu/test_device.py", line 61, in check_dense_pushpull
    kv_device.push(cur_key, arr_list)
  File "/home/piotr/mxnet_other/python/mxnet/kvstore.py", line 234, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/home/piotr/mxnet_other/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:57:12] ../src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking

Stack trace returned 10 entries:
[bt] (0) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f080c46b2fc]
[bt] (1) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f080c46c6a8]
[bt] (2) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(void mxnet::kvstore::ComputeTreesFromRoot<float>(std::vector<float, std::allocator<float> >*, int, int, float, bool, std::vector<unsigned long, std::allocator<unsigned long> >*, std::vector<unsigned long, std::allocator<unsigned long> >*)+0x1621) [0x7f080fd6e8e1]
[bt] (3) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(void mxnet::kvstore::ComputeTrees<float>(std::vector<float, std::allocator<float> > const&, int, float, bool, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*)+0x2f3) [0x7f080fd6ef13]
[bt] (4) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDeviceTree::QueryTopology()+0xefd) [0x7f080fd7126d]
[bt] (5) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDeviceTree::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0xf70) [0x7f080fd726b0]
[bt] (6) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x1b8) [0x7f080fd73858]
[bt] (7) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::Push(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0xc5) [0x7f080fd4efb5]
[bt] (8) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(MXKVStorePushEx+0x16d) [0x7f080ff1a7fd]
[bt] (9) /home/piotr/mxnet_other/mxnet_py3/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f083369ae20]

larroy commented 5 years ago

I found the root cause: graph partitioning fails on 8-GPU machines such as the p3.16xlarge or DGX-1 when the tree mode is used in kvstore. We need to fix the graph partitioning and add a regression test.

In this case Kernighan-Lin (K-L) fails to find a graph partition, and the BFS fallback doesn't seem to work (it is also not well covered by unit tests).
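
To illustrate what a valid partition of this topology could look like, here is a minimal sketch that searches exhaustively for a balanced bisection of the 8-GPU graph. It is not MXNet's Kernighan-Lin implementation and does not reproduce the bug. It assumes the P2P access matrix from the log is a fair stand-in for the link graph (MXNet may use link weights instead), and it assumes that a usable partition means two equal halves that are each connected. All function names are hypothetical.

```python
# Illustrative brute-force bisection, NOT the K-L code in gpu_topology.h.
# Assumption: 'v' in the log's P2P matrix means an undirected edge.
from itertools import combinations

P2P_ROWS = [
    ".vvvv...",
    "v.vv.v..",
    "vv.v..v.",
    "vvv....v",
    "v....vvv",
    ".v..v.vv",
    "..v.vv.v",
    "...vvvv.",
]


def parse_adjacency(rows):
    """Build {node: set(neighbours)} from the 'v'/'.' matrix."""
    adj = {i: set() for i in range(len(rows))}
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if cell == "v" and i != j:
                adj[i].add(j)
    return adj


def is_connected(nodes, adj):
    """Depth-first search restricted to `nodes`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u] & nodes:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def best_bisection(adj):
    """Return (cut_size, left, right) with the fewest cut edges, or None."""
    nodes = sorted(adj)
    half = len(nodes) // 2
    best = None
    # Pin the first node to the left half so mirrored splits are not repeated.
    for rest in combinations(nodes[1:], half - 1):
        left = {nodes[0], *rest}
        right = set(nodes) - left
        if not (is_connected(left, adj) and is_connected(right, adj)):
            continue
        cut = sum(1 for u in left for v in adj[u] if v in right)
        if best is None or cut < best[0]:
            best = (cut, left, right)
    return best


if __name__ == "__main__":
    print(best_bisection(parse_adjacency(P2P_ROWS)))
```

Under these assumptions the search finds the split {0, 1, 2, 3} / {4, 5, 6, 7} with 4 cut edges, so a valid bisection exists for this topology. A check along these lines could serve as a starting point for the regression test mentioned above.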