Open larroy opened 5 years ago
@mxnet-label-bot [Test]
I tried to reproduce this on a p3.16xlarge but could not get the test to fail.
Executed:
nosetests -s --with-timer --with-xunit --xunit-file nosetests_unittest_testdevice.xml --verbose tests/python/gpu/test_device.py 2>&1 | tee unittest_testdevice.log
----------------------------------------------------------------------
XML: /home/piotr/mxnet/nosetests_unittest_testdevice.xml
[success] 100.00% test_device.test_device_pushpull: 0.0010s
----------------------------------------------------------------------
Ran 1 test in 0.002s
Compiled with:
#!/bin/bash
set -e
set -x
renice -n 19 -p $$
mkdir -p build && cd build
#cmake -DUSE_CPP_PACKAGE=ON -DUSE_CUDA=OFF -DUSE_OPENMP=OFF -DUSE_OPENCV=ON -DCMAKE_BUILD_TYPE=Debug ..
cmake \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DUSE_CPP_PACKAGE=ON \
  -DUSE_CUDA=ON \
  -DUSE_OPENMP=ON \
  -DUSE_OPENCV=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -GNinja ..
ninja -v
#cmake -DUSE_CPP_PACKAGE=ON -DUSE_CUDA=OFF -DUSE_OPENMP=OFF -DUSE_OPENCV=ON ..
#VERBOSE=1 make -j5
cd ..
if [ ! -d mxnet_py3 ]; then
  virtualenv -p "$(which python3)" mxnet_py3
fi
source mxnet_py3/bin/activate
cd python
pip install -e .
cd ..
pip install opencv-python
pip install ipython
pip install matplotlib
pip install nose
pip install nose-timer
########
commit 722ad7a7de8390372a27cc52725bdcf29b242ea9
I was able to reproduce this on a p3.16xlarge when compiling in release mode. I will try to fix it and keep this issue updated.
cmake \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DUSE_CPP_PACKAGE=ON \
  -DUSE_CUDA=ON \
  -DUSE_OPENMP=ON \
  -DUSE_OPENCV=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -GNinja ..
ninja -v
======================================================================
ERROR: test_device.test_device_pushpull
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/piotr/mxnet_other/mxnet_py3/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/home/piotr/mxnet_other/tests/python/gpu/test_device.py", line 74, in test_device_pushpull
check_dense_pushpull('device')
File "/home/piotr/mxnet_other/tests/python/gpu/test_device.py", line 61, in check_dense_pushpull
kv_device.push(cur_key, arr_list)
File "/home/piotr/mxnet_other/python/mxnet/kvstore.py", line 234, in push
self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
File "/home/piotr/mxnet_other/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:57:12] ../src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking
Stack trace returned 10 entries:
[bt] (0) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f080c46b2fc]
[bt] (1) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f080c46c6a8]
[bt] (2) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(void mxnet::kvstore::ComputeTreesFromRoot<float>(std::vector<float, std::allocator<float> >*, int, int, float, bool, std::vector<unsigned long, std::allocator<unsigned long> >*, std::vector<unsigned long, std::allocator<unsigned long> >*)+0x1621) [0x7f080fd6e8e1]
[bt] (3) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(void mxnet::kvstore::ComputeTrees<float>(std::vector<float, std::allocator<float> > const&, int, float, bool, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*)+0x2f3) [0x7f080fd6ef13]
[bt] (4) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDeviceTree::QueryTopology()+0xefd) [0x7f080fd7126d]
[bt] (5) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDeviceTree::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0xf70) [0x7f080fd726b0]
[bt] (6) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x1b8) [0x7f080fd73858]
[bt] (7) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::Push(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0xc5) [0x7f080fd4efb5]
[bt] (8) /home/piotr/mxnet_other/python/mxnet/../../build/libmxnet.so(MXKVStorePushEx+0x16d) [0x7f080ff1a7fd]
[bt] (9) /home/piotr/mxnet_other/mxnet_py3/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f083369ae20]
I found the root cause: graph partitioning fails on 8-GPU machines such as p3.16xlarge or DGX1 when kvstore uses the tree. We need to fix graph partitioning and add a regression test.
In this case the Kernighan-Lin (K-L) heuristic fails to find a graph partition, and the BFS fallback doesn't seem to work (it is also not well unit tested at the moment).
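To make the partitioning failure above concrete, here is a minimal sketch of a brute-force reference check that a regression test could compare against. It is not MXNet's K-L implementation: the function names (`is_connected`, `balanced_bisections`) and the two 8-node link matrices (`ring`, `star`) are hypothetical, and neither matrix is the real NVLink layout of a p3.16xlarge or DGX1. The check enumerates every split into two equal halves and keeps the ones where both halves stay connected.

```python
from itertools import combinations


def is_connected(nodes, adj):
    """Return True if `nodes` form a connected subgraph of `adj`."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in nodes:
            if v not in seen and adj[u][v] > 0:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def balanced_bisections(adj):
    """List every split into two equal, internally connected halves."""
    n = len(adj)
    result = []
    for half in combinations(range(n), n // 2):
        if 0 not in half:
            continue  # skip mirrored duplicates of the same split
        other = [i for i in range(n) if i not in half]
        if is_connected(half, adj) and is_connected(other, adj):
            result.append((list(half), other))
    return result


N = 8  # hypothetical 8-GPU machine

# Hypothetical topology 1: each GPU linked to its two ring neighbours.
ring = [[0] * N for _ in range(N)]
for i in range(N):
    ring[i][(i + 1) % N] = ring[(i + 1) % N][i] = 1

# Hypothetical topology 2: GPU 0 linked to everyone, no other links.
star = [[0] * N for _ in range(N)]
for i in range(1, N):
    star[0][i] = star[i][0] = 1

print(len(balanced_bisections(ring)))  # ring has valid splits
print(balanced_bisections(star))       # star has none
```

A test built this way would assert that the partitioner returns a valid split whenever the reference check finds one, and fails cleanly (instead of aborting) when none exists.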
Description
NVIDIA reported a failure in test_device.test_device_pushpull on a DGX1V.
I suspect there is a bug in the binary tree creation. I'm looking into this issue.
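For readers unfamiliar with the error text ("No valid binary tree found from root 2 using backtracking"), the sketch below shows the general idea of a backtracking search for a spanning tree in which every node has at most two children. It is a simplified illustration, not the code in gpu_topology.h: `find_binary_tree` and both link matrices are hypothetical, and MXNet's real tree layout and link weights are not modelled.

```python
def find_binary_tree(adj, root):
    """Backtracking search for a spanning tree rooted at `root` where each
    node has at most two children and every edge is an existing link.

    Returns a {node: parent} dict (the root maps to None), or None if no
    such tree exists.
    """
    n = len(adj)
    parent = {root: None}
    child_count = [0] * n

    def extend():
        if len(parent) == n:
            return True  # every node is attached
        for u in list(parent):  # nodes already in the tree
            if child_count[u] == 2:
                continue
            for v in range(n):
                if v in parent or adj[u][v] <= 0:
                    continue
                # Tentatively attach v under u, then recurse.
                parent[v] = u
                child_count[u] += 1
                if extend():
                    return True
                # Undo the choice and try the next candidate.
                del parent[v]
                child_count[u] -= 1
        return False

    return dict(parent) if extend() else None


N = 8  # hypothetical 8-GPU machine

# Hypothetical ring topology: a chain from the root covers all GPUs.
ring = [[0] * N for _ in range(N)]
for i in range(N):
    ring[i][(i + 1) % N] = ring[(i + 1) % N][i] = 1

# Hypothetical star topology: GPU 0 would need more than two children.
star = [[0] * N for _ in range(N)]
for i in range(1, N):
    star[0][i] = star[i][0] = 1

print(find_binary_tree(ring, root=2))
print(find_binary_tree(star, root=2))
```

In the star case the search exhausts every choice and returns `None`, which is the kind of situation the error message reports.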