apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[CI][NightlyTestsForBinaries] Test Large Tensor: GPU Failing #14981

Open perdasilva opened 5 years ago

perdasilva commented 5 years ago

Description

The "Test Large Tensor: GPU" step is failing with:

======================================================================
ERROR: test_large_array.test_ndarray_random_randint
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/nightly/test_large_array.py", line 70, in test_ndarray_random_randint
    assert a.__gt__(low) & a.__lt__(high)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 336, in __gt__
    return greater(self, other)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 3376, in greater
    _internal._lesser_scalar)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2704, in _ufunc_helper
    return fn_array(lhs, rhs)
  File "<string>", line 46, in broadcast_greater
  File "/work/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/work/mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:39:26] /work/mxnet/src/io/../operator/elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node  at 1-th input: expected int32, got int64
Stack trace:
  [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x3c) [0x7fa0e59e8b3c]
  [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseAttr<int, &mxnet::op::type_is_none, &mxnet::op::type_assign, true, &mxnet::op::type_string[abi:cxx11], -1l, -1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, int const&)::{lambda(std::vector<int, std::allocator<int> > const&, unsigned long, char const*)#1}::operator()(std::vector<int, std::allocator<int> > const&, unsigned long, char const*) const+0x62d) [0x7fa0e8c6866d]
  [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseAttr<int, &mxnet::op::type_is_none, &mxnet::op::type_assign, true, &mxnet::op::type_string[abi:cxx11], -1l, -1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, int const&)+0x2f3) [0x7fa0e8f963a3]
  [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseType<2l, 1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*)+0x34d) [0x7fa0e8f968ed]
  [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<bool (nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*), bool (*)(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*)>::_M_invoke(std::_Any_data const&, nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*&&, std::vector<int, std::allocator<int> >*&&)+0x1d) [0x7fa0e8bb909d]
  [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x6a5) [0x7fa0e8c28e35]
  [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x10b) [0x7fa0e8c0f52b]
  [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x1c9) [0x7fa0e8a8a479]
  [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(MXImperativeInvokeEx+0x8f) [0x7fa0e8a8a97f]

-------------------- >> begin captured logging << --------------------
tests.python.unittest.common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2073509752 to reproduce.
--------------------- >> end captured logging << ---------------------

See http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/312/pipeline/144 for the full log.
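The failure is a dtype mismatch during type inference for broadcast_greater: the two operands of the comparison do not share a dtype (one int32, one int64). The snippet below is only a minimal sketch of that kind of mismatch under assumed arguments, not the code from test_large_array.py, and whether nd.random.randint accepts dtype='int64' depends on the build.

# Minimal sketch (not the actual nightly test) of the dtype mismatch above:
# broadcast_greater infers a single dtype for both inputs, so comparing an
# int64 NDArray against an int32 NDArray fails during type inference.
import mxnet as mx

a = mx.nd.random.randint(low=0, high=10, shape=(5,), dtype='int64')
low = mx.nd.array([5], dtype='int32')  # dtype deliberately differs from `a`

try:
    result = a > low  # NDArray.__gt__ -> greater -> broadcast_greater
except mx.base.MXNetError as err:
    print(err)        # "Incompatible attr in node ..." as in the log above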

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Test, CI

vdantu commented 5 years ago

@mxnet-label-bot add [test] @apeforest

roywei commented 5 years ago

Fixed in the latest run; we can close this now: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/320/pipeline

roywei commented 5 years ago

Actually, we can't close it yet: this test was fixed but went back to failing after https://github.com/apache/incubator-mxnet/pull/15059. A similar OOM issue is tracked in https://github.com/apache/incubator-mxnet/issues/14980.

roywei commented 5 years ago

Currently, both the CPU and GPU tests have been disabled due to the same memory issue. After a discussion with @access2rohit and @apeforest, we can try a few things:

  1. Change to P3 instances here: https://github.com/apache/incubator-mxnet/blob/master/tests/nightly/JenkinsfileForBinaries#L82
  2. Further increase the shared memory to 50 GB (see the sketch below).
  3. Stop running the large tensor test in parallel with other tests.

We are having problems testing the above solutions on CI machines that have multiple jobs running in parallel.
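For option 2, the shared-memory limit is a property of the container the nightly job runs in, so raising it amounts to passing a larger --shm-size to docker run. The snippet below is only a hypothetical illustration of that idea; the image name and test command are placeholders, not the actual MXNet CI configuration.

# Hypothetical illustration of option 2: run the nightly test in a container
# with a larger /dev/shm via docker's --shm-size flag. Image and command are
# placeholders, not the real CI setup.
import subprocess

subprocess.run([
    "docker", "run", "--rm",
    "--shm-size=50g",                        # raise shared memory to ~50 GB
    "mxnet/ci-placeholder-image",            # placeholder image name
    "nosetests", "tests/nightly/test_large_array.py",
], check=True)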

roywei commented 5 years ago

The test still failed with 200 GB of shared memory on a p3.2xlarge instance; we need another approach for testing large tensors.