apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Flaky test: test_ops.test_convolution2d #16770

Open haojin2 opened 4 years ago

haojin2 commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16720/6/pipeline/

test_ops.test_convolution2d ... [23:44:59] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution2) is not supported by TensorRT

[23:44:59] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution2) is not supported by TensorRT

[23:44:59] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution3) is not supported by TensorRT

[23:44:59] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution3) is not supported by TensorRT

[23:45:00] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution6) is not supported by TensorRT

[23:45:00] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution6) is not supported by TensorRT

[23:45:00] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution7) is not supported by TensorRT

[23:45:00] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution7) is not supported by TensorRT

[23:45:02] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution10) is not supported by TensorRT

[23:45:02] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution10) is not supported by TensorRT

[23:45:02] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution11) is not supported by TensorRT

[23:45:02] /work/mxnet/src/operator/subgraph/tensorrt/./tensorrt-inl.h:159: Warning: NHWC layout (node: convolution11) is not supported by TensorRT

Segmentation fault: 11

Stack trace:

  [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0x3987d59) [0x7f00a9192d59]

  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f01298c0f20]

  [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::nnvm_to_onnx::ConvertConstant(onnx::GraphProto*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::NDArray> > > const*)+0xa11) [0x7f00aa167c21]

  [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::nnvm_to_onnx::ConvertNnvmGraphToOnnx(nnvm::Graph const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::NDArray> > >*)+0xfcd) [0x7f00aa16d3ad]

  [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::TRTCreateState(nnvm::NodeAttrs const&, mxnet::Context, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, std::vector<int, std::allocator<int> > const&)+0xd52) [0x7f00aa171e22]

  [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<mxnet::OpStatePtr (nnvm::NodeAttrs const&, mxnet::Context, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, std::vector<int, std::allocator<int> > const&), mxnet::OpStatePtr (*)(nnvm::NodeAttrs const&, mxnet::Context, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, std::vector<int, std::allocator<int> > const&)>::_M_invoke(std::_Any_data const&, nnvm::NodeAttrs const&, mxnet::Context&&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, std::vector<int, std::allocator<int> > const&)+0x2f) [0x7f00a9033dcf]

  [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::CreateOpExecs(nnvm::Graph const&, std::vector<std::shared_ptr<mxnet::exec::OpExecutor>, std::allocator<std::shared_ptr<mxnet::exec::OpExecutor> > >*, std::vector<mxnet::OpStatePtr, std::allocator<mxnet::OpStatePtr> >*, unsigned long)+0x102b) [0x7f00a90db63b]

  [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::AttachOpExecs(nnvm::Graph)+0x100) [0x7f00a90dc660]

  [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::GraphExecutor::FinishInitGraph(nnvm::Symbol, nnvm::Graph, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x4dc) [0x7f00a910176c]

haojin2 commented 4 years ago

Happening again at: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16786/10/pipeline/362

======================================================================

FAIL: test_ops.test_deconvolution2d

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/python/tensorrt/../unittest/common.py", line 177, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/tensorrt/test_ops.py", line 210, in test_deconvolution2d

    rtol_fp16=rtol_fp16, atol_fp16=atol_fp16)

  File "/work/mxnet/tests/python/tensorrt/test_ops.py", line 107, in check_single_sym

    assert_allclose(fp32, orig, rtol=rtol_fp32, atol=atol_fp32)

  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1515, in assert_allclose

    verbose=verbose, header=header, equal_nan=equal_nan)

  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 841, in assert_array_compare

    raise AssertionError(msg)

AssertionError: 

Not equal to tolerance rtol=1e-06, atol=0

Mismatch: 0.0349%

Max absolute difference: 4.7683716e-07

Max relative difference: 1.682386e-06

 x: array([[[[1.121419, 0.82041 , 1.309779, ..., 0.503725, 0.834245,

          0.764753],

         [0.745547, 1.025158, 0.686113, ..., 0.813022, 0.677198,...

 y: array([[[[1.121419, 0.82041 , 1.309779, ..., 0.503725, 0.834245,

          0.764753],

         [0.745547, 1.025158, 0.686113, ..., 0.813022, 0.677198,...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=573353477 to reproduce.

--------------------- >> end captured logging << ---------------------
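
For context on the assertion above: `assert_allclose` accepts an element when `|actual - desired| <= atol + rtol * |desired|`. The sketch below applies that criterion to a single synthetic pair whose relative difference equals the reported maximum (1.682386e-06). The helper `within_tolerance` and the synthetic values are illustrative assumptions, not code from `test_ops.py`.

```python
import numpy as np


def within_tolerance(actual, desired, rtol, atol=0.0):
    """Element-wise criterion documented for numpy.testing.assert_allclose."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    return np.abs(actual - desired) <= atol + rtol * np.abs(desired)


# Synthetic pair (assumption): relative difference equal to the maximum
# reported in the log above.
desired = 1.0
actual = desired * (1.0 + 1.682386e-06)

fails_at_log_rtol = not bool(within_tolerance(actual, desired, rtol=1e-06))
passes_at_looser_rtol = bool(within_tolerance(actual, desired, rtol=1e-05))

print("rejected at rtol=1e-06:", fails_at_log_rtol)
print("accepted at rtol=1e-05:", passes_at_looser_rtol)
```

Whether loosening the tolerance is the right fix, as opposed to tracking down the source of the nondeterminism, is a separate question that this sketch does not answer.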

haojin2 commented 4 years ago

Happening again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16801/13/pipeline/362

======================================================================

FAIL: test_ops.test_deconvolution2d

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/python/tensorrt/../unittest/common.py", line 177, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/tensorrt/test_ops.py", line 222, in test_deconvolution2d

    rtol_fp16=rtol_fp16, atol_fp16=atol_fp16)

  File "/work/mxnet/tests/python/tensorrt/test_ops.py", line 107, in check_single_sym

    assert_allclose(fp32, orig, rtol=rtol_fp32, atol=atol_fp32)

  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1515, in assert_allclose

    verbose=verbose, header=header, equal_nan=equal_nan)

  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 841, in assert_array_compare

    raise AssertionError(msg)

AssertionError: 

Not equal to tolerance rtol=5e-05, atol=0

Mismatch: 0.00551%

Max absolute difference: 1.2367964e-05

Max relative difference: 5.450864e-05

 x: array([[[[0.797838, 0.974776, 2.300395, ..., 1.709959, 2.381261,

          1.535094],

         [1.315217, 2.675796, 5.480255, ..., 4.280962, 5.190691,...

 y: array([[[[0.797841, 0.974777, 2.300396, ..., 1.709959, 2.38126 ,

          1.535094],

         [1.315219, 2.675797, 5.480257, ..., 4.280962, 5.190691,...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=609880636 to reproduce.

--------------------- >> end captured logging << ---------------------
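
The captured logging suggests re-running with the printed seed. A minimal sketch of how that could be scripted follows; it assumes the command is run from the repository root with nose installed, and `build_repro` is a hypothetical helper, not part of the MXNet test suite. If the flakiness comes from nondeterministic GPU kernels rather than from the random inputs, a fixed seed may not reproduce the failure.

```python
import os


def build_repro(seed, test="test_deconvolution2d"):
    """Return (env, cmd) for re-running one test with a fixed seed."""
    env = dict(os.environ)
    # MXNET_TEST_SEED is the variable named in the captured logging above.
    env["MXNET_TEST_SEED"] = str(seed)
    # Assumption: run from the repository root with nose installed.
    cmd = ["nosetests", "--verbose", f"tests/python/tensorrt/test_ops.py:{test}"]
    return env, cmd


env, cmd = build_repro(609880636)
print("MXNET_TEST_SEED=" + env["MXNET_TEST_SEED"], " ".join(cmd))
# To actually run it: subprocess.run(cmd, env=env, check=False)
```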

haojin2 commented 4 years ago

@ptrendx @DickJC123 This is happening quite often in the TensorRT tests. Could you take a look? I believe it may also be flaky on the 1.6.0 branch.

TaoLv commented 4 years ago

Also happens here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17138/6/pipeline

======================================================================

FAIL: test_ops.test_convolution2d

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/python/tensorrt/../unittest/common.py", line 221, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/tensorrt/test_ops.py", line 159, in test_convolution2d

    rtol_fp16=rtol_fp16, atol_fp16=atol_fp16)

  File "/work/mxnet/tests/python/tensorrt/test_ops.py", line 107, in check_single_sym

    assert_allclose(fp32, orig, rtol=rtol_fp32, atol=atol_fp32)

  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose

    verbose=verbose, header=header, equal_nan=equal_nan)

  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare

    raise AssertionError(msg)

AssertionError: 

Not equal to tolerance rtol=0, atol=0

Mismatched elements: 7141 / 12544 (56.9%)

Max absolute difference: 1.1920929e-06

Max relative difference: 4.121238e-07

 x: array([[[[2.720627, 3.646537, 2.396177, ..., 2.24084 , 2.384826,

          3.200811],

         [1.746048, 3.376214, 2.557979, ..., 2.242502, 2.777753,...

 y: array([[[[2.720627, 3.646537, 2.396177, ..., 2.24084 , 2.384826,

          3.200811],

         [1.746048, 3.376214, 2.557979, ..., 2.242502, 2.777753,...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=752792231 to reproduce.

--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
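
The `rtol=0, atol=0` in this failure means the comparison passes only on bit-identical results. The sketch below puts the reported maximum absolute difference in terms of float32 spacing. It assumes the differing elements have the same magnitude as the printed values (between 2 and 4), which the log does not confirm.

```python
import numpy as np

reported_max_abs_diff = 1.1920929e-06  # from the log above

# Assumption: the differing elements lie between 2 and 4, like the printed values.
sample = np.float32(2.720627)
ulp = float(np.spacing(sample))  # gap to the next representable float32

ulps_apart = reported_max_abs_diff / ulp
print("float32 spacing near", sample, "is", ulp)
print("reported difference in ULPs:", ulps_apart)

# With rtol=0 and atol=0, even adjacent float32 values do not compare as close.
neighbour = np.nextafter(sample, np.float32(np.inf))
exact_only = not np.allclose(sample, neighbour, rtol=0, atol=0)
print("adjacent values rejected:", exact_only)
```

Under that assumption the mismatch is a handful of float32 rounding steps, which would be consistent with a different accumulation order between the two code paths, though the thread does not establish the cause.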

larroy commented 4 years ago

Another occurrence:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15990/20/pipeline