apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Hanging flaky test test_operator.test_norm @ Python 3: GPU Win #11509

Open marcoabreu opened 6 years ago

marcoabreu commented 6 years ago

test_operator.test_norm causes hangs in the Python 3: GPU Win job.

http://jenkins.mxnet-ci.amazon-ml.com/view/Master%20Jobs/job/incubator-mxnet/job/master/1109/

test_operator.test_pad ... ok
test_operator.test_instance_normalization ... ok
test_operator.test_l2_normalization ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
ok
test_operator.test_norm ... Sending interrupt signal to process
After 10s process did not stop

marcoabreu commented 6 years ago

On other occasions, it causes assertion errors: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1108/pipeline

======================================================================
FAIL: test_operator.test_norm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_operator.py", line 3074, in test_norm
    check_numeric_gradient(norm_sym, [in_data], numeric_eps=epsilon, rtol=1e-2, atol=1e-3)
  File "/work/mxnet/python/mxnet/test_utils.py", line 914, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.028974 exceeds tolerance rtol=0.010000, atol=0.001000.  Location of maximum error:(0, 1, 4, 5, 0, 0), a=-0.003815, b=-0.002757
 NUMERICAL_data: array([[[[[[ 0.01990795, -0.06532669,  0.00363588, ...,  0.01215935,
            -0.00745058, -0.00047684],
           [-0.00017881, -0.10442734,  0.01651049, ..., -0.03290176,...
 BACKWARD_data: array([[[[[[ 0.01993995, -0.06537328,  0.00367729, ...,  0.01226579,
            -0.0074518 , -0.0004907 ],
           [-0.00015076, -0.10442342,  0.01648286, ..., -0.03304606,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=615595105 to reproduce.
--------------------- >> end captured logging << ---------------------

eric-haibin-lin commented 6 years ago

@anirudhacharya

anirudhacharya commented 6 years ago

I will look into it

leezu commented 6 years ago

The latter error could be because numpy.linalg.norm is not numerically stable (neither is the mxnet implementation). #11573 may fix this issue.
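
For illustration, a minimal standalone NumPy sketch (not the MXNet test code; the data range, seed reuse, and eps are my own assumptions) of why a central-difference gradient check of the L2 norm is fragile: the analytic gradient is x / ||x||, and for entries near zero the float32 finite-difference estimate is dominated by rounding noise, matching the small-magnitude mismatches in the log above.

import numpy as np

# Small float32 inputs; the range and the reuse of the logged seed are assumptions.
rng = np.random.RandomState(615595105)
x = rng.uniform(-0.1, 0.1, size=64).astype(np.float32)
eps = np.float32(1e-4)  # finite-difference step (assumed, not the test's epsilon)

# Analytic gradient of the L2 norm: d||x||/dx_i = x_i / ||x||.
analytic = x / np.linalg.norm(x)

# Central-difference estimate of the same gradient, one coordinate at a time.
numeric = np.empty_like(x)
for i in range(x.size):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    numeric[i] = (np.linalg.norm(xp) - np.linalg.norm(xm)) / (2 * eps)

# The relative disagreement tends to be largest for the entries closest to zero,
# where float32 rounding in the two nearly equal norms dominates the difference.
rel_err = np.abs(numeric - analytic) / (np.abs(analytic) + 1e-3)
print("max relative error:", rel_err.max(), "at |x| =", abs(x[rel_err.argmax()]))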

szha commented 6 years ago

FYI @anirudhacharya the test has been partially re-enabled in #11573.

KellenSunderland commented 6 years ago

@szha Would it make sense to close this issue and re-open it if we see test issues with test_norm again?

anirudhacharya commented 6 years ago

@KellenSunderland as Sheng mentioned, the test has only been partially re-enabled; certain test cases are still disabled, and the fix that was merged does not address the occasional hanging of the test on the CI.

leezu commented 6 years ago

@anirudh2290 @KellenSunderland Only the part that caused assertion errors during numerical checking of the gradient is disabled. It is not clear whether the hang occurred during check_numeric_gradient or during the other parts of the test. If the hang occurred outside of check_numeric_gradient, it is fixed by the change to the underlying norm implementation; otherwise the part that hung is now disabled.

You may want to rename this issue to make it clear that a part of the test is currently disabled due to the numerical instability exposed by check_numeric_gradient.

haojin2 commented 6 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/22/pipeline/858

======================================================================
FAIL: test_operator_gpu.test_norm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_ndarray.py", line 1324, in test_norm
    mx.test_utils.assert_almost_equal(npy_out, mx_out.asnumpy())
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 200000.015625 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(0, 0, 0, 0), a=0.567434, b=-0.567434
 a: array([[[[0.56743413, 0.3482769 ],
         [0.76402813, 0.21444368],
         [0.26532394, 0.35806283],...
 b: array([[[[-0.56743413, -0.3482769 ],
         [-0.76402813, -0.21444368],
         [-0.26532394, -0.35806283],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1479125905 to reproduce.
--------------------- >> end captured logging << ---------------------

marcoabreu commented 6 years ago

@leezu

szha commented 6 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/24/pipeline/861

leezu commented 6 years ago

I'm looking into the issue now.

leezu commented 6 years ago

I can't reproduce this on p3 and p2 instances so far. I'll keep running this test in a loop for now. Do the tests run on a different instance type?

szha commented 6 years ago

Hao and I weren't able to reproduce the error on the (supposedly) same instance type that the CI slaves use, using the same docker build logic.

marcoabreu commented 6 years ago

G3

Are you running with our docker setup?

haojin2 commented 6 years ago

Yes, Sheng and I were using a g3.8xlarge plus the instructions from https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results

leezu commented 6 years ago

@marcoabreu is it feasible to restart the CI pipeline with the same global seed (not only the test seed)? Jenkins should allow restarting the pipeline from http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/24/pipeline/861/ while logged in, but I'm not sure how the global seed would be set up.

marcoabreu commented 6 years ago

I don't think this is feasible, considering the setup is entirely reproducible with docker. Also, the cost is not worth the effort.

I might be stating the obvious, but did you use the latest master or the commit at that time to reproduce the error?

szha commented 6 years ago

@marcoabreu Hao and I have been using my branch and I imagine @leezu is doing the same.

anirudhacharya commented 6 years ago

Update on this issue -

I ran this test case ~5k times under different scenarios on a Linux GPU machine (an EC2 p2.8xlarge instance). Currently the reasons for the test failures are:

  1. check_numeric_gradient has precision issues when run with np.float16 and np.float32. I have an updated set of atol and rtol values that can fix this, but the new values still have to be verified with repeated runs of the test (a sketch follows at the end of this comment).
  2. The bigger bug is that the test intermittently hangs while performing check_numeric_gradient. check_symbolic_backward and check_symbolic_forward work correctly and consistently.

The above two issues are only reproducible when run against the CPU context. When run against the GPU, the test neither hangs nor has accuracy issues.

I still need to check the same on a Windows instance.
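
For reference, a rough sketch of what a dtype-dependent tolerance check for this operator could look like (the tolerance values, shapes, and epsilon below are placeholders of my own, not the values being verified for the test suite):

import numpy as np
import mxnet as mx
from mxnet.test_utils import check_numeric_gradient

# Placeholder (rtol, atol) pairs per dtype -- assumed, looser for lower precision.
tolerances = {np.float16: (1e-1, 1e-2),
              np.float32: (1e-2, 1e-3),
              np.float64: (1e-3, 1e-4)}

data = mx.sym.Variable('data')
norm_sym = mx.sym.norm(data=data)  # default norm over the whole array

for dtype, (rtol, atol) in tolerances.items():
    in_data = np.random.uniform(-1.0, 1.0, (5, 5)).astype(dtype)
    try:
        check_numeric_gradient(norm_sym, [in_data], numeric_eps=1e-3,
                               rtol=rtol, atol=atol)
        print(dtype.__name__, "passed")
    except AssertionError as err:
        print(dtype.__name__, "failed:", err)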

anirudhacharya commented 6 years ago

With regards to the hanging check_numeric_gradient -

It is not reproducible when I run it in a single-threaded environment with MXNET_TEST_COUNT=500 MXNET_ENGINE_TYPE=NaiveEngine nosetests --nocapture --verbose tests/python/unittest/test_operator.py:test_norm.

It only hangs in a multi-threaded environment, due to resource contention. I have been using strace -p <PID> to inspect the system calls made while this test runs, but I have yet to narrow down the exact source of the resource contention.

The strace logs have recurring occurrences of the following system calls from competing threads -

101312 futex(0x22c1f90, FUTEX_WAKE_PRIVATE, 1) = 0
101169 futex(0x3b1f5bc, FUTEX_WAIT_PRIVATE, 6862611, NULL <unfinished ...>
101312 futex(0x7f19a4005ce4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
101341 futex(0x7f19a4015194, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
101312 futex(0x3b1f5bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x3b1f590, 6862612 <unfinished ...>
101341 <... futex resumed> )            = 0
101169 <... futex resumed> )            = 0
101312 <... futex resumed> )            = 1
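
The futex traffic above shows native-level lock contention but not which Python frames are blocked. One possible complement to strace (a suggestion on my part, not something used in this thread) is Python's standard faulthandler module, which can dump the Python-level stack of every thread in the hung process if enabled early in the test run:

import faulthandler
import signal
import sys

# Dump all thread stacks when the test process receives SIGUSR1, e.g. via
# `kill -USR1 <PID>` from another shell while the test is hanging.
# (faulthandler.register is not available on Windows.)
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Alternatively, dump the stacks automatically if the process is still running
# after 10 minutes, roughly in line with the CI hang timeout.
faulthandler.dump_traceback_later(timeout=600, exit=False)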