marcoabreu opened 6 years ago
On other occasions, it causes assertion errors: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1108/pipeline
======================================================================
FAIL: test_operator.test_norm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/work/mxnet/tests/python/unittest/common.py", line 157, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/unittest/test_operator.py", line 3074, in test_norm
check_numeric_gradient(norm_sym, [in_data], numeric_eps=epsilon, rtol=1e-2, atol=1e-3)
File "/work/mxnet/python/mxnet/test_utils.py", line 914, in check_numeric_gradient
("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.028974 exceeds tolerance rtol=0.010000, atol=0.001000. Location of maximum error:(0, 1, 4, 5, 0, 0), a=-0.003815, b=-0.002757
NUMERICAL_data: array([[[[[[ 0.01990795, -0.06532669, 0.00363588, ..., 0.01215935,
-0.00745058, -0.00047684],
[-0.00017881, -0.10442734, 0.01651049, ..., -0.03290176,...
BACKWARD_data: array([[[[[[ 0.01993995, -0.06537328, 0.00367729, ..., 0.01226579,
-0.0074518 , -0.0004907 ],
[-0.00015076, -0.10442342, 0.01648286, ..., -0.03304606,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=615595105 to reproduce.
--------------------- >> end captured logging << ---------------------
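For context on what the failure above measures: the reported "Error 1.028974" is consistent with the ratio |a - b| / (atol + rtol * |b|) at the worst element, i.e. a value greater than 1 fails the check. Below is a minimal NumPy sketch of that criterion; the formula is inferred from the numbers in the log, and the exact implementation in mxnet.test_utils.assert_almost_equal may differ in detail.

import numpy as np

def max_violation(a, b, rtol, atol):
    # Largest value of |a - b| / (atol + rtol * |b|); anything above 1 fails.
    err = np.abs(a - b) / (atol + rtol * np.abs(b))
    idx = np.unravel_index(np.argmax(err), err.shape)
    return float(err[idx]), idx

# Worst element reported in the log: a = -0.003815, b = -0.002757
a = np.array([0.019908, -0.003815])
b = np.array([0.019940, -0.002757])
print(max_violation(a, b, rtol=1e-2, atol=1e-3))  # ~ (1.03, (1,)), close to the reported 1.028974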
@anirudhacharya
I will look into it
The latter error could be due to the fact that numpy.linalg.norm is not numerically stable (neither is the MXNet one). Possibly #11573 could fix this issue.
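To illustrate the kind of instability being referred to: the analytic gradient of the L2 norm is x / ||x||, and a central finite-difference estimate of it (the general idea behind a numeric gradient check) degrades when the perturbation eps is not small relative to ||x||, or when float32 rounding dominates. This is a toy NumPy sketch, not the MXNet operator or test code:

import numpy as np

def l2norm(x):
    return np.sqrt(np.sum(x * x))

def numeric_grad(f, x, eps=1e-3):
    # Central finite differences, the same idea a numeric gradient check is based on.
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp.flat[i] += eps
        xm.flat[i] -= eps
        g.flat[i] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.RandomState(0)
x = (rng.uniform(-1, 1, size=100) * 1e-3).astype(np.float32)  # small-magnitude inputs
analytic = x / l2norm(x)
numeric = numeric_grad(l2norm, x)
print(np.max(np.abs(analytic - numeric)))  # visibly worse than with O(1) inputs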
FYI @anirudhacharya the test has been partially re-enabled in #11573.
@szha Would it make sense to close this issue and re-open it if we see test issues with test_norm again?
@KellenSunderland as Sheng mentioned, it has only been partially re-enabled; certain test cases are still disabled, and the fix that was merged does not address the occasional hanging of the test case on the CI.
@anirudh2290 @KellenSunderland Only the part that caused assertion errors during numerical checking of the gradient is disabled. It is not clear whether the hang was occurring during check_numeric_gradient or during the other parts of the test. If the hang occurred outside of check_numeric_gradient, it is fixed by the change of the underlying norm implementation; otherwise it is disabled. You may want to rename this issue to make it clear that a part of the test is currently disabled due to the numerical instability of check_numeric_gradient.
======================================================================
FAIL: test_operator_gpu.test_norm
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
return func(*arg, **kw)
File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/gpu/../unittest/test_ndarray.py", line 1324, in test_norm
mx.test_utils.assert_almost_equal(npy_out, mx_out.asnumpy())
File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 200000.015625 exceeds tolerance rtol=0.000010, atol=0.000000. Location of maximum error:(0, 0, 0, 0), a=0.567434, b=-0.567434
a: array([[[[0.56743413, 0.3482769 ],
[0.76402813, 0.21444368],
[0.26532394, 0.35806283],...
b: array([[[[-0.56743413, -0.3482769 ],
[-0.76402813, -0.21444368],
[-0.26532394, -0.35806283],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1479125905 to reproduce.
--------------------- >> end captured logging << ---------------------
@leezu
I'm looking into the issue now.
I can't reproduce this on p3 and p2 instances so far. I keep running this test in a loop for now. Do the tests run on a different instance type?
Hao and I weren't able to reproduce the error on the (supposedly) same instance type that CI slaves use, using the same docker building logic.
G3
Are you running with our docker setup?
yes, Sheng and I were using g3.8xlarge + instructions from https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results
@marcoabreu is it feasible to restart the CI Pipeline with the same global seed (not only test seed)? Jenkins should make it feasible to restart the pipeline when accessing http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/24/pipeline/861/ while being logged in, but I'm not sure about the setup for global seed.
I don't think this is feasible, considering the setup is entirely reproducible with docker. Also the costs are not worth the effort.
I might be stating the obvious, but did you use the latest master or the commit at that time to reproduce the error?
@marcoabreu Hao and I have been using my branch and I imagine @leezu is doing the same.
Update on this issue:
I ran this test case ~5k times for different scenarios on a Linux GPU machine (an EC2 p2.8xlarge instance). Currently the reasons for test failure are:
1. check_numeric_gradient has precision issues when run with np.float16 and np.float32. I have an updated set of atol and rtol values that can fix this, but the new values still have to be verified with repeated runs of the test (a sketch of per-dtype tolerances follows below).
2. The test occasionally hangs in check_numeric_gradient. check_symbolic_backward and check_symbolic_forward work correctly and consistently.
The above two issues are only reproducible when run against the CPU context. When run against the GPU, the test neither hangs nor has accuracy issues.
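One way such dtype-dependent tolerances could be organized is sketched below. The concrete values are placeholders rather than the ones being validated, and the commented call simply mirrors the check_numeric_gradient invocation visible in the traceback above.

import numpy as np

# Hypothetical per-dtype tolerances; the real values still need to be
# confirmed with repeated runs of the test, as noted above.
TOLERANCES = {
    np.dtype(np.float16): dict(rtol=1e-1, atol=1e-2),
    np.dtype(np.float32): dict(rtol=1e-2, atol=1e-3),
    np.dtype(np.float64): dict(rtol=1e-3, atol=1e-4),
}

def tolerances_for(dtype):
    return TOLERANCES[np.dtype(dtype)]

# e.g. check_numeric_gradient(norm_sym, [in_data], numeric_eps=epsilon,
#                             **tolerances_for(in_data.dtype))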
I need to check the same with a Windows instance.
With regards to the hanging check_numeric_gradient:
It is not reproducible when I run the test in a single-threaded environment with MXNET_TEST_COUNT=500 MXNET_ENGINE_TYPE=NaiveEngine nosetests --nocapture --verbose tests/python/unittest/test_operator.py:test_norm.
It only hangs in a multi-threaded environment, due to resource contention. I have been using strace -p <PID> to inspect the system calls made while the test runs, but I have yet to narrow down the exact source of the contention.
The strace logs have recurring occurrences of the following system calls from competing threads:
101312 futex(0x22c1f90, FUTEX_WAKE_PRIVATE, 1) = 0
101169 futex(0x3b1f5bc, FUTEX_WAIT_PRIVATE, 6862611, NULL <unfinished ...>
101312 futex(0x7f19a4005ce4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
101341 futex(0x7f19a4015194, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
101312 futex(0x3b1f5bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x3b1f590, 6862612 <unfinished ...>
101341 <... futex resumed> ) = 0
101169 <... futex resumed> ) = 0
101312 <... futex resumed> ) = 1
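Since strace only shows system calls, a complementary way to see where the Python side of a hung test worker is stuck is the stdlib faulthandler module. This is a generic sketch, not something the MXNet test harness does by itself, and it will only reveal Python frames, not the native engine threads holding the futexes above.

import faulthandler
import signal

# Dump all Python thread stacks when the process receives SIGUSR1 (Unix only);
# trigger it with `kill -USR1 <PID>` once the test appears to hang.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump automatically if the test has been running for more than 10 minutes.
faulthandler.dump_traceback_later(timeout=600, exit=False)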
test_operator.test_norm causes hangs in the Python 3: GPU Win stage:
http://jenkins.mxnet-ci.amazon-ml.com/view/Master%20Jobs/job/incubator-mxnet/job/master/1109/