@luobao-intel please take a look into the reason.
This test validates the activation gradient calculation in mkldnn by checking it against theano.gradient.numeric_grad. However, the reference gradient calculation adapted from theano is not correct when the input is close to zero, so flaky failures occur whenever the random input vector contains extremely small positive numbers. The experiment is as follows.
input data: [[1, 2], [3, 0.0001]]
location: {'data': <RowSparseNDArray 2x2 @cpu(0)>, '__random_proj': [[0.3546685 0.8954062 ] [0.40476447 0.7724642 ]] <NDArray 2x2 @cpu(0)>}
gradient calculation referring to theano: [[0.35466552 0.8954048 ] [0.40476322 0.39395675]]
mkldnn: [[0.3546685 0.8954062 ] [0.40476447 0.7724642 ]]

input data: [[1, -2], [-4, 0.0005]]
location: {'data': <RowSparseNDArray 2x2 @cpu(0)>, '__random_proj': [[0.3546685 0.8954062 ] [0.40476447 0.7724642 ]] <NDArray 2x2 @cpu(0)>}
gradient calculation referring to theano: [[0.35466552 0. ] [0. 0.4248553 ]]
mkldnn: [[0.3546685 0. ] [0. 0.7724642]]
The derivative of the ReLU function is simple: 0 for x < 0 and 1 for x > 0.
Therefore, in the check_numeric_gradient function, each element of the executor's gradient should equal the corresponding __random_proj element from location when the input element is positive, and be 0 otherwise. The theano-based numeric gradient is clearly wrong when the corresponding input element is close to zero.
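As a sanity check, here is a minimal numpy sketch (not the MXNet test itself; it assumes the projected loss sum(relu(x) * proj) that check_numeric_gradient effectively differentiates) of the element-wise rule above, using the values from the second experiment:

```python
import numpy as np

# Input and projection copied from the second experiment above.
x = np.array([[1.0, -2.0], [-4.0, 0.0005]], dtype=np.float32)
proj = np.array([[0.3546685, 0.8954062],
                 [0.40476447, 0.7724642]], dtype=np.float32)

# d/dx sum(relu(x) * proj) = proj where x > 0, and 0 elsewhere.
expected_grad = np.where(x > 0, proj, np.float32(0))
print(expected_grad)
# [[0.3546685 0.       ]
#  [0.        0.7724642]]   <- matches the mkldnn result, not the theano one
```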
The reference checker uses the finite-difference method, but the eps is too large for the float datatype here. In @luobao-intel's case the input data is on the order of 1e-5, so the finite difference cannot be evaluated correctly.
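A rough numpy illustration of that failure mode (the step value is assumed here purely for illustration; only the ratio of eps to the input matters): when the input x is positive but smaller than eps/2, the left evaluation point x - eps/2 crosses into the flat region of ReLU, so the central difference reports a slope between 0 and 1 instead of exactly 1.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

x = 1e-4          # small positive input element, as in the first experiment
proj = 0.7724642  # its projection weight from __random_proj
eps = 1e-3        # assumed step, large compared to x

# Central difference straddles zero: relu(x - eps/2) is already clipped to 0.
numeric = proj * (relu(x + eps / 2) - relu(x - eps / 2)) / eps
print(numeric)    # ~0.46 instead of the correct value proj = 0.7724642
```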
I suggest changing eps to 1e-6. @luobao-intel will file the PR soon.
Is failing again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1563/pipeline
======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
check_activation_training(stype)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.153736 exceeds tolerance rtol=0.160000, atol=0.000100. Location of maximum error:(0, 2, 1, 1), a=0.119209, b=0.146338
NUMERICAL_data: array([[[[0.32782555, 0.52154064],
[0.32782555, 0. ]],
...
BACKWARD_data: array([[[[0.31696534, 0.53385574],
[0.3415597 , 0. ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
--------------------- >> end captured logging << ---------------------
Sorry, I can't reproduce the same result with the same random seed MXNET_TEST_SEED=304218922. In my trial, test_activation passes. The experiment is shown below:
export MXNET_TEST_SEED=304218922
python /usr/bin/nosetests tests/python/mkl/test_mkldnn.py:test_activation
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce. [22:51:22] src/operator/tensor/././../../common/utils.h:450: Storage type fallback detected: operator = Activation input storage types = [row_sparse, ] output storage types = [default, ] params = {"act_type" : relu, } context.dev_mask = cpu The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
Ran 1 test in 0.023s
OK
Failing again - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-12391/runs/9/nodes/951/log/?start=0
======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
orig_test(*args, **kwargs)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
check_activation_training(stype)
File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.184596 exceeds tolerance rtol=0.160000, atol=0.000100. Location of maximum error:(0, 0, 1, 0), a=0.715256, b=0.882672
NUMERICAL_data: array([[[[0. , 0. ],
[0.71525574, 0. ]],
...
BACKWARD_data: array([[[[0. , 0. ],
[0.8826717 , 0. ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1731055743 to reproduce.
--------------------- >> end captured logging << ---------------------
Sorry for that. In the previous situation, we hit inputs with elements close to zero and found that the excessively large difference step eps was to blame, so we turned its value down. In the current situation, however, the test failure is caused by an eps that is too small for large input elements: the smaller the eps, the more calculation steps are required, and for large inputs each step can introduce a small error, so the cumulative error may eventually exceed the tolerance. A suitable eps therefore has to be picked.
After all, these problems are caused by the inaccurate baseline calculation adapted from the theano gradient. We are trying to rewrite the test case with a different approach. I suggest disabling the flaky test for the time being.
PR to disable the test again: https://github.com/apache/incubator-mxnet/pull/12516
Made a PR that addresses just this test (ran it 10000 times with different seeds as well): https://github.com/apache/incubator-mxnet/pull/12560.
In regards to @luobao-intel's explanation: this is not due to the inputs being too large. The activation is linear above 0, so this is not a lack-of-approximation problem; in fact we should be able to get an exact slope. The reason the change causes an error is that with a very small eps the outputs f(x + eps/2) and f(x - eps/2) do not have enough precision.
The formula is
grad = (f(x + eps/2) - f(x - eps/2)) / eps.
Since eps was 1e-6, the gradient is computed from a difference that must be resolved below 1e-6.
tl;dr: never use anything less than 1e-5, as there is not enough precision in the numerator (f(x + eps/2) - f(x - eps/2)) to derive an accurate slope.
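A small numpy sketch of that precision argument (values are only illustrative; float32 has roughly 7 significant decimal digits, so for an input around 2.5 the perturbation eps/2 = 5e-7 can barely be represented):

```python
import numpy as np

def relu(v):
    return np.maximum(v, np.float32(0))

x = np.float32(2.5)  # any input well above zero; the exact slope is 1.0

for eps in (np.float32(1e-6), np.float32(1e-5), np.float32(1e-4)):
    # x +/- eps/2 is rounded to the nearest representable float32, so the
    # numerator keeps only a few significant bits when eps is tiny.
    numeric = (relu(x + eps / 2) - relu(x - eps / 2)) / eps
    print(eps, numeric)
# eps = 1e-6 gives a slope visibly off from 1.0; the larger steps stay close.
```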
Has been fixed with https://github.com/apache/incubator-mxnet/pull/12418
Sorry, probably this is the fix: https://github.com/apache/incubator-mxnet/pull/12560
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1529/pipeline