apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Flaky test: test_mkldnn.test_activation #12377

Closed lebeg closed 6 years ago

lebeg commented 6 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1529/pipeline

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 297, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 293, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-2, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 2.232502 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 0, 0), a=0.445562, b=0.693506
 NUMERICAL_data: array([[[0.44556212, 0.2619341 , 0.77837706],
        [0.        , 0.8214429 , 0.5259812 ],
        [0.        , 0.        , 0.        ]],...
 BACKWARD_data: array([[[0.693506  , 0.26193386, 0.77837765],
        [0.        , 0.8214439 , 0.52598166],
        [0.        , 0.        , 0.        ]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1284728931 to reproduce.
--------------------- >> end captured logging << ---------------------
pengzhao-intel commented 6 years ago

@luobao-intel, please take a look into the cause.

luobao-intel commented 6 years ago

This test validates the MKL-DNN activation calculation by checking its gradient against a numerical gradient computed in the manner of theano.gradient.numeric_grad. However, the theano-style numerical gradient is not correct when an input element is close to zero, so flaky failures occur whenever the random input vector contains extremely small positive numbers. The experiments below illustrate this.

experiment 1:

input data :[[1, 2], [3, 0.0001]]

location: {'data': <RowSparseNDArray 2x2 @cpu(0)>, '__random_proj': [[0.3546685 0.8954062 ] [0.40476447 0.7724642 ]] <NDArray 2x2 @cpu(0)>}

gradient calculation referring to theano : [[0.35466552 0.8954048 ] [0.40476322 0.39395675]]

mkldnn : [[0.3546685 0.8954062 ] [0.40476447 0.7724642 ]]

experiment 2:

input data :[[1, -2], [-4, 0.0005]]

location: {'data': <RowSparseNDArray 2x2 @cpu(0)>, '__random_proj': [[0.3546685 0.8954062 ] [0.40476447 0.7724642 ]] <NDArray 2x2 @cpu(0)>}

gradient calculation referring to theano : [[0.35466552 0. ] [0. 0.4248553 ]]

mkldnn : [[0.3546685 0. ] [0. 0.7724642]]

analysis

The derivative of the ReLU function is 0 for x < 0 and 1 for x > 0.

Therefore, in the check_numeric_gradient function, each element of the executor's gradient should equal the corresponding element of the random projection in location when the input element is positive, and should be 0 otherwise. The theano-style numerical gradient is clearly wrong when an input element is close to zero.
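
For illustration, here is a minimal NumPy sketch (plain NumPy, not the MXNet test harness) of the central difference used by the checker. With eps = 1e-2 and an input element of 1e-4, the two probe points straddle zero, so the numerical slope comes out as roughly 0.51 instead of the true ReLU gradient of 1; scaled by the projection weight 0.7724642 from the location above, this is consistent with the 0.39395675 reported in experiment 1.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x, eps = 1e-4, 1e-2                                      # tiny positive input, large step
numeric = (relu(x + eps / 2) - relu(x - eps / 2)) / eps  # central difference
print(numeric)               # ~0.51: the left probe point falls below zero
print(numeric * 0.7724642)   # ~0.3939567, consistent with experiment 1 above
```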

pengzhao-intel commented 6 years ago

The reference checker applies the finite difference method, but the eps here is too large for the float datatype. In @luobao-intel's case the input data is on the order of 1e-5, so with this eps the numerical gradient cannot be computed correctly. I suggest changing eps to 1e-6. @luobao-intel will file the PR soon.

https://github.com/apache/incubator-mxnet/blob/e2a3eef349cb6643c08a7840d8cbd43b38fedfd5/python/mxnet/test_utils.py#L716
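
To see why a smaller eps was proposed, the same sketch with eps = 1e-6 keeps both probe points on the positive side of the tiny input, so the central difference recovers the correct slope (again a plain NumPy illustration in float64, not the MXNet checker itself):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x, eps = 1e-4, 1e-6                                       # tiny positive input, small step
print((relu(x + eps / 2) - relu(x - eps / 2)) / eps)      # ~1.0, the correct ReLU gradient
```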

lebeg commented 6 years ago

It is failing again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1563/pipeline

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.153736 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 2, 1, 1), a=0.119209, b=0.146338
 NUMERICAL_data: array([[[[0.32782555, 0.52154064],
         [0.32782555, 0.        ]],
...
 BACKWARD_data: array([[[[0.31696534, 0.53385574],
         [0.3415597 , 0.        ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
--------------------- >> end captured logging << ---------------------
luobao-intel commented 6 years ago

Sorry, I can't reproduce this failure with the same random seed MXNET_TEST_SEED=304218922. In my trial, test_activation passes. The experiment is shown below:

experiment

command

export MXNET_TEST_SEED=304218922
python /usr/bin/nosetests tests/python/mkl/test_mkldnn.py:test_activation

log

[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=304218922 to reproduce.
[22:51:22] src/operator/tensor/././../../common/utils.h:450: Storage type fallback detected:
operator = Activation
input storage types = [row_sparse, ]
output storage types = [default, ]
params = {"act_type" : relu, }
context.dev_mask = cpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.

Ran 1 test in 0.023s

OK

anirudhacharya commented 6 years ago

Failing again - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-12391/runs/9/nodes/951/log/?start=0

======================================================================
FAIL: test_mkldnn.test_activation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 298, in test_activation
    check_activation_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 294, in check_activation_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-6, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 912, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.184596 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(0, 0, 1, 0), a=0.715256, b=0.882672
 NUMERICAL_data: array([[[[0.        , 0.        ],
         [0.71525574, 0.        ]],
...
 BACKWARD_data: array([[[[0.        , 0.        ],
         [0.8826717 , 0.        ]],
...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1731055743 to reproduce.
--------------------- >> end captured logging << ---------------------
luobao-intel commented 6 years ago

Sorry about that. In the previous situation, the failure occurred when an element of the input data was close to zero; the overly large difference step (eps) was to blame, so we reduced its value. In the current situation, however, the failure is caused by an eps that is too small for the large elements of the input data: the smaller eps is, the more calculation steps are required, and with large input data every step can introduce a small error, so the cumulative error may eventually exceed the tolerance. A suitable eps therefore has to be picked.

Ultimately, these problems are caused by the inaccurate baseline calculation that follows the theano gradient. We are trying to rewrite the test case with a different approach. I suggest disabling the flaky test for the time being.

lebeg commented 6 years ago

PR to disable the test again: https://github.com/apache/incubator-mxnet/pull/12516

azai91 commented 6 years ago

Made a PR that addresses just this test (also ran it 10000 times with different seeds): https://github.com/apache/incubator-mxnet/pull/12560.

In regards to @luobao-intel's comment, this is not due to the inputs being too large. The activation is linear above 0, so this is not a lack of approximation; in fact we should be able to get an exact solution. The reason the change causes an error is that with a very small eps, the outputs f(x + eps/2) and f(x - eps/2) do not have enough precision.

The formula is:

grad = (f(x + eps/2) - f(x - eps/2)) / eps

Since eps was 1e-6, the gradient calculation relies on capturing differences in the numerator that are smaller than 1e-6.
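
A minimal float32 sketch of that precision argument (plain NumPy, with a hypothetical input value, not the MXNet checker itself): with eps = 1e-6, the numerator f(x + eps/2) - f(x - eps/2) can only change in steps of one float32 ULP of x, which for x near 1 is about 1.2e-7, so the computed slope is quantized in steps of roughly 0.12.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

eps = np.float32(1e-6)
x = np.float32(1.2)                                   # hypothetical activation input

numerator = relu(x + eps / 2) - relu(x - eps / 2)     # everything stays in float32
print(numerator / eps)        # ~0.95 instead of the exact slope 1.0
print(np.spacing(x) / eps)    # ~0.12: one ULP of x is already 12% of eps
```

Notably, the NUMERICAL values in the failures above (0.119209, and 0.71525574 ≈ 6 × 0.119209) look like exact multiples of that 2^-23 / 1e-6 ≈ 0.1192 step, which is consistent with this quantization effect.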

azai91 commented 6 years ago

tldr: you should never use an eps smaller than 1e-5, as there is not enough precision in the numerator (f(x + eps/2) - f(x - eps/2)) to derive an accurate slope.

lebeg commented 6 years ago

Has been fixed with https://github.com/apache/incubator-mxnet/pull/12418

lebeg commented 6 years ago

Sorry, probably this is the fix: https://github.com/apache/incubator-mxnet/pull/12560