apache / singa

a distributed deep learning platform
Apache License 2.0

(i) Some bugs in autograd.py and (ii) test_operation.py needs updates #576

Closed chrishkchris closed 4 years ago

chrishkchris commented 4 years ago

Today when I ran singa/test/python/test_operation.py, I got these errors:

ubuntu@ip-172-31-24-48:~/singa/test/python$ python3 test_operation.py
..................................................................E.FF..............................FF.....FF................
======================================================================
ERROR: test_conv2d_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_operation.py", line 216, in test_conv2d_cpu
    y = conv_1(cpu_input_tensor)  # PyTensor
  File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1380, in __call__
    y = conv2d(self.handle, x, self.W, self.b)
  File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1241, in conv2d
    return _Conv2d(handle)(x, W, b)[0]
  File "/home/ubuntu/singa/build/python/singa/autograd.py", line 247, in __call__
    return self._do_forward(*xs)
  File "/home/ubuntu/singa/build/python/singa/autograd.py", line 298, in _do_forward
    ys = self.forward(*xs)
  File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1203, in forward
    return singa.GpuConvForward(x, W, b, self.handle)
TypeError: in method 'GpuConvForward', argument 4 of type 'singa::CudnnConvHandle const &'

======================================================================
FAIL: test_div_broadcast_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_operation.py", line 2616, in test_div_broadcast_cpu
    np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals

Mismatch: 3.33%
Max absolute difference: 3.0517578e-05
Max relative difference: 9.684139e-07
 x: array([[-1.30722e+01,  2.65515e+00, -6.92423e-02, -2.97908e-01,
         6.12429e+00,  3.71461e-01],
       [ 1.33601e+01, -4.65283e+00, -4.74600e-01, -9.15998e-01,...
 y: array([[-1.30722e+01,  2.65515e+00, -6.92423e-02, -2.97908e-01,
         6.12429e+00,  3.71461e-01],
       [ 1.33601e+01, -4.65283e+00, -4.74600e-01, -9.15998e-01,...

======================================================================
FAIL: test_div_broadcast_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_operation.py", line 2584, in test_div_broadcast_gpu
    np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals

Mismatch: 40%
Max absolute difference: 6.1035156e-05
Max relative difference: 3.51512e-07
 x: array([-173.63599,  -30.95938,  139.375  ,   -4.83802,   -2.26971],
      dtype=float32)
 y: array([-173.63605,  -30.95938,  139.37502,   -4.83802,   -2.26971],
      dtype=float32)

======================================================================
FAIL: test_pow_broadcast_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_operation.py", line 2678, in test_pow_broadcast_cpu
    np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals

Mismatch: 40%
Max absolute difference: 6.1035156e-05
Max relative difference: 1.3951524e-07
 x: array([ 169.04495, -238.43016, 1852.8772 ,  437.48016,  -20.75186],
      dtype=float32)
 y: array([ 169.04497, -238.43016, 1852.8772 ,  437.48022,  -20.75186],
      dtype=float32)

======================================================================
FAIL: test_pow_broadcast_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_operation.py", line 2645, in test_pow_broadcast_gpu
    np.testing.assert_array_almost_equal(tensor.to_numpy(result), y, decimal=5)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals

Mismatch: 6.67%
Max absolute difference: 6.1035156e-05
Max relative difference: 8.3724494e-08
 x: array([[[  1.     , 216.     ,  64.     ,  36.     , 343.     ],
        [ 27.     , 125.     , 512.     ,  36.     , 343.     ],
        [  1.     , 343.     ,   1.     ,  81.     , 343.     ],...
 y: array([[[  1., 216.,  64.,  36., 343.],
        [ 27., 125., 512.,  36., 343.],
        [  1., 343.,   1.,  81., 343.],...

======================================================================
FAIL: test_reshape_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_operation.py", line 1455, in test_reshape_cpu
    np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx)), grad, decimal=5)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals

(shapes (2, 3), (3, 2) mismatch)
 x: array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)
 y: array([[1., 1.],
       [1., 1.],
       [1., 1.]], dtype=float32)

======================================================================
FAIL: test_reshape_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_operation.py", line 1475, in test_reshape_gpu
    np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx)), grad, decimal=5)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
    precision=decimal)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals

(shapes (2, 3), (3, 2) mismatch)
 x: array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)
 y: array([[1., 1.],
       [1., 1.],
       [1., 1.]], dtype=float32)

----------------------------------------------------------------------
Ran 125 tests in 0.586s

FAILED (failures=6, errors=1)

chrishkchris commented 4 years ago

The main question is why the gradient returned by the reshape operation has a different shape from the expected one; see test_reshape_cpu and test_reshape_gpu.

The other test failures appear to be due to very small numerical errors (on the order of 1e-5), which can be fixed by reducing the number of decimal places checked in the comparison.
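
To illustrate the point about small numerical errors, here is a minimal sketch using the two arrays printed in the test_div_broadcast_gpu failure above. It assumes only NumPy. The relative-tolerance check with assert_allclose is an alternative I am adding for comparison, not something proposed in this thread.

```python
import numpy as np

# Values copied from the test_div_broadcast_gpu failure output above.
x = np.array([-173.63599, -30.95938, 139.375, -4.83802, -2.26971],
             dtype=np.float32)
y = np.array([-173.63605, -30.95938, 139.37502, -4.83802, -2.26971],
             dtype=np.float32)

# Express the absolute difference in units of float32 spacing (ULPs).
abs_diff = np.abs(x - y)
ulps = abs_diff / np.spacing(np.abs(y))
print("max abs diff:", abs_diff.max())
print("max diff in float32 ULPs:", ulps.max())

# decimal=5 requires |x - y| < 1.5e-5, which is about one ULP at this magnitude.
try:
    np.testing.assert_array_almost_equal(x, y, decimal=5)
    strict_passes = True
except AssertionError:
    strict_passes = False
print("decimal=5 passes:", strict_passes)

# Option from this thread: compare fewer decimal places.
np.testing.assert_array_almost_equal(x, y, decimal=4)

# Alternative (my assumption, not from the thread): a relative tolerance,
# which scales with the magnitude of the values being compared.
np.testing.assert_allclose(x, y, rtol=1e-5)
```

The differences are only a few float32 rounding steps for values of this size, so an absolute check to 5 decimals is stricter than float32 can reliably satisfy here.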

dcslin commented 4 years ago

Should we make the Travis build fail when the Python unit tests raise errors?

chrishkchris commented 4 years ago

> Should we make the Travis build fail when the Python unit tests raise errors?

In my opinion, this would be a very good feature. However, I am not sure whether the machines Travis uses to run the tests have a GPU. Either way, test_operation.py is still important because it lets developers check whether the system has any problems after their commits.
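
As a rough sketch of how this could work, the snippet below uses only the standard unittest module: GPU tests are skipped when no GPU is available, and the result is turned into an exit code that a CI job could use to fail the build. The SINGA_TEST_GPU environment variable and the ExampleOperationTest class are hypothetical names for illustration; they are not part of SINGA or of the existing Travis configuration.

```python
import os
import unittest

# Hypothetical switch: a CI job would set SINGA_TEST_GPU=1 only on GPU machines.
HAS_GPU = os.environ.get("SINGA_TEST_GPU") == "1"


class ExampleOperationTest(unittest.TestCase):
    """Placeholder tests standing in for the real cases in test_operation.py."""

    def test_add_cpu(self):
        self.assertEqual(1 + 1, 2)

    @unittest.skipUnless(HAS_GPU, "no GPU available on this machine")
    def test_add_gpu(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(ExampleOperationTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


def exit_code(result):
    """Return 0 if every test passed or was skipped, 1 otherwise."""
    return 0 if result.wasSuccessful() else 1


result = run_suite()
print("exit code:", exit_code(result))
```

A CI script would pass this value to sys.exit so that any failure or error marks the build as failed, while skipped GPU tests would not.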

chrishkchris commented 4 years ago

If I am correct, the reshape failure is caused by an error in backward, which reshapes the gradient to the output shape instead of the input shape:

class Reshape(Operation):

    def __init__(self,shape):
        super(Reshape, self).__init__()
        if isinstance(shape, tensor.Tensor):
            self.shape = np.asarray(tensor.to_numpy(shape).astype(np.int32)).tolist()
        else:
            self.shape = list(shape)

    def forward(self, x):
        _shape = x.shape()
        shape = self.shape
        # handle the shape with 0
        shape = [_shape[i] if i < len(_shape) and shape[i] == 0 else shape[i] for i in range(len(shape))]
        # handle the shape with -1
        hidden_shape = int(np.prod(_shape) // np.abs(np.prod(shape)))
        self.cache=[s if s != -1 else hidden_shape for s in shape]
        return singa.Reshape(x, self.cache)

    def backward(self, dy):
        # bug: self.cache is the output shape, but dx must have the input shape
        return singa.Reshape(dy, self.cache)

I think the class should be changed to:

class Reshape(Operation):
    def __init__(self,shape):
        super(Reshape, self).__init__()
        if isinstance(shape, tensor.Tensor):
            self.shape = np.asarray(tensor.to_numpy(shape).astype(np.int32)).tolist()
        else:
            self.shape = list(shape)

    def forward(self, x):
        # keep the input shape so that backward can restore it
        self._shape = x.shape()
        shape = self.shape
        # handle the shape with 0
        shape = [self._shape[i] if i < len(self._shape) and shape[i] == 0 else shape[i] for i in range(len(shape))]
        # handle the shape with -1
        hidden_shape = int(np.prod(self._shape) // np.abs(np.prod(shape)))
        self.cache=[s if s != -1 else hidden_shape for s in shape]

        return singa.Reshape(x, self.cache)

    def backward(self, dy):
        return singa.Reshape(dy, self._shape)
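
The same idea can be checked without SINGA. The sketch below is a NumPy stand-in that mirrors the shape logic above (NumpyReshape is a hypothetical class, not the SINGA API), using the (3, 2) input and (2, 3) output shapes from the failing tests.

```python
import numpy as np


class NumpyReshape:
    """NumPy stand-in for the Reshape operation (not the SINGA API)."""

    def __init__(self, shape):
        self.shape = list(shape)

    def forward(self, x):
        # Remember the input shape so backward can restore it.
        self.in_shape = x.shape
        # A 0 in the target shape copies the corresponding input dimension.
        shape = [self.in_shape[i] if i < len(self.in_shape) and s == 0 else s
                 for i, s in enumerate(self.shape)]
        # A -1 in the target shape is inferred from the remaining elements.
        hidden = int(np.prod(self.in_shape) // np.abs(np.prod(shape)))
        self.out_shape = [s if s != -1 else hidden for s in shape]
        return x.reshape(self.out_shape)

    def backward(self, dy):
        # The gradient w.r.t. x must have the shape of x, not of y.
        return dy.reshape(self.in_shape)


x = np.arange(6, dtype=np.float32).reshape(3, 2)
op = NumpyReshape([2, -1])
y = op.forward(x)
dx = op.backward(np.ones_like(y))
print(y.shape, dx.shape)
```

With the original backward, dx would come back with the output shape (2, 3), which is the shape mismatch reported by test_reshape_cpu and test_reshape_gpu.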

chrishkchris commented 4 years ago

To resolve the problem completely, I opened a hotfix in PR #579.

chrishkchris commented 4 years ago

The problem is now completely resolved.