apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

l2_normalization for fp16 got 0.0 when data is very large #11938

Open TccccD opened 6 years ago

TccccD commented 6 years ago

If the data is set as follows: in_data0 = mx.nd.random.uniform(-5, 5, (512, 100000), ctx=mx.gpu(0)), then mx.symbol.L2Normalization returns 0.0 in both the forward and the backward pass. If the data is drawn from (-1, 1) instead, the result is correct. mx.symbol.norm also handles the (-5, 5) data correctly.

Test code as follows:

import numpy as np
import mxnet as mx
from mxnet.test_utils import default_context, assert_almost_equal, check_numeric_gradient
import time

in_data0 = mx.nd.random.uniform(-5, 5, (512, 100000), ctx=mx.gpu(0))

def check_l2_normalizationFP16(mode, dtype, norm_eps=1e-10, isfp16=False):
    ctx = mx.gpu(0)
    data = mx.symbol.Variable('data', dtype=dtype)
    out = mx.symbol.L2Normalization(data=data, mode=mode, eps=norm_eps)
    out = mx.sym.make_loss(out)
    in_data = in_data0.astype(dtype)
    a = time.time()
    exe = out.simple_bind(ctx=ctx, data=in_data.shape, dtype=dtype)
    output = exe.forward(is_train=True, data=in_data)
    exe.backward()
    symbolic_grads = exe.grad_dict['data'].asnumpy()
    print('forw---', in_data.dtype, output[0].dtype, '--' + mode, '--')
    print('grad---', in_data.dtype, symbolic_grads[0].dtype, '--' + mode, '--', 100 * (time.time() - a))
    return output, symbolic_grads
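
A hypothetical invocation of the helper above, using the op's 'instance' mode (the default for L2Normalization); the dtype argument is the point of the comparison:

# fp16 run: the returned output and gradients are all zeros for the (-5, 5) data,
# while the fp32 run stays finite.
out16, grad16 = check_l2_normalizationFP16('instance', np.float16, isfp16=True)
out32, grad32 = check_l2_normalizationFP16('instance', np.float32)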

TccccD commented 6 years ago

@haojin2 @piiswrong @leezu @anirudh2290 @szha, please check this issue and, if possible, help fix L2Normalization. Thanks!

leezu commented 6 years ago

@TccccD the reason is that mx.symbol.norm uses a numerically stable algorithm to compute the 2-norm (https://github.com/apache/incubator-mxnet/pull/11573), whereas L2Normalization is prone to under- or overflow. L2Normalization should be fixed to use the same implementation as the norm op.

Below is a shorter example of the problem:


In [15]: a = mx.nd.random.uniform(-5, 5, (512,100000), ctx=mx.gpu(0), dtype='float16')

In [16]: mx.nd.L2Normalization(a)
Out[16]:

[[ 0. -0.  0. ...,  0. -0.  0.]
 [-0.  0. -0. ...,  0.  0.  0.]
 [ 0. -0. -0. ...,  0.  0. -0.]
 ...,
 [-0.  0. -0. ...,  0.  0. -0.]
 [ 0. -0.  0. ...,  0. -0. -0.]
 [-0. -0.  0. ..., -0. -0. -0.]]
<NDArray 512x100000 @gpu(0)>

In [17]: a / mx.nd.norm(a, axis=1, keepdims=True)
Out[17]:

[[  2.19726562e-03  -3.61824036e-03   1.11007690e-03 ...,   3.14950943e-04
   -4.92572784e-04   3.10516357e-03]
 [ -4.07028198e-03   4.61578369e-03  -4.51278687e-03 ...,   2.33650208e-03
    5.40542603e-03   3.78608704e-03]
 [  5.27572632e-03  -1.81293488e-03  -1.17683411e-03 ...,   1.86920166e-03
    4.87518311e-03  -3.04412842e-03]
 ...,
 [ -4.39834595e-03   3.74794006e-04  -4.21905518e-03 ...,   1.11007690e-03
    3.81278992e-03  -3.80134583e-03]
 [  7.90953636e-05  -5.31387329e-03   4.95910645e-03 ...,   3.52859497e-03
   -2.10952759e-03  -4.76837158e-04]
 [ -4.53186035e-03  -3.03459167e-03   2.37083435e-03 ...,  -3.93295288e-03
   -4.21524048e-03  -5.36727905e-03]]
<NDArray 512x100000 @gpu(0)>
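
For reference on why the zeros appear: each row accumulates about 100000 squares with mean roughly 25/3, i.e. a sum near 8e5, far above float16's maximum of about 65504. The naive norm therefore overflows to inf, and dividing by inf yields the zeros above. A minimal NumPy sketch of this behaviour (an illustration only, not the op's actual code):

import numpy as np

a = np.random.uniform(-5, 5, (512, 100000)).astype(np.float16)

# Naive sum of squares accumulated in float16 overflows to inf per row.
naive_norm = np.sqrt(np.sum(a * a, axis=1, keepdims=True, dtype=np.float16))
print(naive_norm[:3].ravel())     # [inf inf inf]
print((a / naive_norm)[0, :3])    # [0. 0. 0.] up to sign, as in Out[16]

# Scaling by the per-row max magnitude first keeps everything finite
# (the idea behind the stable reduction in PR #11573; the accumulation
# here is done in float32 only to keep the sketch short).
scale = np.abs(a).max(axis=1, keepdims=True)
stable_norm = scale * np.sqrt(np.sum((a / scale).astype(np.float32) ** 2,
                                     axis=1, keepdims=True))
print((a / stable_norm)[0, :3])   # small finite values, as in Out[17]
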
TccccD commented 6 years ago

I defined a new square op like this:

MXNET_BINARY_MATH_OP(square_v, math::sqr(a) / math::sqr(b));

In l2_normalization_op-inl.h I need to find a suitable scale; I think the maximum value in in_data would work. But I don't know how to find the maximum value of a Tensor, e.g. Tensor<xpu, 2, DType> data. Could you help me? Thanks! @haojin2 @piiswrong @leezu @anirudh2290 @szha

leezu commented 6 years ago

@TccccD your contribution to fix the L2Normalization op would be very welcome. Instead of trying to find a suitable scale a priori (e.g. by looking for the max element), we could also use the scaled sum of squares algorithm added in https://github.com/apache/incubator-mxnet/pull/11573/files#diff-c8275a550b65b889051bd88c27d1e1b7R880. I'm not sure we can easily use the Reducer interface in legacy ops, though.
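
For reference, a minimal Python sketch of the scaled sum of squares idea (in the style of BLAS nrm2 / LAPACK lassq); it illustrates the algorithm only and is not MXNet's Reducer code:

import numpy as np

def scaled_l2_norm(x):
    """One-pass scaled sum of squares: every accumulated squared term is
    kept at most 1, which avoids the premature overflow of the naive sum."""
    scale, ssq = 0.0, 1.0
    for v in x:
        av = abs(float(v))
        if av == 0.0:
            continue
        if av > scale:
            # New largest element: rescale the accumulated sum of squares.
            ssq = 1.0 + ssq * (scale / av) ** 2
            scale = av
        else:
            ssq += (av / scale) ** 2
    return scale * np.sqrt(ssq)

x = np.random.uniform(-5, 5, 100000).astype(np.float16)
print(np.sqrt(np.sum(x * x, dtype=np.float16)))   # inf: naive fp16 reduction overflows
print(scaled_l2_norm(x))                          # finite
print(np.linalg.norm(x.astype(np.float32)))       # reference value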

TccccD commented 6 years ago

I tried it, but it feels difficult; it would require changing a lot of the code. @leezu

leezu commented 6 years ago

Ok. In general, there is a plan to refactor the L2Normalization op completely to improve the exposed interface, but I believe no one is working on it yet. If/when that is done, making use of the stable Reducer interface would be very easy.

Roshrini commented 6 years ago

@anirudh2290 Can you please add labels to this issue: Operator, FeatureRequest