apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Embedding Backward (AddTakeGradLargeBatchCaller) non-deterministic nan values #11314

Open leezu opened 6 years ago

leezu commented 6 years ago

Description

The AddTakeGradLargeBatchCaller operator, which is called during the backward pass of Embedding, is broken and produces nan values at random positions in the gradient array.
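
For context, the expected result is easy to state: since the repro below uses loss = emb_in.sum(), row r of the weight gradient is just the number of times index r occurs in idx, so every entry is finite by construction and any nan has to come from the backward kernel. A rough NumPy sketch of that reference gradient (illustrative only, not part of the original report):

import numpy as np

# Reference gradient for the repro script below (sketch only):
# with loss = emb_in.sum(), grad_weight[r, :] equals the occurrence count
# of index r in idx, broadcast across the 300 embedding columns.
N, dim = 50000, 300
np.random.seed(1)
idx = np.random.randint(0, N, size=(1024, 160))
counts = np.bincount(idx.ravel(), minlength=N).astype('float32')
expected_grad = np.broadcast_to(counts[:, None], (N, dim))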

Environment info (Required)

While it occurs only rarely with CUDA 9.0 on a p2.xlarge, it almost always occurs with CUDA 9.2 on a p3.2xlarge.

Minimal reproducible example

import mxnet as mx
import numpy as np

N = 50000
ctx = mx.gpu()

embedding = mx.gluon.nn.Embedding(N, 300)
embedding.initialize(ctx=ctx)
i = 0
np.random.seed(1)
idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)

got_nan = False
while True:
    i += 1
    with mx.autograd.record():
        emb_in = embedding(idx)
        loss = emb_in.sum()
    loss.backward()

    grad = embedding.weight.grad().asnumpy()
    if not np.all(np.isfinite(grad)):
        nan_rows, nan_cols = np.where(~np.isfinite(grad))
        print(f'Got nan {i}\tRetrying with same data. '
              f'(Affected indices: {nan_rows.tolist()}, {nan_cols.tolist()}).')
        got_nan = True
    else:
        if got_nan:  # We got nan before and it disappeared now
            print(f'nan disappeared in {i}..')
            break

    if i % 100 == 0:
        print(f'{i}')

Steps to reproduce

Run the above script with CUDA 9.2 and observe very frequent nan values:

% python debug_embedding_nan.py
Got nan 3       Retrying with same data. (Affected indices: [14721, 14721], [1, 2]).
Got nan 4       Retrying with same data. (Affected indices: [20, 20, 39, 39, 18232, 18232], [257, 258, 1, 2, 1, 2]).
Got nan 5       Retrying with same data. (Affected indices: [20, 20, 71, 33346, 38015], [257, 258, 258, 130, 130]).
Got nan 6       Retrying with same data. (Affected indices: [20, 20], [257, 258]).
nan disappeared in 7..
% python debug_embedding_nan.py 
Got nan 7       Retrying with same data. (Affected indices: [20, 20, 33, 71, 71, 71, 71, 71], [257, 258, 1, 1, 2, 129, 130, 258]).
nan disappeared in 8..
% python debug_embedding_nan.py
Got nan 1       Retrying with same data. (Affected indices: [1489], [129]).
Got nan 2       Retrying with same data. (Affected indices: [42581, 42581], [257, 258]).
nan disappeared in 3..

Run the above script with CUDA 9.0 and observe (infrequent) nan values:

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
Got nan 1461    Retrying with same data. (Affected indices: [3254], [2]).
nan disappeared in 1462..

What have you tried to solve it?

1. Apply the following patch, which adds an `MXNET_FORCE_ADDTAKEGRAD` environment variable to disable `AddTakeGradLargeBatchCaller`:

    From 3fd91f0078e70cf990ce1549081c03cfb50292ad Mon Sep 17 00:00:00 2001
    From: Leonard Lausen <leonard@lausen.nl>
    Date: Fri, 15 Jun 2018 18:45:39 +0000
    Subject: [PATCH] MXNET_FORCE_ADDTAKEGRAD to disable
    AddTakeGradLargeBatchCaller

If MXNET_FORCE_ADDTAKEGRAD is set, EmbeddingOpBackward will always use AddTakeGrad, independently of the gradient input and output shapes.

 src/operator/tensor/indexing_op.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/operator/tensor/indexing_op.h b/src/operator/tensor/indexing_op.h
index 87381960e..d3a1bdfd6 100644
--- a/src/operator/tensor/indexing_op.h
+++ b/src/operator/tensor/indexing_op.h
@@ -598,7 +598,11 @@ void EmbeddingOpBackward(const nnvm::NodeAttrs& attrs,
         uint64_t shape_out_prod =
           static_cast<uint64_t>(grad_out.shape_[0]) *
           static_cast<uint64_t>(grad_out.shape_[1]);



2. Run the above script with `MXNET_FORCE_ADDTAKEGRAD=1` exported (a small Python wrapper for this step is sketched after this list).
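
For completeness, step 2 can also be driven from Python rather than the shell; a minimal sketch (debug_embedding_nan.py is the repro script above, and a subprocess is used so the variable is guaranteed to be set before MXNet is loaded):

import os
import subprocess

# Run the repro with the workaround enabled (assumes the patch above has been
# applied and MXNet rebuilt). The subprocess ensures MXNET_FORCE_ADDTAKEGRAD
# is visible to libmxnet from the start of the process.
env = dict(os.environ, MXNET_FORCE_ADDTAKEGRAD='1')
subprocess.run(['python', 'debug_embedding_nan.py'], env=env, check=True)
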
kalyc commented 6 years ago

Thanks for opening this issue @leezu. @sandeep-krishnamurthy could you add the labels "Operator", "bug", and "cuda" to this?

haojin2 commented 6 years ago

@leezu The fix is merged; please close the issue when you feel comfortable doing so. Thanks!

leezu commented 6 years ago

Thanks @haojin2. I have been forcing the use of AddTakeGrad for the past month and can confirm that it solves the issue. Now that everyone has agreed to make this the default behavior, the issue can be closed.

sxjscience commented 5 years ago

In fact, the correct way to solve the problem would be either to use the mshadow version or to fix the bug in the CUDA code.

sxjscience commented 5 years ago

Simply removing the implementation causes a performance regression: https://github.com/apache/incubator-mxnet/issues/16001
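
To make the regression concrete, a micro-benchmark along these lines can be used to compare builds before and after the change; this is only a hedged sketch (sizes copied from the repro above, absolute numbers will vary by GPU and MXNet build):

import time
import numpy as np
import mxnet as mx

# Illustrative Embedding forward+backward micro-benchmark (sketch only).
N, dim, ctx = 50000, 300, mx.gpu()
embedding = mx.gluon.nn.Embedding(N, dim)
embedding.initialize(ctx=ctx)
idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)

def step():
    with mx.autograd.record():
        loss = embedding(idx).sum()
    loss.backward()

step()           # warm-up
mx.nd.waitall()  # flush MXNet's asynchronous engine before timing
start = time.time()
for _ in range(100):
    step()
mx.nd.waitall()
print('avg forward+backward: %.3f ms' % ((time.time() - start) * 1000 / 100))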

mahmoodn commented 5 years ago

Thanks for the effort. Is it fine to set MXNET_FORCE_ADDTAKEGRAD=0 before the run?

sxjscience commented 5 years ago

@mahmoodn I’m afraid not. The fix just removed the usage of AddTakeGradLargeBatch.

mahmoodn commented 5 years ago

Looking at https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/indexing_op.h#L537, I see

      if (req[embedding::kWeight] == kWriteTo || req[embedding::kWeight] == kAddTo) {
        if (req[embedding::kWeight] == kWriteTo) {
          grad_in = scalar<DType>(0.0f);
        }
        AddTakeGrad(grad_in, data, grad_out);
      } else {
        LOG(FATAL) << "wrong req";
      }

which has been changed since the first post in this topic.
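
For reference, the semantics that both AddTakeGrad and the removed AddTakeGradLargeBatchCaller are meant to implement boil down to a scatter-add of the output gradient into the weight rows selected by the indices; a rough NumPy sketch of that contract (not MXNet's actual kernel):

import numpy as np

def add_take_grad(grad_in, data, grad_out):
    """Rough NumPy equivalent of the AddTakeGrad contract: accumulate each
    row of grad_out into the grad_in row selected by the corresponding index.
    np.add.at is used because it is an unbuffered, duplicate-safe scatter-add."""
    flat_idx = data.astype('int64').ravel()
    flat_out = grad_out.reshape(flat_idx.size, -1)
    np.add.at(grad_in, flat_idx, flat_out)
    return grad_in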