leezu opened this issue 6 years ago
Thanks for opening this issue @leezu. @sandeep-krishnamurthy could you add the labels "Operator", "bug", and "cuda" to this?
@leezu The fix is merged; please close the issue when you feel comfortable doing so. Thanks!
Thanks @haojin2. I have been forcing the use of AddTakeGrad for the past month and can confirm that it solves the issue. Now that everyone has agreed to make this the default behavior, the issue can be closed.
In fact, the correct way to solve the problem would be to use the mshadow version or to fix the bug in the CUDA code. Simply removing the implementation causes a performance regression: https://github.com/apache/incubator-mxnet/issues/16001
Thanks for the effort.
Is it fine to set MXNET_FORCE_ADDTAKEGRAD=0 before the run?
@mahmoodn I'm afraid not. The fix simply removed the use of AddTakeGradLargeBatch.
Looking at https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/indexing_op.h#L537, I see
```cpp
if (req[embedding::kWeight] == kWriteTo || req[embedding::kWeight] == kAddTo) {
  if (req[embedding::kWeight] == kWriteTo) {
    grad_in = scalar<DType>(0.0f);
  }
  AddTakeGrad(grad_in, data, grad_out);
} else {
  LOG(FATAL) << "wrong req";
}
```
which has been changed since the first post in this topic.
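For context, AddTakeGrad implements the dense embedding backward: each row of grad_out is accumulated into the row of grad_in selected by the corresponding index in data. A minimal pure-Python sketch of that semantics (names mirror the C++ snippet above; this is an illustration, not the MXNet implementation):

```python
def add_take_grad(grad_in, data, grad_out):
    """Accumulate each grad_out row into the grad_in row picked by data.

    grad_in:  list of rows (the embedding-weight gradient), modified in place
    data:     list of integer indices, one per grad_out row
    grad_out: list of rows (the output gradient)
    """
    for idx, row in zip(data, grad_out):
        for j, v in enumerate(row):
            grad_in[idx][j] += v
    return grad_in

# Two lookups of index 1 accumulate into the same weight row.
grad_in = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(add_take_grad(grad_in, [1, 1, 2], [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]))
# -> [[0.0, 0.0], [4.0, 6.0], [0.5, 0.5]]
```

The repeated-index accumulation is exactly what a parallel GPU kernel has to get right, which is why the "large batch" variant was the part that broke.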
Description
The AddTakeGradLargeBatchCaller operator, called during the backward pass of Embedding, is broken and results in nan at random positions in the gradient array.
Environment info (Required)
While it occurs only rarely with CUDA 9.0 on a p2.xlarge, it occurs almost always with CUDA 9.2 on a p3.2xlarge.
Minimum reproducible example
Steps to reproduce
Run the above script with CUDA 9.2 and observe very frequent nan values:
Run the above script with CUDA 9.0 and observe (infrequent) nan values:
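The repro script itself is not preserved in this excerpt, but the observation step amounts to scanning the embedding-weight gradient for nan entries after each backward pass. A hypothetical pure-Python sketch of that check (grad stands in for the flattened gradient values):

```python
import math

def nan_positions(grad):
    """Return the indices of nan entries in a flat list of gradient values."""
    return [i for i, v in enumerate(grad) if math.isnan(v)]

# A healthy gradient reports no positions; a corrupted one pinpoints them.
print(nan_positions([0.1, float("nan"), -2.5, float("nan")]))  # -> [1, 3]
```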
What have you tried to solve it?
If MXNET_FORCE_ADDTAKEGRAD is set, EmbeddingOpBackward will always use AddTakeGrad, independently of the gradient input and output shapes.
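With this patch applied, the workaround is to set the variable in the environment before the run. A sketch of doing so from Python (when exactly MXNet reads the variable is an implementation detail, so setting it before the MXNet import is the safe choice):

```python
import os

# Set before importing mxnet so the flag is visible when the backend reads it.
# Per the patch description the variable merely needs to be set.
os.environ["MXNET_FORCE_ADDTAKEGRAD"] = "1"
```

Equivalently, `MXNET_FORCE_ADDTAKEGRAD=1 python train.py` on the command line has the same effect.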
```diff
 src/operator/tensor/indexing_op.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/operator/tensor/indexing_op.h b/src/operator/tensor/indexing_op.h
index 87381960e..d3a1bdfd6 100644
--- a/src/operator/tensor/indexing_op.h
+++ b/src/operator/tensor/indexing_op.h
@@ -598,7 +598,11 @@ void EmbeddingOpBackward(const nnvm::NodeAttrs& attrs,
   uint64_t shape_out_prod = static_cast<uint64_t>(grad_out.shape_[0]) *
                             static_cast<uint64_t>(grad_out.shape_[1]);
   if (!default_addtakegrad ||
       (shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384)) {
     AddTakeGrad(grad_in, data, grad_out);
   } else {
     AddTakeGradLargeBatchCaller(ctx, grad_in, data, grad_out);
--
2.17.1
```