apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[Numpy] Gradient of np.prod is wrong when there are zeros in the array #18078

Open ghost opened 4 years ago

ghost commented 4 years ago

Description

I investigated the MXNet C++ code and noticed that the gradient of the input array [[x, y]] is computed as [[(x*y)/x, (x*y)/y]]. Therefore, if y is zero, we always get nan as the second element of the gradient. Is this a bug?

Error Message

No error message

To Reproduce

This is a python program I run:

import mxnet as mx
sym = mx.symbol.prod(data=mx.symbol.Variable("in"), axis=(1))
# first dict: input values; second dict (args_grad): array that receives the computed gradient
exec = sym.bind(mx.cpu(0), {"in": mx.nd.array([[4, 0]])}, {"in": mx.nd.array([[9, 9]])})
out = exec.forward()[0]
exec.backward(mx.nd.ones(out.shape))
print('-- Gradient:')
print(exec.grad_dict['in'])

I get this output:

-- Gradient:
[[ 0. nan]]
<NDArray 1x2 @cpu(0)>
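
Note that the analytic gradient of prod([[x, y]]) is [[y, x]], so for the input [[4, 0]] I would expect [[0., 4.]] rather than [[0., nan]].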

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python


----------Python Info----------
Version      : 3.7.7
Compiler     : Clang 11.0.0 (clang-1100.0.33.17)
Build        : ('default', 'Mar 10 2020 15:43:03')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /usr/local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /Users/zibi22/MXNet-1.6.0/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Darwin-18.7.0-x86_64-i386-64bit
system       : Darwin
node         : zibi22maclap
release      : 18.7.0
version      : Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0756 sec, LOAD: 0.9306 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0006 sec, LOAD: 0.8133 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0008 sec, LOAD: 0.3825 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0006 sec, LOAD: 0.1266 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0008 sec, LOAD: 0.1243 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0009 sec, LOAD: 0.3131 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0731 sec, LOAD: 0.9831 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.00061798095703125 sec.
sxjscience commented 4 years ago

@zleyk22 Yes, it's a bug, and it also exists in the numpy interface (mx.np):

import mxnet as mx
mx.npx.set_np()
a = mx.np.array([1,0])
a.attach_grad()
with mx.autograd.record():
    b = mx.np.prod(a)
    b.backward()
print(a.grad)

Output:

[ 0. nan]
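
(For the input [1, 0] the analytic gradient is [0., 1.], so this reproduces the same bug through the mx.np interface.)
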
ghost commented 4 years ago

Xingjian,

If it is a bug, then possibly somebody is already trying to fix it. Do you know such a person? I would like to contact them.

But if nobody is working on this bug, then I volunteer to fix it, although as I see it now it may not be an easy task.

Zbigniew


sxjscience commented 4 years ago

@zleyk22 I think currently no one is looking at the issue. Would you try to solve it? Thanks for pointing out the problem!

ghost commented 4 years ago

Yes, I will try to fix it.


sxjscience commented 4 years ago

@zleyk22 I can think of two possible ways to solve this problem: 1) Use two cumsums
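
A minimal NumPy sketch of what such a two-pass, zero-safe gradient could look like (purely illustrative, built from exclusive prefix/suffix products; the helper name prod_grad is made up and this is not the actual MXNet kernel):

import numpy as np

def prod_grad(x):
    # d(prod(x))/dx[i] is the product of all elements except x[i].
    # Building it from an exclusive prefix product and an exclusive suffix
    # product avoids dividing prod(x) by x[i], so zeros stay finite.
    left = np.concatenate(([1.0], np.cumprod(x[:-1])))               # prod(x[:i])
    right = np.concatenate((np.cumprod(x[::-1][:-1])[::-1], [1.0]))  # prod(x[i+1:])
    return left * right

print(prod_grad(np.array([4.0, 0.0])))  # [0. 4.] instead of [0. nan]
print(prod_grad(np.array([1.0, 0.0])))  # [0. 1.]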

yzhliu commented 4 years ago

@zleyk22 are you working on this?

ghost commented 4 years ago

Yes, and no. I have been looking into this code, but there are many parts of it that are not clear to me, so I am not sure how to fix it in a reasonable way without introducing more bugs. I also investigated how this code came about: from what I saw, it was written by Eric Junyuan Xie at the end of 2016. I had hoped that going back in time would show me how the code was written and turn up some comments and explanations, but nothing suggested why the author chose that way of implementing it. So for now I am not going to fix it; I prefer to wait a while and figure out what to do next. Maybe it would be advisable to talk to Eric Junyuan Xie to learn more about his code?


matteosal commented 4 years ago

@yzhliu @zleyk22 and I spent some time investigating this and managed to track it down to this line. OP::Map(data[i], DType(out[out_idx])) computes DType(out[out_idx]) / data[i], which returns nan when data[i] is zero (our input is mx.nd.array([[4, 0]])).

The logic behind this computation is that, as long as no zeros are involved, dividing x*y*z by e.g. x gives you y*z, which is the desired gradient. But this trick doesn't work when zeros are involved: in that case, the code should just recompute the entire product, leaving x out.
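
To make the difference concrete, here is a small NumPy sketch of the two strategies (purely illustrative; the variable names are ours and this is not the actual kernel code):

import numpy as np

x = np.array([4.0, 0.0])
out = np.prod(x)

# Division trick used by the current kernel: reuse the stored output.
# Breaks down as soon as an element is zero (0/0 -> nan).
grad_div = out / x                                                       # [0., nan]

# Leave-one-out recomputation: re-multiply everything except element i.
# Always finite, at the cost of extra work per element.
grad_loo = np.array([np.prod(np.delete(x, i)) for i in range(x.size)])  # [0., 4.]

print(grad_div, grad_loo)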

But it seems that this infrastructure is quite rigid and only allows reusing the previously computed output and a single input element at a time. So fixing this bug would apparently require changes to a system that is shared by more than one operator. Given our limited familiarity with the codebase, we don't think we can proceed further without some guidance and help.

Can you suggest a way forward? Thanks!