apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Weird gluon hybridize bug #14228

Open yifeim opened 5 years ago

yifeim commented 5 years ago

Description

On certain network/input configurations, the hybridized model fails with a cryptic error: Check failed: g.GetAttr<size_t>("storage_type_num_unknown_nodes") == 0U (1 vs. 0). However, this message does not seem to describe the true cause in any failure case I can conceive of.

Environment info (Required)

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash   : 19c501680183237d52a862e6ae1dc4ddc296305b
----------System Info----------
Platform     : Linux-4.14.77-70.82.amzn1.x86_64-x86_64-with-glibc2.9
system       : Linux
node         : ip-172-16-72-155
release      : 4.14.77-70.82.amzn1.x86_64
version      : #1 SMP Mon Dec 3 20:01:27 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0021 sec, LOAD: 0.5902 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0663 sec, LOAD: 0.8082 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0570 sec, LOAD: 0.4872 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0089 sec, LOAD: 0.0834 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0171 sec, LOAD: 0.5700 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0084 sec, LOAD: 0.0627 sec.

Package used (Python/R/Scala/Julia): Python

Minimum reproducible example

import mxnet as mx

class Net(mx.gluon.HybridBlock):
    def __init__(self):
        super(Net, self).__init__()
        self.dense = mx.gluon.nn.Dense(5)

    def hybrid_forward(self, F, x, y):
        # note: y is returned untouched, without passing through any operator
        return self.dense(x), y

class Net2(mx.gluon.HybridBlock):
    def __init__(self):
        super(Net2, self).__init__()
        self.encoder = mx.gluon.nn.Embedding(5, 5)
        self.core = Net()
        self.dense = mx.gluon.nn.Dense(5)

    def hybrid_forward(self, F, x, y):
        x = self.encoder(x)
        x,y = self.core(x,y)
        x = self.dense(x)
        return x, y

net = Net2()
net.initialize()
net.hybridize()
a = mx.nd.ones((5,5))
b = mx.nd.ones((5,5))
with mx.autograd.record():
    c, d = net(a, b)
c.backward()
mx.nd.waitall()
print('pass')

Steps to reproduce

  1. run the above script
  2. observe error MXNetError: Error in operator node_5_backward: [01:30:17] src/imperative/./imperative_utils.h:684: Check failed: g.GetAttr<size_t>("storage_type_num_unknown_nodes") == 0U (1 vs. 0)

What have you tried to solve it?

  1. Wrap y with F.identity(y) on the inner network (see the sketch after this list).
  2. Nested networks seem more prone to this error.
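
For concreteness, here is a minimal sketch of workaround 1 applied to the repro above; the only change from the original script is that Net.hybrid_forward passes y through F.identity instead of returning it untouched, which reportedly avoids the storage-type check failure in this case.

import mxnet as mx

class Net(mx.gluon.HybridBlock):
    def __init__(self):
        super(Net, self).__init__()
        self.dense = mx.gluon.nn.Dense(5)

    def hybrid_forward(self, F, x, y):
        # workaround: give y an operator in the graph instead of returning it untouched
        return self.dense(x), F.identity(y)

class Net2(mx.gluon.HybridBlock):
    def __init__(self):
        super(Net2, self).__init__()
        self.encoder = mx.gluon.nn.Embedding(5, 5)
        self.core = Net()
        self.dense = mx.gluon.nn.Dense(5)

    def hybrid_forward(self, F, x, y):
        x = self.encoder(x)
        x, y = self.core(x, y)
        x = self.dense(x)
        return x, y

net = Net2()
net.initialize()
net.hybridize()
a = mx.nd.ones((5, 5))
b = mx.nd.ones((5, 5))
with mx.autograd.record():
    c, d = net(a, b)
c.backward()
mx.nd.waitall()
print('pass')
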
mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Gluon, Bug

yifeim commented 5 years ago

@eric-haibin-lin helped identify the likely underlying cause of the otherwise obscure error message.

frankfliu commented 5 years ago

@mxnet-label-bot add [Bug, Gluon]

ifeherva commented 5 years ago

@eric-haibin-lin I ran into the same issue; what was the solution in your case?

eric-haibin-lin commented 5 years ago

@ifeherva could you share your failing test case? "Wrap y with F.identity(y) on the inner network." -> this seems to work for @yifeim's case

chinakook commented 5 years ago

Wrapping y with F.identity(y) on the outer network also works. I think the input tensor y cannot take part in backward because it does not pass through any operator.
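
A sketch of that outer-network variant, assuming the Net and Net2 definitions from the original repro; only Net2.hybrid_forward changes:

class Net2(mx.gluon.HybridBlock):
    # __init__ as in the original repro
    def hybrid_forward(self, F, x, y):
        x = self.encoder(x)
        x, y = self.core(x, y)
        x = self.dense(x)
        # workaround applied in the outer block instead of the inner one
        return x, F.identity(y)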

chinakook commented 5 years ago

I found another problem, like the following:

import mxnet as mx

class Net(mx.gluon.HybridBlock):
    def __init__(self):
        super(Net, self).__init__()
    def hybrid_forward(self, F, x):
        x = F.relu(x)
        return x

net = Net()
net.initialize()
net.hybridize()
a = mx.nd.ones((5,5))
with mx.autograd.record():
    c = net(a)
c.backward()
mx.nd.waitall()
print('pass')

Error information:

line 16, in <module>
    c.backward()
  File "C:\ProgramData\Anaconda3\lib\site-packages\mxnet\ndarray\ndarray.py", line 2192, in backward
    ctypes.c_void_p(0)))
  File "C:\ProgramData\Anaconda3\lib\site-packages\mxnet\base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [21:29:18] C:\Jenkins\workspace\mxnet-tag\mxnet\src\imperative\imperative.cc:285: Check failed: !AGInfo::IsNone(*i) Cannot differentiate node because it is not in a computational graph. You need to set is_recording to true or use autograd.record() to save computational graphs for backward. If you want to differentiate the same graph twice, you need to pass retain_graph=True to backward.

eric-haibin-lin commented 5 years ago

@chinakook for this I think you need to do a.attach_grad() before autograd?
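
For reference, a sketch of that fix applied to the example above: attaching a gradient buffer to the input before recording lets backward find it in the computational graph.

import mxnet as mx

class Net(mx.gluon.HybridBlock):
    def __init__(self):
        super(Net, self).__init__()
    def hybrid_forward(self, F, x):
        x = F.relu(x)
        return x

net = Net()
net.initialize()
net.hybridize()
a = mx.nd.ones((5, 5))
a.attach_grad()  # allocate gradient storage for the input before recording
with mx.autograd.record():
    c = net(a)
c.backward()
mx.nd.waitall()
print('pass')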

chinakook commented 5 years ago

@eric-haibin-lin Yeah.

anirudhacharya commented 5 years ago

can this issue be closed?

yifeim commented 5 years ago

No, these two problems seem unrelated. In fact, I have two points of confusion now: (1) The initial post had a different error message:

MXNetError: Error in operator node_5_backward: [01:30:17] src/imperative/./imperative_utils.h:684: Check failed: g.GetAttr<size_t>("storage_type_num_unknown_nodes") == 0U (1 vs. 0)

(2) The workaround did not trigger the second error, even though the variable y was still not in the computational graph.

piyushghai commented 5 years ago

@yifeim There seems to be a workaround: wrapping the inner network's y in F.identity(). Are you still facing the error? If not, can this issue be closed?

@mxnet-label-bot Update [pending requester info]