Open igolan opened 5 years ago
@yzhliu Hey, I saw MXTVMBridge in the stack trace. What is that used for?
Hi, it also happens in MXNet 1.4.1:
Traceback (most recent call last):
File "/Users/XX/PycharmProjects/XX/playground.py", line 40, in <module>
out.backward()
File "/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2200, in backward
ctypes.c_void_p(0)))
File "/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator node_1_backward: Error in operator mlp0__cond0_backward: [18:04:48] src/imperative/cached_op.cc:1250: Check failed: in_attrs->size() == bwd_input_eid.size() (3 vs. 2)
Stack trace returned 10 entries:
[bt] (0) 0 libmxnet.so 0x00000001097f0c90 std::__1::__tree<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::__map_value_compare<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, true>, std::__1::allocator<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*> > >::destroy(std::__1::__tree_node<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, void*>*) + 2736
[bt] (1) 1 libmxnet.so 0x00000001097f0a3f std::__1::__tree<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::__map_value_compare<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, true>, std::__1::allocator<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*> > >::destroy(std::__1::__tree_node<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, void*>*) + 2143
[bt] (2) 2 libmxnet.so 0x000000010ae74537 mxnet::CachedOpBackward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&) + 9319
[bt] (3) 3 libmxnet.so 0x000000010b02c302 void mxnet::op::extract_by_loc<mxnet::NDArray>(std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, nnvm::Tuple<long long>, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> >*) + 12514
[bt] (4) 4 libmxnet.so 0x000000010b0239e5 MXTVMBridge + 174165
[bt] (5) 5 libmxnet.so 0x000000010ae4b91e std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 5454
[bt] (6) 6 libmxnet.so 0x000000010ae59556 std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 61830
[bt] (7) 7 libmxnet.so 0x000000010ae518ea std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 29978
[bt] (8) 8 libmxnet.so 0x000000010ae63462 mxnet::CachedOp::SetForwardGraph(mxnet::CachedOp::GraphInfo*, bool, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&) + 15634
[bt] (9) 9 libmxnet.so 0x000000010ae73e38 mxnet::CachedOpBackward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&) + 7528
----------Python Info----------
Version : 3.7.4
Compiler : Clang 10.0.1 (clang-1001.0.46.4)
Build : ('default', 'Jul 9 2019 18:13:23')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.0.3
Directory : /Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/pip-19.0.3-py3.7.egg/pip
----------MXNet Info-----------
Version : 1.4.1
Directory : /Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet
Commit Hash : 1a7199691f5cbc6012bb53eecbf884bed5ae6590
Library : ['/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/libmxnet.so']
Build features:
No runtime build feature info available
----------System Info----------
Platform : Darwin-18.7.0-x86_64-i386-64bit
system : Darwin
node : XX
release : 18.7.0
version : Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64
----------Hardware Info----------
machine : x86_64
processor : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0256 sec, LOAD: 0.6952 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0332 sec, LOAD: 0.0961 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0194 sec, LOAD: 0.5617 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0714 sec, LOAD: 0.4392 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0125 sec, LOAD: 0.2901 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0159 sec, LOAD: 0.0748 sec.
----------Environment----------
We confirm that we can reproduce this bug.
A minimal example would be:
import mxnet as mx
from mxnet import nd, autograd, gluon

class MLP(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense1 = gluon.nn.Dense(1, in_units=1)

    def hybrid_forward(self, F, x):
        return F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.round(x))

# Running backward through the cond subgraph reproduces the error:
net = MLP()
net.initialize()
with autograd.record():
    out = net(nd.ones((1, 1)))
out.backward()  # raises MXNetError: Check failed: in_attrs->size() == bwd_input_eid.size() (3 vs. 2)
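For reference, F.contrib.cond(pred, then_func, else_func) selects between two branch subgraphs based on a predicate. A minimal imperative sketch of that semantics in plain Python (this is not the MXNet symbolic implementation, just an illustration of the contract):

```python
def cond(pred, then_func, else_func):
    # Imperative sketch of contrib.cond's semantics: evaluate the scalar
    # predicate, then run exactly one of the two branch functions and
    # return its output.
    return then_func() if pred else else_func()

print(cond(True, lambda: "then-branch", lambda: "else-branch"))   # then-branch
print(cond(False, lambda: "then-branch", lambda: "else-branch"))  # else-branch
```

In the symbolic version both branches are captured as subgraphs up front, which is why an operator inside an untaken-looking branch (here F.round) still participates in graph construction and backward bookkeeping.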
A workaround would be:
import mxnet as mx
from mxnet import nd, autograd, gluon

class MLP(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense1 = gluon.nn.Dense(1, in_units=1)

    def hybrid_forward(self, F, x, zero):
        # `zero` is an ndarray containing only zeros, so `round(x) + relu(zero)`
        # is always `round(x)`, but relu forces its input to be kept for backward.
        return F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.round(x) + F.relu(zero))

# The extra `zero` input is passed alongside x:
net = MLP()
net.initialize()
with autograd.record():
    out = net(nd.ones((1, 1)), nd.zeros(1))
out.backward()  # completes without error
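Our reading of why appending F.relu(zero) helps (an intuition, not confirmed against the MXNet source): round's gradient is zero almost everywhere, so its backward never reads the original input, whereas relu's backward must inspect its input to decide where the gradient passes through. A minimal Python sketch of that difference:

```python
# Hypothetical sketch (not MXNet internals): why round's backward needs no
# saved input while relu's backward does.

def round_backward(grad_out, saved_input=None):
    # d(round(x))/dx == 0 almost everywhere, so the original input is
    # never needed; backward can run without saving x.
    return 0.0 * grad_out

def relu_backward(grad_out, saved_input):
    # d(relu(x))/dx is 1 for x > 0 and 0 otherwise, so backward must
    # inspect the original input x.
    return grad_out if saved_input > 0 else 0.0

print(round_backward(1.0))        # 0.0 -- no input required
print(relu_backward(1.0, 2.5))    # 1.0 -- input was required
```

Adding an operator whose backward genuinely depends on its input makes the branch keep that input around, which appears to satisfy the cached op's bookkeeping.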
Just dug into this for a while with @zheng-da.
This bug seems to be caused by an inconsistency between the operator implementation (F.round does not require its original input when computing backward) and the cached op (which contains code that assumes the original input exists). The workaround in the post above forces the input to be required, but this is definitely a bug somewhere outside the control flow operator itself. Could someone help us with the fix?
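The failing check in cached_op.cc compares the number of backward inputs the caller supplies against the number of entries the backward graph actually recorded. A rough Python illustration of that mismatch (this is not the actual C++ code, just a model of the check):

```python
def check_backward_inputs(in_attrs, bwd_input_eid):
    # Mirrors the failing assertion in cached_op.cc: the caller passes an
    # attribute for every nominal backward input, while bwd_input_eid only
    # lists the entries the backward graph actually reads.
    if len(in_attrs) != len(bwd_input_eid):
        raise RuntimeError(
            "Check failed: in_attrs->size() == bwd_input_eid.size() "
            f"({len(in_attrs)} vs. {len(bwd_input_eid)})")

# Nominal backward inputs: head gradient, forward input, forward output.
in_attrs = ["grad_out", "x", "y"]
# round's backward never reads x, so only two entries were recorded.
bwd_input_eid = [0, 2]
try:
    check_backward_inputs(in_attrs, bwd_input_eid)
except RuntimeError as e:
    print(e)  # reproduces the "(3 vs. 2)" shape of the reported error
```

Under this reading, any operator whose backward drops a dependency on its forward input (round, floor, ceil) would trip the same check inside a cond subgraph.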
Description
The symbol.contrib.cond operator does not support the built-in operators round, floor and ceil (and probably some more).
Environment info (Required)
I'm using Python.
Build info (Required if built from source)
N/A
Error Message:
Minimum reproducible example
Steps to reproduce
What have you tried to solve it?
N/A
Might be related to #12154, #11641, #16182 and #16187. I keep those issues separate because I'm not sure the cause is the same.