apache / mxnet

https://mxnet.apache.org

symbol.contrib.cond does not support some built-in operators #16188

Open · igolan opened this issue 5 years ago

igolan commented 5 years ago

Description

The symbol.contrib.cond operator does not support the built-in operators round, floor, and ceil (and probably some more).

Environment info (Required)

----------Python Info----------
Version      : 3.7.4
Compiler     : Clang 10.0.1 (clang-1001.0.46.4)
Build        : ('default', 'Jul  9 2019 18:13:23')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.3
Directory    : /Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/pip-19.0.3-py3.7.egg/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet
Commit Hash   : 75a9e187d00a8b7ebc71412a02ed0e3ae489d91f
Library      : ['/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/libmxnet.so']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✖ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
Platform     : Darwin-18.7.0-x86_64-i386-64bit
system       : Darwin
node         : XXX
release      : 18.7.0
version      : Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0137 sec, LOAD: 0.5112 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0180 sec, LOAD: 0.4525 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0198 sec, LOAD: 0.8612 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0233 sec, LOAD: 0.1894 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0120 sec, LOAD: 0.3173 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0105 sec, LOAD: 0.0961 sec.
----------Environment----------

I'm using Python

Build info (Required if built from source)

N/A

Error Message:

Traceback (most recent call last):
  File "/Users/XX/PycharmProjects/XX/playground.py", line 39, in <module>
    out.backward()
  File "/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2216, in backward
    ctypes.c_void_p(0)))
  File "/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator node_1_backward: Error in operator mlp0__cond0_backward: [12:15:35] src/imperative/cached_op.cc:1322: Check failed: in_attrs->size() == bwd_input_eid.size() (3 vs. 2) : 
Stack trace:
  [bt] (0) 1   libmxnet.so                         0x00000001143d3929 mxnet::op::NDArrayOpProp::~NDArrayOpProp() + 4473
  [bt] (1) 2   libmxnet.so                         0x0000000115939ccf mxnet::CachedOpBackward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&) + 9439
  [bt] (2) 3   libmxnet.so                         0x0000000115ae40e2 void mxnet::op::extract_by_loc<mxnet::NDArray>(std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, mxnet::Tuple<long long>, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> >*) + 3586
  [bt] (3) 4   libmxnet.so                         0x0000000115ae0ad5 MXTVMBridge + 168405
  [bt] (4) 5   libmxnet.so                         0x000000011591166e std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 6094
  [bt] (5) 6   libmxnet.so                         0x000000011591ecb6 std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 60950
  [bt] (6) 7   libmxnet.so                         0x000000011591763a std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 30618
  [bt] (7) 8   libmxnet.so                         0x0000000115929432 mxnet::CachedOp::SetForwardGraph(mxnet::CachedOp::GraphInfo*, bool, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&) + 12402
  [bt] (8) 9   libmxnet.so                         0x00000001159395ab mxnet::CachedOpBackward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&) + 7611

Minimum reproducible example

import mxnet as mx
from mxnet import nd, autograd, gluon

class MLP(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense1 = gluon.nn.Dense(1, in_units=1)

    def hybrid_forward(self, F, x):
        # Not working:
        cond_out = F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.round(x))
        # cond_out = F.contrib.cond(F.ones(1) != F.ones(1), lambda: self.dense1(x), lambda: F.round(x))
        # cond_out = F.contrib.cond(F.ones(1) != F.ones(1), lambda: self.dense1(x), lambda: F.floor(x))
        # cond_out = F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.floor(x))
        # cond_out = F.contrib.cond(F.ones(1) != F.ones(1), lambda: self.dense1(x), lambda: F.ceil(x))
        # cond_out = F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.ceil(x))

        # Working:
        # cond_out = F.contrib.cond(F.ones(1) != F.ones(1), lambda: self.dense1(x), lambda: F.relu(x))
        # cond_out = F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.relu(x))
        # cond_out = F.round(x)
        # cond_out = F.floor(x)
        # cond_out = F.ceil(x)
        # cond_out = F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.sign(x))
        # cond_out = F.contrib.cond(F.ones(1) != F.ones(1), lambda: self.dense1(x), lambda: F.sign(x))

        cond_out = F.broadcast_mul(cond_out, self.dense1(x))
        return cond_out

model_ctx = mx.cpu()
net = MLP()
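# note: hybridize() routes execution through CachedOp (src/imperative/cached_op.cc), where the failing check is triggered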
net.hybridize()
net.collect_params().initialize(mx.init.Constant([1]), ctx=model_ctx)
data = nd.ones((3,1)) * 1.7
with mx.autograd.record():
    out = net(data.as_in_context(model_ctx))
out.backward()
print(net.dense1.weight.grad())
with mx.autograd.record():
    out = net(data.as_in_context(model_ctx))
out.backward()
print(net.dense1.weight.grad())

Steps to reproduce

  1. Run the code above. hybrid_forward contains several variants, with comments marking which ones work and which do not (uncomment lines to try more examples).

What have you tried to solve it?

N/A

Might be related to #12154, #11641, #16182 and #16187. I keep those issues separate because I'm not sure the cause is the same.

junrushao commented 5 years ago

@yzhliu Hey, I saw MXTVMBridge in the stack trace; what is it used for?

igolan commented 5 years ago

Hi, it also happens in MXNet 1.4.1:

Output:

1.4.1
Traceback (most recent call last):
  File "/Users/XX/PycharmProjects/XX/playground.py", line 40, in <module>
    out.backward()
  File "/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 2200, in backward
    ctypes.c_void_p(0)))
  File "/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator node_1_backward: Error in operator mlp0__cond0_backward: [18:04:48] src/imperative/cached_op.cc:1250: Check failed: in_attrs->size() == bwd_input_eid.size() (3 vs. 2) 

Stack trace returned 10 entries:
[bt] (0) 0   libmxnet.so                         0x00000001097f0c90 std::__1::__tree<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::__map_value_compare<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, true>, std::__1::allocator<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*> > >::destroy(std::__1::__tree_node<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, void*>*) + 2736
[bt] (1) 1   libmxnet.so                         0x00000001097f0a3f std::__1::__tree<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::__map_value_compare<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, true>, std::__1::allocator<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*> > >::destroy(std::__1::__tree_node<std::__1::__value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, mxnet::NDArrayFunctionReg*>, void*>*) + 2143
[bt] (2) 2   libmxnet.so                         0x000000010ae74537 mxnet::CachedOpBackward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&) + 9319
[bt] (3) 3   libmxnet.so                         0x000000010b02c302 void mxnet::op::extract_by_loc<mxnet::NDArray>(std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, nnvm::Tuple<long long>, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> >*) + 12514
[bt] (4) 4   libmxnet.so                         0x000000010b0239e5 MXTVMBridge + 174165
[bt] (5) 5   libmxnet.so                         0x000000010ae4b91e std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 5454
[bt] (6) 6   libmxnet.so                         0x000000010ae59556 std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 61830
[bt] (7) 7   libmxnet.so                         0x000000010ae518ea std::__1::__tree<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__map_value_compare<unsigned long, std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::less<unsigned long>, true>, std::__1::allocator<std::__1::__value_type<unsigned long, mxnet::NDArray> > >::erase(std::__1::__tree_const_iterator<std::__1::__value_type<unsigned long, mxnet::NDArray>, std::__1::__tree_node<std::__1::__value_type<unsigned long, mxnet::NDArray>, void*>*, long>) + 29978
[bt] (8) 8   libmxnet.so                         0x000000010ae63462 mxnet::CachedOp::SetForwardGraph(mxnet::CachedOp::GraphInfo*, bool, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&) + 15634
[bt] (9) 9   libmxnet.so                         0x000000010ae73e38 mxnet::CachedOpBackward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::NDArray, std::__1::allocator<mxnet::NDArray> > const&) + 7528

ENV INFO:

----------Python Info----------
Version      : 3.7.4
Compiler     : Clang 10.0.1 (clang-1001.0.46.4)
Build        : ('default', 'Jul  9 2019 18:13:23')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.3
Directory    : /Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/pip-19.0.3-py3.7.egg/pip
----------MXNet Info-----------
Version      : 1.4.1
Directory    : /Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet
Commit Hash   : 1a7199691f5cbc6012bb53eecbf884bed5ae6590
Library      : ['/Users/XX/PycharmProjects/XX/venv/lib/python3.7/site-packages/mxnet/libmxnet.so']
Build features:
No runtime build feature info available
----------System Info----------
Platform     : Darwin-18.7.0-x86_64-i386-64bit
system       : Darwin
node         : XX
release      : 18.7.0
version      : Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0256 sec, LOAD: 0.6952 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0332 sec, LOAD: 0.0961 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0194 sec, LOAD: 0.5617 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0714 sec, LOAD: 0.4392 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0125 sec, LOAD: 0.2901 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0159 sec, LOAD: 0.0748 sec.
----------Environment----------

junrushao commented 5 years ago

We confirm that we can reproduce this bug.

junrushao commented 5 years ago

A minimal example would be:

import mxnet as mx
from mxnet import nd, autograd, gluon

class MLP(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense1 = gluon.nn.Dense(1, in_units=1)

    def hybrid_forward(self, F, x):
        cond_out = F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.round(x))
        return cond_out
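
For completeness, the error can be triggered with the same driver as in the original report (a sketch; per the traceback, the backward call is what fails):

net = MLP()
net.hybridize()
net.collect_params().initialize(mx.init.Constant([1]), ctx=mx.cpu())
data = nd.ones((3, 1)) * 1.7
with mx.autograd.record():
    out = net(data)
out.backward()  # raises MXNetError: Check failed: in_attrs->size() == bwd_input_eid.size() (3 vs. 2)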

junrushao commented 5 years ago

A workaround would be:

import mxnet as mx
from mxnet import nd, autograd, gluon

class MLP(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense1 = gluon.nn.Dense(1, in_units=1)

    def hybrid_forward(self, F, x, zero):
        # `zero` is an ndarray that contains only zero, so `round(x) + relu(zero)` is always `round(x)`
        cond_out = F.contrib.cond(F.ones(1) == F.ones(1), lambda: self.dense1(x), lambda: F.round(x) + F.relu(zero))
        return cond_out
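
The extra zero input is supplied at call time (a usage sketch following the original repro's setup):

net = MLP()
net.hybridize()
net.collect_params().initialize(mx.init.Constant([1]), ctx=mx.cpu())
data = nd.ones((3, 1)) * 1.7
with mx.autograd.record():
    out = net(data, nd.zeros_like(data))
out.backward()  # no error: the relu(zero) term keeps the input alive for the backward pass
print(net.dense1.weight.grad())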

junrushao commented 5 years ago

Just dug into this for a while with @zheng-da.

This bug appears to be caused by an inconsistency between the operator implementation (F.round does not require the original input when doing backward) and the cached op (which contains code that assumes the original input exists). The workaround in the post above forces the input to exist, but this is definitely a bug somewhere unrelated to the control flow operator itself. Could someone help us with the fix?
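
To see the asymmetry the cached op trips over (a minimal sketch, independent of contrib.cond): the gradient of round is zero wherever it is defined, so its backward pass stores no input, while relu's backward pass must inspect the input's sign.

import mxnet as mx
from mxnet import nd, autograd

x = nd.array([1.7])
x.attach_grad()

with autograd.record():
    y = nd.round(x)  # d(round)/dx == 0 almost everywhere; backward needs no saved input
y.backward()
print(x.grad)  # [0.]

with autograd.record():
    y = nd.relu(x)   # d(relu)/dx depends on sign(x); backward must read the input
y.backward()
print(x.grad)  # [1.]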