samskalicky opened this issue 5 years ago
During graph partitioning, some nodes are added to subgraphs, thus potentially changing the order of the DFS traversal.
Previously, I thought graph partitioning with the default backend wouldn't change the input order. Can you explain more about the input order change? For example, which op in op_names causes this, or which kind of op connection causes this?
CC @reminisce as he may know the background here.
@mxnet-label-bot add [Bug]
@ZhennanQin I'm not sure why we think the input order should not change after partitioning. It makes sense to me that since we're changing the graph, it's possible the order may change. Maybe @reminisce can explain the rationale behind the assumption that it should not change. All I've been able to figure out is that the input order does change, hence the error message above about the shape mismatch.
@samskalicky The order should be preserved because that's how I implemented it. If it's changed in your case, there might be a bug. Can you draw a diagram to show which inputs' order is changed?
@reminisce can you explain how you implemented it in such a way that the order is preserved?
There is a failing example with sources in the description; I would really appreciate your help debugging this, given your expertise and familiarity with this code.
Here is an example showing the problem:
Assume the initial traversal order is:
A -> J
A -> J -> G
A -> J -> G -> D
A -> J -> G -> D -> E
A -> J -> G -> D -> F
A -> J -> G -> H
A -> J -> K
A -> B
A -> B -> C
And then assume that B, D, and G are included in the subgraph.
Then the post-partitioning traversal would be:
A -> J
A -> J -> S
A -> J -> S -> E
A -> J -> S -> F
A -> J -> S -> H
A -> J -> S -> C
A -> J -> K
Thus the order would change from: [E, F, H, K, C] to: [E, F, H, C, K]
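For illustration, here is a small, self-contained Python sketch (not MXNet code; the toy graph and DFS are purely illustrative) that reproduces the order change from the diagram above:

```python
# Toy DFS that collects graph inputs (leaf nodes) in visitation order,
# showing how merging B, D, G into a single subgraph node S changes the
# order in which the inputs are reached.

def collect_inputs(graph, root):
    """Return leaf nodes in DFS preorder, visiting each node once."""
    visited, inputs = set(), []

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        children = graph.get(node, [])
        if not children:          # no children -> graph input
            inputs.append(node)
        for child in children:
            dfs(child)

    dfs(root)
    return inputs

# Original graph (edges follow the traversal in the example above).
original = {
    "A": ["J", "B"],
    "J": ["G", "K"],
    "G": ["D", "H"],
    "D": ["E", "F"],
    "B": ["C"],
}

# After partitioning: B, D, G are collapsed into subgraph node S, whose
# inputs are the external inputs of the merged nodes.
partitioned = {
    "A": ["J", "S"],
    "J": ["S", "K"],
    "S": ["E", "F", "H", "C"],
}

print(collect_inputs(original, "A"))     # ['E', 'F', 'H', 'K', 'C']
print(collect_inputs(partitioned, "A"))  # ['E', 'F', 'H', 'C', 'K']
```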
@samskalicky Thanks for the diagram. This is indeed a case that breaks the order of inputs, and it's not handled in the implementation. We may need the solution in your PR to solve this.
So it's a common case across all backends that the input order may change after partitioning, and we should handle this in the overall partitioning flow rather than in a particular backend. @samskalicky Thanks for finding and fixing this.
I believe this issue is fixed in master, below is my result:
[09:17:25] src/executor/graph_executor.cc:1936: Subgraph backend default is activated.
[09:17:25] src/executor/graph_executor.cc:1735: SubgraphPropertyOpNameSet for subgraph property default has been assigned a value. Please make sure it is initialized only for the testing purpose.
[
[[[ 0.]
[ 0.]
[ 0.]
...
[-1.]
[-1.]
[-1.]]]
<NDArray 1x80000x1 @cpu(0)>,
[[[ 0.05506234]
[ 0.05506234]
[ 0.05506234]
...
[-1. ]
[-1. ]
[-1. ]]]
<NDArray 1x80000x1 @cpu(0)>,
[[[ -9.554436 224. 236.60925 224. ]
[ -9.554436 224. 236.60925 224. ]
[ -9.554436 224. 236.60925 224. ]
...
[ -1. -1. -1. -1. ]
[ -1. -1. -1. -1. ]
[ -1. -1. -1. -1. ]]]
<NDArray 1x80000x4 @cpu(0)>]
The result is the same with os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default' commented out.
Description
The input names of a symbol are produced by a DFS traversal of the symbol's graph from the outputs back up to the inputs. During graph partitioning, some nodes are added to subgraphs, thus potentially changing the order of the DFS traversal. After graph partitioning, shape propagation occurs, and the inferred shapes for the inputs are returned in the order that they appear in a DFS traversal.
However, when graph partitioning happens and the DFS traversal order changes, the inferred shapes may be returned in a different order than expected. Since the original symbol is not modified, the caller is expecting the shapes in the same order as the original symbol.
Since DFS order is not guaranteed to be identical before and after partitioning, we need to map the names-to-shapes and ensure that the shapes are returned in the original order.
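As a rough sketch of that idea (the function and argument names below are hypothetical, not the actual executor internals), the inferred shapes can be keyed by argument name on the partitioned graph and then re-emitted in the original symbol's argument order:

```python
# Minimal sketch: key the inferred shapes by argument name on the
# partitioned graph, then return them in the argument order of the
# original, unpartitioned symbol that the caller holds.

def reorder_arg_shapes(orig_sym, partitioned_sym, partitioned_arg_shapes):
    # Shapes come back in the partitioned graph's DFS argument order.
    name_to_shape = dict(zip(partitioned_sym.list_arguments(),
                             partitioned_arg_shapes))
    # Return them in the order the caller expects: the original symbol's order.
    return [name_to_shape[name] for name in orig_sym.list_arguments()]
```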
Environment info (Required)
The error occurs on every release, and is reproducible on the master branch. I have built from source using the master branch and reproduced the problem.
Error Message:
Minimum reproducible example
This problem occurs on a few models; the one that I can share is the faster-rcnn model from the GluonCV package. Here is how to get the model:
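(The original snippet was not captured in this thread; the following is a sketch of one common way to obtain and export the model with GluonCV. The exact model name, epoch, and dummy input shape are assumptions.)

```python
import mxnet as mx
from gluoncv import model_zoo

# Download a pretrained Faster R-CNN model from the GluonCV model zoo.
net = model_zoo.get_model('faster_rcnn_resnet50_v1b_coco', pretrained=True)
net.hybridize()

# Run one forward pass so the cached graph exists, then export
# faster_rcnn-symbol.json / faster_rcnn-0000.params.
_ = net(mx.nd.zeros((1, 3, 600, 800)))
net.export('faster_rcnn', epoch=0)
```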
Once the model is exported, here is the code to reproduce the error using CPU context:
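(Again, the original repro code is not captured here; the sketch below shows the general pattern, assuming the exported prefix is faster_rcnn, the input name is data, and the input shape is 1x3x600x800.)

```python
import os
# Must be set before binding, since partitioning runs when the executor is created.
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'default'

import mxnet as mx

# Load the exported symbol and parameters.
sym, arg_params, aux_params = mx.model.load_checkpoint('faster_rcnn', 0)
mod = mx.mod.Module(symbol=sym, data_names=['data'], label_names=None,
                    context=mx.cpu())

# The shape-mismatch error in the original report surfaces around bind,
# because the inferred shapes come back in the partitioned graph's DFS order.
mod.bind(data_shapes=[('data', (1, 3, 600, 800))], for_training=False)
mod.set_params(arg_params, aux_params)

mod.forward(mx.io.DataBatch(data=[mx.nd.zeros((1, 3, 600, 800))]))
print(mod.get_outputs())
```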
What have you tried to solve it?
I've tested a fix in a private branch: https://github.com/samskalicky/incubator-mxnet/commit/517d29498059d081873d1bd160d95479a5c8cea9