Backward shape inconsistent with custom HybridBlock and gluon.loss #8836

Closed: MoritzMaxeiner closed this issue 6 years ago

MoritzMaxeiner commented 6 years ago


I've written a custom Gluon HybridBlock, used its output for a Gluon loss, and then tried to call loss.backward(). This works well when the block isn't hybridized, but after calling .hybridize() I get a backward shape inconsistency error.

Environment info (Required)

Package used (Python/R/Scala/Julia): I'm using Python.

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc 7.2.0

MXNet commit hash: c1846ce2ca5003cf613c8bdcf5b0c89d8e0b0d67

Build config:

Error Message:

[18:46:34] src/executor/infer_graph_attr_pass.cc:212: Check failed: (rshape[eid]) == (rshape[idx.entry_id(fnode.inputs[i])]) Backward shape inconsistent with the forward shape
fish: “python test.py” terminated by signal SIGABRT (Abort)

Minimum reproducible example

# test.py
import mxnet as mx

class Test(mx.gluon.HybridBlock):
    def __init__(self, hidden_unit_size, seq_length, feature_size, output_size, weight_initializer=None, **kwargs):
        super(Test, self).__init__(**kwargs)
        self.seq_length = seq_length
        self.feature_size = feature_size
        self.hidden_unit_size = hidden_unit_size
        self.output_size = output_size

        self.num_cells = 2
        with self.name_scope():
            self.cell_a = mx.gluon.rnn.GRUCell(self.hidden_unit_size, input_size=feature_size)
            self.cell_b = mx.gluon.rnn.GRUCell(self.hidden_unit_size, input_size=hidden_unit_size)

    def hybrid_forward(self, F, inputs, states):
        prev_hidden_states = states[0]
        if F is mx.symbol:
            prev_hidden_states = F.split(prev_hidden_states, axis=0, num_outputs=self.num_cells, squeeze_axis=1)

        cell_a_outputs, _ = self.cell_a.unroll(self.seq_length, inputs, [prev_hidden_states[0]])

        cell_b_inputs = [prev_hidden_states[0]] + cell_a_outputs[:-1]
        cell_b_outputs, _ = self.cell_b.unroll(self.seq_length, cell_b_inputs, [prev_hidden_states[1]])

        a_outputs = F.concat(*[F.reshape(a, shape=(0,1,self.feature_size, self.output_size)) for a in cell_a_outputs], dim=1)
        b_outputs = cell_b_outputs[0]

        return a_outputs, b_outputs

    def state_info(self, batch_size=0):
        return [{'shape': (self.num_cells, batch_size, self.hidden_unit_size), '__layout__': 'LNC'}]

    def begin_state(self, batch_size=0, func=mx.ndarray.zeros, **kwargs):
        states = []
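        for i, info in enumerate(self.state_info(batch_size)):
            # Merge caller overrides into the state description; replacing
            # info with kwargs would drop the required 'shape' entry.
            if info is not None:
                info.update(kwargs)
            else:
                info = kwargs
            states.append(func(name='%sh0_%d'%(self.prefix, i), **info))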
        return states

args_nof_examples = 1
args_seq_len = 10
args_feature_size = 1
args_output_size = 1
args_nof_batches = 1
args_batch_size = 1

hidden_unit_size = args_feature_size * args_output_size

data = mx.ndarray.zeros(shape=(args_nof_examples, args_seq_len, args_feature_size))
labels = mx.ndarray.ones((args_nof_examples, args_seq_len, args_feature_size))
generator = mx.io.NDArrayIter(data, labels, args_batch_size, last_batch_handle='discard')

with mx.cpu(0) as context:
    model = Test(hidden_unit_size, args_seq_len, args_feature_size, args_output_size)
    model.initialize(mx.init.Xavier(), ctx = context)

    loss = mx.gluon.loss.SoftmaxCrossEntropyLoss()

    states = model.begin_state(args_batch_size)
    for batch in generator:
        with mx.autograd.record():
            dis, gen = model(batch.data[0], states)
            L = loss(dis, batch.label[0])

Steps to reproduce

  1. python test.py

What have you tried to solve it?

  1. Reduced my real-world problem to the minimal example above, to find out whether I'm doing something wrong.

MoritzMaxeiner commented 6 years ago

After further experimentation it seems that the reshape operation

F.reshape(a, shape=(0,1,self.feature_size, self.output_size))

is the issue here, as it works fine when I move it outside of the hybrid_forward and perform it on the NDArray result.
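
For readers unfamiliar with the `0` in that target shape: in MXNet's reshape, a `0` copies the dimension at the same position from the input. Below is a minimal plain-Python sketch of that one rule, not MXNet's actual implementation. The helper name `resolve_reshape` is hypothetical, and the other special values (such as `-1`) are deliberately not handled. It illustrates that the reshape above is only valid when the hidden size equals `feature_size * output_size`, which the script sets up via `hidden_unit_size`.

```python
def resolve_reshape(in_shape, target):
    """Resolve a target shape that may contain 0 against in_shape.

    Hypothetical helper for illustration only. A 0 at position i copies
    in_shape[i]; every other entry must be a positive integer.
    """
    out = []
    for i, dim in enumerate(target):
        if dim == 0:
            if i >= len(in_shape):
                raise ValueError("0 at position %d has no input dimension to copy" % i)
            out.append(in_shape[i])
        elif dim > 0:
            out.append(dim)
        else:
            raise ValueError("special value %d is not handled in this sketch" % dim)

    def size(shape):
        total = 1
        for d in shape:
            total *= d
        return total

    # A reshape never changes the number of elements.
    if size(out) != size(in_shape):
        raise ValueError("cannot reshape %r into %r" % (tuple(in_shape), tuple(out)))
    return tuple(out)


# Shapes as in the example: batch of 1, hidden size 1 = feature_size * output_size.
print(resolve_reshape((1, 1), (0, 1, 1, 1)))
# A larger case: batch of 4, hidden size 6 = 2 * 3.
print(resolve_reshape((4, 6), (0, 1, 2, 3)))
```

If the hidden size did not factor as `feature_size * output_size`, the sketch would raise a `ValueError` instead of returning a shape.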

reminisce commented 6 years ago

This looks like an invalid memory access. With the latest master branch code, the example segfaults on Ubuntu without any error message, but on Mac it runs through. There is definitely some undefined behavior going on under the hood.

MoritzMaxeiner commented 6 years ago

@reminisce Hm, I've tried out 1.0.0rc0 now and so far I haven't been able to reproduce the issue in that version. If there's an issue in master, I'd assume that to be a (separate) regression?

reminisce commented 6 years ago

I think there is undefined behavior in the backend (C++) code, as the example behaves differently on different platforms with the latest master branch code. It would help us debug if you could simplify the example as much as possible, based on commit c1846ce.

MoritzMaxeiner commented 6 years ago

The issue is also present in 0.12.1. @reminisce I'll try to reduce it further.

MoritzMaxeiner commented 6 years ago

@reminisce I've removed the time unrolling (and the issue is still being triggered), but if I remove either of the two cells, or the reshape operation, the issue won't arise, so I don't think I can reduce it any further.

import mxnet as mx

class Test(mx.gluon.HybridBlock):
    def __init__(self, input_size, output_size, **kwargs):
        super(Test, self).__init__(**kwargs)
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_unit_size = output_size*input_size

        self.num_cells = 2
        with self.name_scope():
            self.cell_a = mx.gluon.rnn.GRUCell(self.hidden_unit_size, input_size=input_size)
            self.cell_b = mx.gluon.rnn.GRUCell(self.hidden_unit_size, input_size=self.hidden_unit_size)

    def hybrid_forward(self, F, inputs, states):
        prev_h = states[0]
        if F is mx.symbol:
            prev_h = F.split(prev_h, axis=0, num_outputs=self.num_cells, squeeze_axis=1)

        cell_a_next_h, _ = self.cell_a(inputs, [prev_h[0]])

        cell_b_next_h, _ = self.cell_b(prev_h[1], [prev_h[1]])

        b_output = cell_b_next_h.reshape(shape=(0, self.input_size, self.output_size))

        return cell_a_next_h, b_output, []

    def state_info(self, batch_size=0):
        return [{'shape': (self.num_cells, batch_size, self.hidden_unit_size), '__layout__': 'LNC'}]

    def begin_state(self, batch_size=0, func=mx.ndarray.zeros, **kwargs):
        states = []
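        for i, info in enumerate(self.state_info(batch_size)):
            # Merge caller overrides into the state description; replacing
            # info with kwargs would drop the required 'shape' entry.
            if info is not None:
                info.update(kwargs)
            else:
                info = kwargs
            states.append(func(name='%sh0_%d'%(self.prefix, i), **info))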
        return states

args_nof_examples = 1
args_nof_batches = 1
args_batch_size = 1

args_input_size = 1
args_output_size = 1

data = mx.ndarray.zeros(shape=(args_nof_examples, args_input_size))
labels = mx.ndarray.ones((args_nof_examples, args_input_size))
gen = mx.io.NDArrayIter(data, labels, args_batch_size, last_batch_handle='discard')

with mx.cpu(0) as context:
    model = Test(args_input_size, args_output_size)
    model.initialize(mx.init.Xavier(), ctx = context)

    loss = mx.gluon.loss.SoftmaxCrossEntropyLoss()

    states = model.begin_state(args_batch_size)
    for batch in gen:
        with mx.autograd.record():
            a, b, _ = model(batch.data[0], states)
            L = loss(b, batch.label[0])

szha commented 6 years ago

The above script didn't trigger any error on my side. I'm using recent commit 3c32f765

SuperLinguini commented 6 years ago

Proposed Label: "Python", "Bug", "Gluon"

MoritzMaxeiner commented 6 years ago

For what it's worth, I don't have this issue with MXNet 1.0.0 or 1.1.0, but I don't know whether that's because the root cause (which is unknown to me) has been fixed or because it just doesn't get triggered anymore, so I'm hesitant to close.

sxjscience commented 6 years ago

@MoritzMaxeiner The issue somehow does not exist any more and I'm going to close it. However, the root cause is still unclear.

@reminisce Do you have any idea about it?

reminisce commented 6 years ago

@MoritzMaxeiner Is the result produced by your script expected?

MoritzMaxeiner commented 6 years ago

@reminisce I'm unsure as to what you're asking, specifically. The script should ideally have terminated with exit code 0 and no stdout/stderr output, which is not what happened, so that was unexpected for me.

reminisce commented 6 years ago

@MoritzMaxeiner Sorry, I should have made the question clearer. I wanted to know whether you can reproduce the issue every time using the latest code/release. If not, is the numerical result as expected when it does not crash? This information would help us find out whether it's an error in the implementation of the logic or some careless typo resulting in invalid memory access.
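
One way to answer the "every time" part is to run the script repeatedly and tally the exit codes. The following is only a sketch using the Python standard library. The helper name `count_exit_codes` is hypothetical, and `test.py` is assumed to be the reproduction script from this issue. On POSIX systems a negative return code -N means the process was killed by signal N, so an abort shows up as -6 (SIGABRT) and a segfault as -11 (SIGSEGV).

```python
import subprocess
import sys
from collections import Counter


def count_exit_codes(cmd, runs=10):
    """Run cmd `runs` times and return a Counter of the exit codes.

    Hypothetical helper for illustration. Output of the child process
    is discarded; only the return code is recorded.
    """
    codes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes[result.returncode] += 1
    return codes
```

For example, `count_exit_codes([sys.executable, "test.py"], runs=20)` would return a tally in which anything other than a single entry for code 0 indicates that at least one run failed.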

MoritzMaxeiner commented 6 years ago

@reminisce Ah, ok. W.r.t. reproducing: I haven't encountered the issue in 1.0 or 1.1, so no, I can't reproduce it with the latest code/release. Concerning the numerical result: that's hard to say precisely, as it involves plenty of (inherently inexact) floating-point math (so I can't just calculate the operations another way and compare the results). If a guesstimate is of use to you: the RNN trains and predicts as I would expect it to (in 1.0 and 1.1).
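
A possible way around exact comparison, offered only as a sketch: run the same inputs through the block once without and once with hybridization, then compare the two outputs within a tolerance instead of for equality. The helper name `all_close` and the tolerance values are assumptions, and the outputs are assumed to have been flattened to plain Python lists first (for an NDArray `x`, something like `x.asnumpy().ravel().tolist()`).

```python
import math


def all_close(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if two flat lists of floats agree element-wise within tolerance.

    Hypothetical helper for illustration; the default tolerances are
    arbitrary choices and may need adjusting for a given model.
    """
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


# Values that differ only by floating-point rounding compare as close.
print(all_close([0.1 + 0.2, 1.0], [0.3, 1.0]))
# A difference well above the tolerance does not.
print(all_close([1.0], [1.001]))
```

Agreement between the two modes would not prove the result is correct, but a mismatch would point to a problem in the hybridized path.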

reminisce commented 6 years ago

@MoritzMaxeiner Thanks for the answers. Since the latest code is working fine, can we close the ticket for now? Please feel free to reopen it once it appears again.

MoritzMaxeiner commented 6 years ago

@reminisce Sure, I just didn't close for the reasons mentioned here and here.

vandanavk commented 6 years ago

@sandeep-krishnamurthy Please close this issue, as it is not reproducible on the latest code.

@MoritzMaxeiner @sxjscience Please feel free to reopen this issue if you see it again.