awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Using cell type lnlstm causes training error #283

Closed: moonscar closed 6 years ago

moonscar commented 6 years ago

Hi, I want to use the built-in LSTM to train a model. However, when I choose the cell type "lnlstm" there is an error. I guess the error is caused by a dimension mismatch between the i2h and h2h vectors. Here is my training script.

Training script:

    python -m sockeye.train --source /home/user/multi30k/train.en --target /home/user/multi30k/train.de --validation-source /home/user/multi30k/val.en --validation-target /home/user/multi30k/val.de --output test_train --encoder rnn --decoder rnn --num-layers '6:6' --rnn-num-hidden 512 --rnn-cell-type lnlstm --rnn-residual-connections --optimizer adam --initial-learning-rate 0.0002 --learning-rate-reduce-factor 0.7 --learning-rate-reduce-num-not-improved 8 --max-num-checkpoint-not-improved 32 --batch-size 4000 --batch-type word --rnn-attention-type mlp --rnn-dropout-inputs 0.1 --rnn-decoder-hidden-dropout 0.2 --use-tensorboard --checkpoint-frequency 4000 --rnn-attention-in-upper-layers --device-id

Error message:

    [11:33:47] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [11:33:47] src/operator/./slice_channel-inl.h:208: Check failed: dshape[real_axis] % param_.num_outputs == 0U (10 vs. 0) You are trying to split the 0-th axis of input tensor with shape [10,253,512] into num_outputs=100 evenly sized chunks, but this is not possible because 100 does not evenly divide 10

    Stack trace returned 10 entries:
    [bt] (0) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28965c) [0x7f5d6776565c]
    [bt] (1) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2964e97) [0x7f5d69e40e97]
    [bt] (2) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2688337) [0x7f5d69b64337]
    [bt] (3) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x249c52f) [0x7f5d6997852f]
    [bt] (4) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x249f039) [0x7f5d6997b039]
    [bt] (5) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2482aa9) [0x7f5d6995eaa9]
    [bt] (6) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2483564) [0x7f5d6995f564]
    [bt] (7) /home/user/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXExecutorSimpleBind+0x2250) [0x7f5d698cec80]
    [bt] (8) /home/user/miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f5d84fcf550]
    [bt] (9) /home/user/miniconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(ffi_call+0x1f5) [0x7f5d84fcecf5]

I modified sockeye/rnn.py:215 as follows:

        # 'True or' makes the first branch unconditional: the zeros_like
        # shape fix is allocated on every call, not only the first.
        if True or self._counter == 0:
            self._shape_fix = mx.sym.zeros_like(i2h)
        else:
            assert self._shape_fix is not None

        h2h = mx.sym.FullyConnected(data=states[0], weight=self._hW, bias=self._hB,
                                    num_hidden=self._num_hidden * 4,
                                    name='%sh2h' % name)

        gates = self._hN.normalize(self._shape_fix + h2h)

I always allocate self._shape_fix, so training can run without the exception.

fhieber commented 6 years ago

Hi, thanks for reporting this error. Are you saying that with your modification you can run without the exception, or do you observe the reported problem either way?

moonscar commented 6 years ago

With my modification, the operation gates = self._hN.normalize(self._shape_fix + h2h) runs without the exception on this small dataset.

But the original operation gates = self._iN.normalize(i2h) + self._hN.normalize(self._shape_fix + h2h) reports a memory overflow.

The full modification that runs without the exception is shown below:

        # 'True or' makes the first branch unconditional: the zeros_like
        # shape fix is allocated on every call, not only the first.
        if True or self._counter == 0:
            self._shape_fix = mx.sym.zeros_like(i2h)
        else:
            assert self._shape_fix is not None

        h2h = mx.sym.FullyConnected(data=states[0], weight=self._hW, bias=self._hB,
                                    num_hidden=self._num_hidden * 4,
                                    name='%sh2h' % name)

        gates = self._hN.normalize(self._shape_fix + h2h)

fhieber commented 6 years ago

Interesting. If I remember correctly, the reason we added this _shape_fix is that without it, MXNet cannot infer the shape of the h2h normalization, since the initial state of the LSTM is created with delayed shape information (https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/rnn/rnn_cell.py#L433). So the hack we did was to use zeros_like to get the shape of the input (i2h) at runtime. For efficiency we wanted to do this only for the first call to the cell when being unrolled, but we never checked whether this actually gives any gains.
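
To illustrate the idea, here is a toy sketch (made-up variable names, not Sockeye's actual cell code) of how zeros_like ties a late-shaped symbol to one whose shape is known at bind time:

    import mxnet as mx

    # Toy sketch of the zeros_like shape trick; all names are illustrative.
    i2h = mx.sym.Variable('i2h')      # shape known once the input is bound
    state = mx.sym.Variable('state')  # stands in for the late-shaped initial state
    h2h = mx.sym.FullyConnected(data=state, num_hidden=8, name='h2h')

    # zeros_like(i2h) takes i2h's shape at runtime; adding it to h2h pins
    # h2h (and anything computed from the sum) to that same shape.
    gates = mx.sym.zeros_like(i2h) + h2h

    # Shape inference now succeeds given only i2h's shape:
    _, out_shapes, _ = gates.infer_shape_partial(i2h=(4, 8))
    print(out_shapes)  # [(4, 8)]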

Maybe the right way would be to always add this shape_fix.

fhieber commented 6 years ago

Quick update on this: With the next release of mxnet, there will be a LayerNorm operator that should enable us to get rid of the shape fix logic entirely. I created a branch/commit for Sockeye that uses it: https://github.com/awslabs/sockeye/commit/1da36714b3a44cf673a99745d3fe781dfe11cfb2
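
For reference, a rough sketch (an assumption on my part, not the linked commit; variable names are made up) of how the gate computation could look with the built-in operator:

    import mxnet as mx

    # MXNet >= 1.2 ships mx.sym.LayerNorm, which participates in shape
    # inference like any other operator, so no zeros_like fix is needed.
    num_hidden = 512
    i2h = mx.sym.Variable('i2h')  # input projection, shape (batch, 4 * num_hidden)
    h2h = mx.sym.Variable('h2h')  # hidden projection, same shape

    # separate scale/shift parameters per projection (illustrative names)
    i_gamma = mx.sym.Variable('i_gamma', shape=(num_hidden * 4,))
    i_beta = mx.sym.Variable('i_beta', shape=(num_hidden * 4,))
    h_gamma = mx.sym.Variable('h_gamma', shape=(num_hidden * 4,))
    h_beta = mx.sym.Variable('h_beta', shape=(num_hidden * 4,))

    gates = (mx.sym.LayerNorm(data=i2h, gamma=i_gamma, beta=i_beta, axis=-1, eps=1e-5)
             + mx.sym.LayerNorm(data=h2h, gamma=h_gamma, beta=h_beta, axis=-1, eps=1e-5))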

I am closing this issue for now. Feel free to reopen if the problem re-occurs after we update to the next version of mxnet (1.2).