Closed moonscar closed 6 years ago
Hi, thanks for reporting this error. Are you saying that with your modification you can run without the exception, or do you observe the reported problem either way?
With my modification

```python
gates = self._hN.normalize(self._shape_fix + h2h)
```

the operation above runs without the exception on this small dataset. But the original operation

```python
gates = self._iN.normalize(i2h) + self._hN.normalize(self._shape_fix + h2h)
```

causes a memory overflow. The full modification, which runs without the exception, is shown below:
```python
if True or self._counter == 0:
    self._shape_fix = mx.sym.zeros_like(i2h)
else:
    assert self._shape_fix is not None
h2h = mx.sym.FullyConnected(data=states[0], weight=self._hW, bias=self._hB,
                            num_hidden=self._num_hidden * 4,
                            name='%sh2h' % name)
gates = self._hN.normalize(self._shape_fix + h2h)
```
Interesting. If I remember correctly, the reason we added this _shape_fix is that without it, MXNet cannot infer the shape of the h2h normalization (since the initial state of the LSTM is created somehow with delayed shape information: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/rnn/rnn_cell.py#L433). So the hack we did was to use zeros_like to get the shape of the input (i2h) at runtime. For efficiency we wanted to do this only for the first call to the cell when being unrolled, but we never checked whether this actually gives any gains. Maybe the right way would be to always add this shape_fix.
Quick update on this: With the next release of mxnet, there will be a LayerNorm operator that should enable us to get rid of the shape fix logic entirely. I created a branch/commit for Sockeye that uses it: https://github.com/awslabs/sockeye/commit/1da36714b3a44cf673a99745d3fe781dfe11cfb2
I am closing this issue for now. Feel free to reopen if the problem re-occurs after we have updated to the next version of mxnet (1.2).
Hi, I want to use the built-in LSTM to train a model. However, when I choose the cell type "lnlstm" there is an error. I guess the error is caused by different dimensions of the i2h and h2h vectors. Here is my training script.
training script
python -m sockeye.train --source /home/user/multi30k/train.en --target /home/user/multi30k/train.de --validation-source /home/user/multi30k/val.en --validation-target /home/user/multi30k/val.de --output test_train --encoder rnn --decoder rnn --num-layers '6:6' --rnn-num-hidden 512 --rnn-cell-type lnlstm --rnn-residual-connections --optimizer adam --initial-learning-rate 0.0002 --learning-rate-reduce-factor 0.7 --learning-rate-reduce-num-not-improved 8 --max-num-checkpoint-not-improved 32 --batch-size 4000 --batch-type word --rnn-attention-type mlp --rnn-dropout-inputs 0.1 --rnn-decoder-hidden-dropout 0.2 --use-tensorboard --checkpoint-frequency 4000 --rnn-attention-in-upper-layers --device-id
error message
I modified sockeye/rnn.py:215. I always allocate self._shape_fix so training can run without the exception.