Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

Sequencer remember/forget with "eval" mode #19

Closed boknilev closed 9 years ago

boknilev commented 9 years ago

Hi,

Recent changes to the Sequencer remember/forget mechanism introduced modes like "both" and "eval", which are very convenient. However, in "eval" mode, a forward step during evaluation sets the maximum number of BPTT steps (the rho value) to the length of the input. A subsequent epoch of training on a sequence of a different length then fails in the backward step. Before the change, remember() worked fine.

The reason is probably the setting of rho in the recurrent module (in this case LSTM), which then causes the backward step during training to stop before reaching the beginning of the sequence. See LSTM:updateGradInputThroughTime().

Note: I know that the README says it is recommended to set mode="both" for LSTM, but I prefer the "eval" mode because each training example is independent. In any case, I suppose both modes should be possible for any AbstractRecurrent instance.

A minimal working example with LSTMs:

lstm = nn.LSTM(5,5)
seq = nn.Sequencer(lstm)
inputTrain = {torch.randn(5), torch.randn(5), torch.randn(5)}
inputEval = {torch.randn(5)}

modes = {'both', 'eval'}
for i, mode in ipairs(modes) do
  print('\nmode: ' .. mode)
  seq:remember(mode)

  -- do one epoch of training
  seq:training()
  seq:forward(inputTrain)
  seq:backward(inputTrain, inputTrain)

  -- evaluate
  seq:evaluate()
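  -- in 'eval' mode, the next forward sets rho to #inputEval (i.e. 1)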
  seq:forward(inputEval)

  -- do another epoch of training
  seq:training()
  seq:forward(inputTrain)
  -- this will fail when mode = 'eval'
  seq:backward(inputTrain, inputTrain)
end

Could you look into that?

Many thanks for your help.

nicholas-leonard commented 9 years ago

I love these corner cases. Looking into it.

nicholas-leonard commented 9 years ago

Fixed in https://github.com/Element-Research/rnn/commit/e1f0c5049b8c41ad952d7b252b74035987b4a02b. Thank you for the detailed bug report!

boknilev commented 9 years ago

Thank you for the quick solution!

boknilev commented 9 years ago

Hi,

I've noticed a strange problem when using the new version. Before the update to the remember/forget mechanism, I was training a model with LSTMs (in a Sequencer) and got good training behaviour: the training error decreased continuously, and the validation error decreased for a while and then converged. After the update, the training error stops decreasing after a couple of epochs and starts increasing.

I know this is a very high-level description, but do you have any idea what might have changed? I suspect there is a problem with the gradients, although all the tests pass.

Thanks for your help.

nicholas-leonard commented 9 years ago

Let me start an LSTM experiment on PennTreeBank to see how it does.

nicholas-leonard commented 9 years ago

Seems to work on my end. Do you have a particular use case you want me to test? Like remember('eval') with an LSTM?

boknilev commented 9 years ago

Hi,

It's difficult to debug and narrow down the problem to a simple example, but I'll try. The general symptom is that before the change I was seeing good convergence and decreasing training errors, and after the change I see the training error first decreasing, then increasing continuously. I'll see if I can recreate the problem in a simple example.

nicholas-leonard commented 9 years ago

Thanks boknilev.

boknilev commented 9 years ago

Hi,

I still can't track down the source of the problem. I did notice the following behaviour: after the code update, my gradients are much larger than before (an order of magnitude larger). I believe the problem may be with some nn code update rather than with rnn, because I tried reverting to a previous rnn version (by downloading the .lua files from the repo) and still had the problem.

Do you have any suggestion as to what might cause this behaviour? Do you know how to revert to a previous state of the nn package? I can't simply download the .lua files from a previous repo version, because the package also needs compilation, which I only know how to do with "luarocks install nn".

Thanks for your help.

nicholas-leonard commented 9 years ago

git clone git@github.com:torch/nn.git
cd nn
git checkout [commit hash]
luarocks make rocks/[tab]
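
(In the last command, [tab] stands for shell tab-completion of the rockspec filename; in the torch/nn repository that typically resolves to something like rocks/nn-scm-1.rockspec.)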

Maybe you should try a smaller learning rate. What does your model look like?
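
For reference, a minimal sketch of how the gradient magnitudes mentioned above could be compared across versions, assuming the usual getParameters()-based training step; the sizes, the dummy gradOutput, and the learning rate are illustrative, not from this thread:

require 'rnn'

-- same toy setup as the example earlier in this thread
local model = nn.Sequencer(nn.LSTM(5, 5))
local params, gradParams = model:getParameters()

local input = {torch.randn(5), torch.randn(5), torch.randn(5)}
-- dummy gradOutput standing in for a criterion's gradInput
local gradOutput = {torch.randn(5), torch.randn(5), torch.randn(5)}

model:zeroGradParameters()
model:forward(input)
model:backward(input, gradOutput)

-- if this norm is an order of magnitude larger than under the old code,
-- the gradients themselves (not just the learning rate) have changed
print('gradient norm: ' .. gradParams:norm())

-- a smaller learning rate only compensates for larger gradients
model:updateParameters(0.01)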

boknilev commented 9 years ago

Hi,

Thanks. I'll try that.

Yes, a smaller learning rate and a larger dropout rate help a bit. I still don't get as good performance as I had prior to the code update, though.

It's an LSTM autoencoder for sentences, à la sequence-to-sequence learning. The model is roughly an LSTM in the encoder and an LSTM in the decoder, with dropout layers and a Softmax over the decoded words. The encoder-decoder interface currently just uses the final output of the encoder as the first input to the decoder, although I'm aware of issue https://github.com/Element-Research/rnn/issues/16.
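
For context, a rough sketch of that kind of encoder-decoder; all layer sizes and the dropout rate are illustrative assumptions, not the reporter's actual configuration:

require 'rnn'

local inputSize, hiddenSize, vocabSize = 100, 256, 10000

-- encoder: LSTM applied to each step of the input sequence
local encoder = nn.Sequencer(nn.Sequential()
   :add(nn.LSTM(inputSize, hiddenSize))
   :add(nn.Dropout(0.5)))

-- decoder: LSTM followed by a softmax over the vocabulary
local decoder = nn.Sequencer(nn.Sequential()
   :add(nn.LSTM(hiddenSize, hiddenSize))
   :add(nn.Dropout(0.5))
   :add(nn.Linear(hiddenSize, vocabSize))
   :add(nn.LogSoftMax()))

-- the final output of the encoder is fed as the first input to the decoder
local encInput = {torch.randn(inputSize), torch.randn(inputSize), torch.randn(inputSize)}
local encOutput = encoder:forward(encInput)
local decInput = {encOutput[#encOutput], torch.randn(hiddenSize)}
local decOutput = decoder:forward(decInput)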

nicholas-leonard commented 9 years ago

You really have to find an older version where your code was working. You could also try older versions of the rnn package if trying older versions of nn doesn't work.

nicholas-leonard commented 9 years ago

Closing for now.