Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

Lower accuracy when using SeqGRU instead of GRU with Sequencer #343

Closed: sidharthms closed this issue 7 years ago

sidharthms commented 7 years ago

Has anyone run into this before? I haven't read SeqGRU's source carefully, but I assume it was written to be structurally identical to a GRU wrapped in a Sequencer. Are there any known differences that might be causing this?

The weight initialization seems to be different, but I'm not sure whether that's the problem. I was going to try initializing SeqGRU's weights the way the Linear layers inside GRU are initialized; see the sketch after the code below.

To be concrete, here's how I'm using them:

      if opt.optimized_pass1 then
         print('OPTIMIZED PASS 1')
         pass1:add(nn.SeqGRU(opt.wordvec_size, hiddenSize))
         pass1:add(Select(1, -1)) -- select the last time step along the sequence dimension of the output tensor
      else
         print('REGULAR PASS 1')
         pass1:add(nn.Sequencer(nn.GRU(opt.wordvec_size, hiddenSize)))
         pass1:add(nn.SelectTable(-1)) -- select the last entry of the output table, i.e. the last time step
      end
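
For reference, here is a rough sketch of the Linear-style re-initialization I have in mind. The layout of SeqGRU's flat weight (input-to-hidden rows stacked above hidden-to-hidden rows) is an assumption on my part, not something I've verified against SeqGRU.lua:

-- Sketch only: apply nn.Linear's default fan-in rule,
-- uniform(-1/sqrt(fanIn), 1/sqrt(fanIn)), to a SeqGRU's flat parameters.
-- Assumes weight rows 1..inputSize are input-to-hidden (fan-in = inputSize)
-- and rows inputSize+1..inputSize+hiddenSize are hidden-to-hidden
-- (fan-in = hiddenSize); check SeqGRU.lua before relying on that layout.
local function linearStyleInit(seqGru, inputSize, hiddenSize)
   local inStd = 1 / math.sqrt(inputSize)
   local hidStd = 1 / math.sqrt(hiddenSize)
   seqGru.weight:narrow(1, 1, inputSize):uniform(-inStd, inStd)
   seqGru.weight:narrow(1, inputSize + 1, hiddenSize):uniform(-hidStd, hidStd)
   seqGru.bias:uniform(-hidStd, hidStd)
   return seqGru
end

pass1:add(linearStyleInit(nn.SeqGRU(opt.wordvec_size, hiddenSize),
                          opt.wordvec_size, hiddenSize))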
JoostvDoorn commented 7 years ago

It's true that the initialization is different. Other than that, SeqGRU should behave the same as GRU; the test that checks this is defined here: https://github.com/Element-Research/rnn/blob/master/test/test.lua#L6233

You can convert a SeqGRU to a GRU, so you could check how your model behaves if you initialize it like this:

print('REGULAR PASS 1')
pass1:add(nn.Sequencer(nn.SeqGRU(opt.wordvec_size, hiddenSize):toGRU()))
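
If the converted model then reaches the same accuracy as the plain Sequencer + GRU setup, the remaining difference comes down to initialization; if the gap persists, something structural differs between the two implementations.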
sidharthms commented 7 years ago

It looks like initialization was the problem. When I created a GRU from a SeqGRU (so the GRU started from SeqGRU's initialization), my accuracy scores dropped by about 3%. Conversely, when I copied the weights and biases from a GRU into a SeqGRU, it was able to reach better optima.

I'm having similar issues with cuDNN's GRU as well. It looks like initializing all gates with the same standard deviation doesn't work very well, so I'll probably file an issue on that project too. For now I'll just copy over the params.

Do you think we should change SeqGRU's current default initialization? Or is the difference perhaps unnoticeable in most cases?

JoostvDoorn commented 7 years ago

It would be great if you could figure out a good default. You could try self.bias:normal(0, std), as that is how @guillitte used it.
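
For example, something like this (a sketch only; the example sizes and the default scale for std are my assumptions, not the library's):

-- Sketch: re-draw a SeqGRU's parameters with a normally distributed bias,
-- as suggested above. The default std below is an assumed placeholder.
local inputSize, hiddenSize = 100, 128 -- example sizes
local seqGru = nn.SeqGRU(inputSize, hiddenSize)
local function resetWithNormalBias(std)
   std = std or 1 / math.sqrt(hiddenSize) -- assumed default scale; tune as needed
   seqGru.weight:normal(0, std)
   seqGru.bias:normal(0, std)
end
resetWithNormalBias()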

nicholas-leonard commented 7 years ago

This is how SeqGRU initializes parameters: https://github.com/Element-Research/rnn/blob/master/SeqGRU.lua#L79-L87.

GRU, on the other hand, just uses the default module initializations: https://github.com/Element-Research/rnn/blob/master/GRU.lua#L42-L130. So each of its internal Linear layers ends up with nn.Linear's default reset: https://github.com/torch/nn/blob/master/Linear.lua#L39-L40.
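
For reference, that Linear default boils down to a fan-in scaled uniform draw (paraphrasing the linked torch/nn source):

-- nn.Linear's default reset, paraphrased: sample weight and bias uniformly,
-- scaled by the layer's fan-in (the number of input columns).
local function linearDefaultReset(linear)
   local stdv = 1 / math.sqrt(linear.weight:size(2)) -- fan-in
   linear.weight:uniform(-stdv, stdv)
   linear.bias:uniform(-stdv, stdv)
   return linear
end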

@JoostvDoorn @sidharthms Do you think we should modify SeqGRU to use the same initialization as GRU?

sidharthms commented 7 years ago

It seems to work fine now; the problem was probably due to some other issue on my end.