Kaixhin / nninit

Weight initialisation schemes for Torch7 neural network modules
MIT License

LSTM Support #3

Closed Kaixhin closed 8 years ago

Kaixhin commented 8 years ago

Although this library works fine with nngraph, it would be good to also support rnn - specifically the LSTM module. Given the new API introduced with https://github.com/Kaixhin/nninit/issues/2, how can the elements of the cell be initialised individually? Any feedback @nicholas-leonard?

A notable reason to support this would be to implement the large forget gate bias introduced in:

Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural computation, 12(10), 2451-2471.

The idea of nninit is to allow experimentation with initialisations and to free maintainers from having to implement "best practices" themselves.

bshillingford commented 8 years ago

Might be better to try to support https://github.com/Element-Research/rnn/blob/master/FastLSTM.lua instead, since it has batched GEMMs. This LSTM implementation has batched GEMMs as well, but uses nngraph: https://github.com/karpathy/char-rnn/blob/master/model/LSTM.lua.

You'd need a way to apply :init() to a submatrix. In the rnn.FastLSTM case, it'd be something like: for i from 0 to 3, apply the initialisation to the square submatrix (assuming the LSTM input size equals the state/output size) given by i2g.weight:narrow(1, 1+i*outputSize, outputSize). Note that the output dimension for weight matrices in nn.Linear is the 1st dim.
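For instance, a minimal sketch of that loop in plain Torch (the sizes and the plain :normal() fill are purely illustrative stand-ins for whatever scheme nninit would actually apply):

require 'rnn'

local inputSize, outputSize = 10, 10 -- assumed equal, as above
local fastlstm = nn.FastLSTM(inputSize, outputSize)
for i = 0, 3 do
  -- i-th gate block of the stacked input-to-gate weights (4*outputSize x inputSize)
  local block = fastlstm.i2g.weight:narrow(1, 1 + i*outputSize, outputSize)
  block:normal(0, 0.1) -- stand-in for the actual initialisation scheme
end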

nicholas-leonard commented 8 years ago

@Kaixhin So I guess we could just pluck them out from the respective sub-components of the LSTM or FastLSTM modules. For LSTM, we could use the internal references to self.inputGate (https://github.com/Element-Research/rnn/blob/master/LSTM.lua#L53) and such.

@bshillingford is right about how to do this for FastLSTM.

If you make this extension of the rnn package, I will make nninit a default dependency of the rnn package.

Kaixhin commented 8 years ago

@nicholas-leonard I realised that the internal references were sufficient for nn.LSTM, but suspected something more powerful was needed in general. As @bshillingford pointed out, that something is indexing. I've introduced this for the basic functions in a new branch with 24cf519, along with a nn.FastLSTM example. Since this is general, I don't see any need to introduce nninit as a dependency for rnn, but if you want to introduce it as a full or optional dependency that's fine too.

The main issue with nninit as a dependency is that the API is still in flux, so it would be good to settle on something (even if breaking changes are introduced along with incrementing the luarocks version number). Points to consider and give feedback on:

bshillingford commented 8 years ago

@Kaixhin Maybe there's a way to more strongly decouple the modified tensors from the nn Modules.

Re: passing initialisation schemes as strings, what do you think about putting the schemes into a namespace and then accepting them directly as arguments? E.g. it might look like:

local nninit = require 'nninit'
local m = nn.Linear(5,5):wInit(nninit.orthogonal)

where that's either the function itself or a wrapper object. Subjectively cleaner.

You've done some decoupling already but as a private API. Publicising the private API (maybe in a sub-namespace?) for initialisation primitives might look something like:

local nninit = require 'nninit'
local lstm_input = nn.Linear(input_size, rnn_size*4) -- input weights
for _, wt in ipairs(lstm_input.weight:split(rnn_size, 1)) do -- rnn_size-row chunks along dim 1
  nninit.inits.xavier(wt)
end
local lstm_recur = nn.Linear(rnn_size, rnn_size*4) -- recurrent weights
for _, wt in ipairs(lstm_recur.weight:split(rnn_size, 1)) do
  nninit.inits.orthogonal(wt)
end

Less concise, but more flexible... there might be a way to find a good middle ground here. Perhaps with a function as the arg you can generalise more:

local lstm_recur = nn.Linear(rnn_size, rnn_size*4)
    :init(function(m) return m.weight:narrow(1, 1, rnn_size) end, ...)
    :init(function(m) return m.weight:narrow(1, rnn_size + 1, rnn_size) end, ...)
    :init(function(m) return m.weight:narrow(1, 2*rnn_size + 1, rnn_size) end, ...)...

where the first arg to init is something to initialise (let's call it target). If target is a string like bias or weight, you'd operate on self[target], and if target is a function, the function takes self and returns the tensor to modify. I guess that could be shortened by letting target return multiple things:

local lstm_recur = nn.Linear(rnn_size, rnn_size*4)
    :init(function(m) return unpack(m.weight:split(rnn_size, 1)) end, ...)
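
A minimal sketch of that dispatch (illustrative only, not nninit's actual implementation); wrapping the call in a table constructor also collects multiple return values from a function target:

-- Resolve the target argument into a table of tensors to initialise
local function resolveTarget(self, target)
  if type(target) == 'string' then
    return {self[target]} -- e.g. 'weight' or 'bias'
  elseif type(target) == 'function' then
    return {target(self)} -- collects one or more returned tensors
  else
    error('target must be a string or a function')
  end
end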

Having a single init might clean up the code a bit (in the sense of duplication and coupling) while improving flexibility. This might be making things too general for a small number of use cases, since almost all nn modules only have weight and bias. Also not sure how you'd address addConstant and mulConstant in this case.

Thoughts on these random ideas?

Kaixhin commented 8 years ago

Would like to have init instead of wInit and bInit, and the first argument could be w or b. Second argument can be the namespaced function, e.g. nninit.orthogonal(). All optional arguments within this should be passed through a table, such that (in the case of xavier) dist does not have to be specified in order to specify gain.

Functions for the target are powerful but lengthy - is the indexing operator not general enough? In this case the indices (either in table or binary mask form) would be the third and final argument to init. Concretely:

lstm.i2g
    :init('w', nninit.kaiming({dist = 'uniform', gain = {gain = 'lrelu', leakiness = 0.3}}))
    :init('b', nninit.constant(1), {{3*hiddenSize-1, 3*hiddenSize}})

eye is an example of where exposing initialisation primitives would fail, as it depends on the module type. Another potential example would be initialising nn.SpatialConvolution with filters used in image processing (such as a Gaussian filter) - these initialisations would not make sense for linear layers (and many already do not make sense for biases). A further argument for this would be that the concept of abstraction in Torch is the module/layer, where a conscious decision has been made to tie weights and biases to modules rather than have them act as standalone tensors.
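
As a concrete illustration of the Gaussian-filter case, in plain Torch (outside nninit; the layer sizes are arbitrary and the weight is assumed to have the usual nOutputPlane x nInputPlane x kH x kW layout):

require 'nn'
require 'image'

local conv = nn.SpatialConvolution(1, 16, 5, 5)
local g = image.gaussian(5) -- 5x5 Gaussian kernel from the image package
for i = 1, conv.weight:size(1) do
  conv.weight[i][1]:copy(g) -- same kernel for every output plane, purely for illustration
end

This kind of filter-shaped initialisation has no sensible analogue for a linear layer's weight matrix, which is the point above.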

bshillingford commented 8 years ago

Good points. Indexing should be general enough in most cases, but I'm thinking of other obscure cases: suppose I write a layer for learned biases of some arbitrary shape, and its parameters don't have the name 'weight' or 'bias', or they lie inside a table inside the module. It would be useful to expose primitives if you want to do things like initialize conv filters to multiple scales of Gaussian filters too, but again this is an uncommon case and I suppose indexing solves this anyway (I think?).

I agree re Torch's philosophy, but I see your library as a way to address that philosophy's shortcomings. That all said, I agree with your comments - thanks for sharing the library.

Kaixhin commented 8 years ago

I would like to be convinced that this could let weights be treated more generally and hence overcome a shortcoming with Torch, but this seems tricky. For example, nninit.inits.xavier(wt) depends on the fan-in and fan-out of the module - these could be worked out from the tensor dimensions, but then eye would still fail to distinguish nn.Linear from nn.TemporalConvolution.
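As an illustration (not part of nninit; field names follow the standard nn modules), the fans a scheme like xavier needs differ by module type even though both weights are 2D:

require 'nn'

-- Fan-in/fan-out as needed by xavier/kaiming-style schemes
local function fans(module)
  local typename = torch.type(module)
  if typename == 'nn.Linear' then
    return module.weight:size(2), module.weight:size(1) -- weight is outputSize x inputSize
  elseif typename == 'nn.TemporalConvolution' then
    return module.kW * module.inputFrameSize, module.outputFrameSize -- weight is outputFrameSize x (kW*inputFrameSize)
  end
end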

I do see the case for more arbitrary parameters and am happy to go ahead with target being a string or function. I think that in the more general case if multiple tensors need to be returned then they should be placed in a table (rather than being unpacked), but in what situation would you provide a sequence of tensors when they would all be initialised by the same method?

One simple way to do more complicated initialisations is to introduce a copy function that just copies a supplied tensor matching the target size. Rather than operating directly on the weights (for example), using this within :init() would keep the convenience of chaining.
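
A hedged sketch of how such a copy initialiser could look in a chain, following the argument style proposed above (nninit.copy here is hypothetical, not a confirmed part of the API):

require 'nn'
local nninit = require 'nninit'

local custom = torch.eye(10) -- any tensor matching the target size
local layer = nn.Linear(10, 10)
    :init('w', nninit.copy(custom)) -- hypothetical copy initialiser
    :init('b', nninit.constant(0))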

Kaixhin commented 8 years ago

OK, the new API is almost ready on the indexing branch. The accessor argument is either a string, a string and a set of indices, or a function. The "primitives" are somewhat exposed, and could potentially be used with nil as the first argument (a bit messy, but non-trivial initialisations in this case seem tied to the layer type/properties). A copy method has been added for custom initialisations.
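
Roughly, the three accessor forms might look as follows (a sketch only; the exact names, index placement, and syntax on the indexing branch may differ):

local nninit = require 'nninit'
local hiddenSize = 10
local layer = nn.Linear(hiddenSize, hiddenSize)

layer:init('w', nninit.xavier({dist = 'uniform'})) -- plain string accessor
layer:init('b', nninit.constant(1), {{1, hiddenSize}}) -- string plus a set of indices
layer:init(function(m) return m.weight:narrow(1, 1, hiddenSize) end, nninit.orthogonal()) -- function accessor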

eye and orthogonal need fixing as they are the most tied to the module. The example will also need revising once those two methods are fixed.

Kaixhin commented 8 years ago

The example is now updated, and eye and orthogonal now simply process module weights only. I will merge into master tomorrow unless there is any further API scoping to be done.