IDSIA / brainstorm

Fast, flexible and fun neural networks.

Merge forward and backward buffers #21

Closed untom closed 9 years ago

untom commented 9 years ago

I wanted to continue the discussion about merging the forward- and backward buffers into a single buffer. The main advantage I see in this is that the code will be easier to read: whenever I show people brainstorm code, one of the first questions that pops up is "what do these buffers mean/why are there 2 of them" -- admittedly my sample size so far is 2 (I myself was also confused, so let's make it 3 ;) ). Also a pattern that emerges often within the code is something along the lines of:

    Ha = forward_buffers.internals.Ha
    dHa = backward_buffers.internals.Ha

Which indicates, to me at least, that backward_buffers.internals.Ha should actually be named buffers.internals.dHa anyway.

If we implement this change, there are two ways to go:

  1. Explicitly list the 'backward buffers' in the get_internal_structure:

    def get_internal_structure(self):
       internals = OrderedDict()
       internals['Ha'] = ShapeTemplate('T', 'B', self.size)
       internals['dHa'] = ShapeTemplate('T', 'B', self.size)
       return internals
  2. Have a flag for each buffer requested via get_internal_structure, which indicates whether we need a "backward" buffer as well. In that case, just append a "d" to the name of the "forward buffer" we created. e.g.:

    def get_internal_structure(self):
       internals = OrderedDict()
       internals['Ha'] = ShapeTemplate('T', 'B', self.size, needs_gradient_buffer=True)
       return internals

The advantage of the 2nd approach is that it would still be fairly easy to lazily allocate backward buffers, and that it's less wordy. The downside is that it might be a bit "magical". That is why I actually prefer the first approach, even though it wastes memory and requires more typing. Typically, if you have enough memory to run a forward pass, you'll have enough for the backward pass as well, IMO.
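
For what it's worth, here is a rough sketch of how the flag in variant 2 could drive lazy allocation. allocate_internals, the allocate callback, and the needs_gradient_buffer attribute are hypothetical, not existing brainstorm code:

    from collections import OrderedDict

    def allocate_internals(layer, allocate, training=True):
        """Allocate the internal buffers of `layer`; gradient buffers only when training."""
        buffers = OrderedDict()
        for name, template in layer.get_internal_structure().items():
            buffers[name] = allocate(template)
            if training and getattr(template, 'needs_gradient_buffer', False):
                # lazily create the matching "d"-prefixed gradient buffer
                buffers['d' + name] = allocate(template)
        return buffers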

Anyways... thoughts?

Qwlouse commented 9 years ago

> I wanted to continue the discussion about merging the forward- and backward buffers into a single buffer. The main advantage I see in this is that the code will be easier to read: whenever I show people brainstorm code, one of the first questions that pops up is "what do these buffers mean/why are there 2 of them" -- admittedly my sample size so far is 2 (I myself was also confused, so let's make it 3 ;) ).

That's a good argument. So let's see what we can do.

For the internal buffers, both variants you propose are feasible. If we are worried about memory, we could even add a flag to variant 1 that marks a buffer as a gradient.

For the weights the issue is trickier, because for the trainer we assume that all the weights are laid out together and that an identically structured gradient buffer exists. We use that, for example, by adding the gradient buffer times a learning rate to the weights buffer. A similar problem exists for the output and input buffers. This means that brainstorm needs to know which buffers are gradients of which other buffers, and also has to make sure they actually exist.
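
To make that assumption concrete, here is a simplified numpy sketch (not the actual trainer code) of what the trainer relies on: one flat weight buffer and an identically laid-out gradient buffer, updated in a single vectorized step:

    import numpy as np

    # all parameters of the network concatenated into one flat buffer,
    # plus a gradient buffer with exactly the same layout
    flat_parameters = np.zeros(10000)
    flat_gradients = np.random.randn(10000)
    learning_rate = 0.01

    # a plain SGD step only works because the two buffers line up element-wise
    flat_parameters -= learning_rate * flat_gradients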

That pretty much rules out variant 1, unless we want to explicitly add a gradient_for='Ha' field to every buffer.
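
For illustration, in variant 1 that could look roughly like this (gradient_for is a hypothetical keyword, not an existing ShapeTemplate argument):

    internals['dHa'] = ShapeTemplate('T', 'B', self.size, gradient_for='Ha')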

Variant 2 would still work. We could try to make it more explicit by requiring a name for the gradient buffer via an additional parameter:

internals['Ha'] = ShapeTemplate('T', 'B', self.size, gradient_buffer_name='dHa')

Or by requiring the name to be a tuple:

internals[('Ha', 'dHa')] = ShapeTemplate('T', 'B', self.size)

Another issue I want to call attention to is that we often use convenient tuple unpacking:

W, bias = forward_buffers.parameters
dW, dbias = backward_buffers.parameters

And that would have to be converted either to:

W, dW, bias, dbias = buffers.parameters

which would clutter the forward pass, or to:

W = buffers.parameters.W
bias = buffers.parameters.bias
dW = buffers.parameters.dW
dbias = buffers.parameters.dbias

This problem is especially visible for the LSTM forward pass:

(Wz, Wi, Wf, Wo,
 Rz, Ri, Rf, Ro,
 bz, bi, bf, bo) = forward_buffers.parameters
Za, Zb, Ia, Ib, Fa, Fb, Oa, Ob, Ca, Cb = forward_buffers.internals

untom commented 9 years ago

One solution would then be to store them in a new gradients (and internal_gradients?) field instead of parameters:

W, bias = buffers.parameters
dW, dbias = buffers.gradients

(But I'm not sure if that just shifts the "forward vs backward buffer" complexity into a "buffer subfields" complexity.)

Qwlouse commented 9 years ago

That is indeed something we can do. We can have a single buffer with a structure like this, where the entries marked with a star are only available during the backward pass:

buffers
    parameters
        W
        bias
    gradients*
        W
        bias
    inputs
        default
    input_deltas*
        default
    outputs
        default
    output_deltas*
        default
    internals
        Ha
        dHa

Then the setup for the backward pass of the fully connected layer would look like this:

W, W_bias = buffers.parameters
dW, dW_bias = buffers.gradients
inputs = buffers.inputs.default
outputs = buffers.outputs.default
in_deltas = buffers.input_deltas.default
out_deltas = buffers.output_deltas.default
Ha, dHa = buffers.internals

To me this looks rather clean and readable. But then again, I came up with the forward and backward buffers in the first place.
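
As a rough illustration of how those views would be used, here is a plain numpy sketch of the backward pass. The stand-in arrays, the sigmoid activation, and a W of shape (in_size, out_size) are assumptions; this is not the actual handler-based implementation:

    import numpy as np

    T, B, in_size, out_size = 4, 2, 3, 5           # time, batch and layer sizes (arbitrary)
    rng = np.random.RandomState(0)

    # stand-ins for the buffer views unpacked above
    W, dW = rng.randn(in_size, out_size), np.zeros((in_size, out_size))
    W_bias, dW_bias = rng.randn(out_size), np.zeros(out_size)
    inputs, in_deltas = rng.randn(T, B, in_size), np.zeros((T, B, in_size))
    out_deltas = rng.randn(T, B, out_size)
    Ha = inputs.dot(W) + W_bias                    # pre-synaptic activations
    outputs = 1.0 / (1.0 + np.exp(-Ha))            # assuming a sigmoid activation
    dHa = np.zeros_like(Ha)

    # backward pass: flatten time and batch dimensions for the matrix products
    x = inputs.reshape(-1, in_size)
    dHa[:] = out_deltas * outputs * (1 - outputs)  # gradient through the sigmoid
    da = dHa.reshape(-1, out_size)

    dW += x.T.dot(da)                              # weight gradient
    dW_bias += da.sum(axis=0)                      # bias gradient
    in_deltas.reshape(-1, in_size)[:] += da.dot(W.T)  # deltas propagated to the inputs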

This layout would have the added benefit that we can do away with the implicit internal gradients. (I snuck that in by having both Ha and dHa in the internals buffer instead of having an internal_gradients section.) They are often not needed, and if they are, they can be set up explicitly. BTW: we should probably think about renaming Ha or getting rid of it anyway.

We could rename the entries of gradients to dW and dbias, but I'd prefer not to: if they share the same name, the correspondence is very clear, and their different roles are obvious from being part of parameters and gradients respectively.

flukeskywalker commented 9 years ago

To clarify, we came up with the above proposal together, so I agree with it. @untom, what do you think?

untom commented 9 years ago

I agree that it looks nice, I'll ask around tomorrow what Karin & Andi think.

> BTW: we should probably think about renaming Ha or getting rid of it anyway.

In some rare cases it's nice to be able to look at the presynaptic input. But we could have a "fast/memory-efficient" version of the fully-connected layer that applies the activation function (and while we're at it, maybe also dropout?) in place to save some memory. That version should probably even be the default, because it's rare to really need Ha.
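
A tiny numpy sketch of the in-place idea (illustrative only, not brainstorm code): the nonlinearity overwrites the pre-synaptic buffer, so Ha never has to be stored separately:

    import numpy as np

    rng = np.random.RandomState(0)
    X, W, bias = rng.randn(5, 3), rng.randn(3, 4), rng.randn(4)

    H = X.dot(W) + bias        # pre-synaptic activation, written into the output buffer
    np.maximum(H, 0, out=H)    # in-place ReLU: no separate Ha buffer is kept around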

As for the name changes, maybe the a/b suffixes for pre-/post-synaptic activations are a bit unintuitive. I'd find it clearer if the pre-synaptic one had a suffix like pre (or whatever) and the "correct" one had no suffix (it doesn't matter much for the fully-connected layer, but e.g. in the LSTM case it allows a more direct mapping from letter-used-in-the-equation to variable-used-in-code).

However, in the case of the fully-connected layer this brings up a slight inconsistency: the weights are called W, the presynaptic hidden activations Ha, and in the LSTM case most gate activations are single-character names as well, yet the input is always referred to as inputs and the output as outputs. Arguably, names like X for the input and Y/H for the output would be more consistent.

flukeskywalker commented 9 years ago

Indeed, we already plan to change the names of the quantities to be more consistent.

Also, the current plan is to make the FullyConnectedLayer apply activation functions in place, so it will only support certain activation functions.

This way, one can use FullyConnectedLayer in linear mode followed by an activation function layer in other cases.

untom commented 9 years ago

Awesome! FYI, Andreas likes the new buffer design as well.

Qwlouse commented 9 years ago

I got started on changing the buffer layout. But there are some questions remaining.

So far we had:

net.buffer.backward.Layer_5.parameters.W

Where net.buffer was the BufferManager object, and everything from backward onwards was a BufferView. Now this would become:

net.buffer.Layer_5.gradients.W

But I kind of dislike net.buffer being a BufferManager object now. That aversion is partly because it would lead to weird name clashes if you named your layer layout, resize, or any of the other member names. So I'd suggest "hiding" the BufferManager (e.g. as net._buffer_manager) and having net.buffer be a BufferView object.

We also considered shortening the access pattern even further to:

net.Layer_5.gradients.W

But we found this to be confusing, because net.Layer_5 would not be a Layer instance but a BufferView. (Also, we'd again have those weird name clashes.)

untom commented 9 years ago

Just food for thought: if (when?) brainstorm is parallelized, there will likely be one buffer manager per GPU, so hiding it now might cause trouble in the future.

untom commented 9 years ago

So I think it would be better to leave buffer in there (and maybe let it become a list in future versions). To avoid name clashes, a couple of options come to mind.

Even though dict-style access is more typing, the advantage is that you can iterate over layer names:

interesting_layers = ['conv1', 'conv2', 'fc6']
for layername in interesting_layers:
    foo(net.buffer[layername]['gradients']['W'])

Which you can only do with a dict-like interface.

Qwlouse commented 9 years ago

The BufferViews in fact already support dotted access, dict-like access, and tuple unpacking.
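
For illustration, a minimal toy view supporting all three access styles might look like this (a sketch, not brainstorm's actual BufferView implementation):

    class SimpleBufferView(object):
        """Toy view supporting dotted access, dict-like access and tuple unpacking."""

        def __init__(self, names, buffers):
            self._names = list(names)
            self._buffers = list(buffers)
            for name, buf in zip(self._names, self._buffers):
                setattr(self, name, buf)       # dotted access: view.W

        def __getitem__(self, key):            # dict-like access: view['W']
            return self._buffers[self._names.index(key)]

        def __iter__(self):                    # tuple unpacking: W, bias = view
            return iter(self._buffers)

    # usage sketch
    view = SimpleBufferView(['W', 'bias'], [[1.0, 2.0], [0.5]])
    W, bias = view
    assert view.W is view['W'] is W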

Alright, I'll leave the buffer in there, and keep the BufferManager as net._buffer_manager, because I think you shouldn't access it from the outside anyway.

Qwlouse commented 9 years ago

Done