Hello, I have been trying to use the MergeBroadcast and BiRNN layers and ran into some issues.

The BiRNN implementation does not expect the layer to be the first one in the network. A model that starts with a BiRNN fails during initialization with the following error:
```
File "<...>/neon/models/model.py", line 175, in fit
  self.initialize(dataset, cost)
File "<...>/neon/models/model.py", line 130, in initialize
  self.layers.allocate_deltas()
File "<...>/neon/layers/container.py", line 369, in allocate_deltas
  self.set_deltas(self.global_deltas)
File "<...>/neon/layers/container.py", line 245, in set_deltas
  l.set_deltas(global_deltas)
File "<...>/neon/layers/recurrent.py", line 1154, in set_deltas
  self.out_deltas_buffer_f_v = self.out_deltas_buffer_f.reshape(nin, -1)
AttributeError: 'NoneType' object has no attribute 'reshape'
```
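The failure mode can be illustrated with a toy stand-in (plain Python, not neon's actual classes; `ToyBiRNN` and its attributes are hypothetical names): for the first layer in a network there is no upstream layer to receive deltas, so the delta buffer is left as `None`, and `set_deltas` then calls `.reshape` on it.

```python
# Toy illustration of the traceback above (not neon code).
class ToyBiRNN:
    def __init__(self):
        # A first layer has no upstream consumer of its deltas,
        # so this buffer is never allocated.
        self.out_deltas_buffer_f = None

    def set_deltas(self, global_deltas):
        # Mirrors the failing call in neon/layers/recurrent.py:1154.
        return self.out_deltas_buffer_f.reshape(2, -1)

try:
    ToyBiRNN().set_deltas(None)
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'reshape'
```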
The BiRNN layer's allocation logic is wrong. More precisely, the `allocate` method may be called with a `shared_outputs` parameter containing a preallocated buffer for the outputs (of shape `out_shape`), but the method allocates its internal buffer (of shape `hidden_shape`) there instead, and that buffer is larger than the output. As a result, code that relies on preallocation for this layer fails.
A quick fix is to allocate the output buffer in the passed memory and the internal buffer in separate memory, but this doubles the memory footprint. It also requires copying into the output buffer in `fprop`.
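A sketch of that quick fix, using hypothetical toy classes rather than neon's API: the output buffer lives in the caller-provided shared memory, the larger internal buffer gets its own allocation, and `fprop` pays an extra copy from internals to outputs.

```python
import numpy as np

class ToyBiRNNQuickFix:
    """Toy sketch of the quick fix (hypothetical, not neon code)."""

    def __init__(self, out_shape, hidden_shape):
        self.out_shape = out_shape        # shape the consumer expects
        self.hidden_shape = hidden_shape  # larger internal shape

    def allocate(self, shared_outputs=None):
        # Respect the preallocated buffer when one is passed in.
        self.outputs = (shared_outputs if shared_outputs is not None
                        else np.zeros(self.out_shape))
        # Separate allocation for internals: this is the doubled footprint.
        self.internal = np.zeros(self.hidden_shape)

    def fprop(self, x):
        self.internal[:] = x
        # The extra copy the quick fix requires on every forward pass.
        self.outputs[:] = self.internal[:self.out_shape[0]]
        return self.outputs

shared = np.zeros((4, 8))
layer = ToyBiRNNQuickFix(out_shape=(4, 8), hidden_shape=(6, 8))
layer.allocate(shared_outputs=shared)
layer.fprop(np.ones((6, 8)))
print(layer.outputs is shared)  # True: outputs land in the shared buffer
```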
The `Sequential` container makes an extra allocation call / `MergeBroadcast` integration.
In the `allocate` method of the `Sequential` container, an extra `allocate` call can occur for the last layer that owns its outputs (see the else branches): the code calls `allocate` for this last layer and then again for all layers. Moreover, the second call does not pass the shared-memory parameter that may have been given to the container. Even in simple cases this can lead to increased memory consumption, leaks, and the last layer's output allocation being overridden (one might argue those lines prevent this, but that argument is shaky because the method can be overridden).
Now consider the `MergeBroadcast` class, which consists of a list of `Sequential` containers and a shared output buffer. The issue above causes the last layer to override that output buffer, ignoring the shared one. As a result, the broadcast branch outputs are never written to the shared buffer and are therefore not merged. This becomes a problem when a BiRNN layer is the last output-owning layer in a branch: BiRNN overrides the allocation method, so the repeated allocation happens, and the second allocation is unaware of the shared memory.
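The double allocation can be modeled with a few toy classes (hypothetical names, not neon code): the container passes the shared buffer to the last output-owning layer, then calls `allocate()` on every layer again without it, and the second call silently replaces the shared buffer with a private one.

```python
# Toy model of the double allocation described above (not neon code).
class ToyLayer:
    def allocate(self, shared_outputs=None):
        self.outputs = shared_outputs if shared_outputs is not None else [0.0]

class ToySequential:
    def __init__(self, layers):
        self.layers = layers

    def allocate(self, shared_outputs=None):
        # First call: the last output-owning layer gets the shared buffer.
        self.layers[-1].allocate(shared_outputs)
        # Second call over all layers omits shared_outputs, so the last
        # layer re-allocates a private buffer and drops the shared one.
        for layer in self.layers:
            layer.allocate()

shared = [0.0]
seq = ToySequential([ToyLayer()])
seq.allocate(shared_outputs=shared)
print(seq.layers[-1].outputs is shared)  # False: merging into shared fails
```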
`MergeBroadcast` and `BiRNN` interaction during `bprop`.
Consider the following model:
Trying to run this model produces an error indicating that the shape of the propagated error is wrong:
```
File "<...>/neon/layers/container.py", line 920, in bprop
  self.deltas, self.out_shape, alpha, beta, self.alphas, self.betas)
File "<...>/neon/backends/nervanagpu.py", line 3248, in bprop_mergebroadcast
  l.bprop(e, alpha=a * alpha, beta=b)
File "<...>/neon/layers/container.py", line 427, in bprop
  error = l.bprop(error)
File "<...>/neon/layers/recurrent.py", line 1336, in bprop
  self.activation, True)
File "<...>/neon/backends/nervanagpu.py", line 2831, in compound_rnn_unroll_bprop
  in_deltas[:] = activation.bprop(hs) * in_deltas
File "<...>/neon/backends/nervanagpu.py", line 190, in __setitem__
  self.__getitem__(index)._assign(value)
File "<...>/neon/backends/nervanagpu.py", line 373, in _assign
  OpTreeNode.build("assign", self, value)
File "<...>/neon/backends/backend.py", line 1843, in build
  return node.execute()
File "<...>/neon/backends/backend.py", line 1864, in execute
  return backend.execute(self)
File "<...>/neon/backends/nervanagpu.py", line 1269, in execute
  return call_compound_kernel(self._get_rand_state_dev(), self.compute_capability, *stack)
File "<...>/neon/backends/float_ew.py", line 823, in call_compound_kernel
  "Input shape:%s not compatible" % (shape,))
TypeError: Input shape:[2, 128] not compatible
```
I believe this happens because the shape of the propagated error does not match what the BiRNN `bprop` method expects. Adding a `Reshape` layer, which in a correct implementation should restore and propagate back its input shape, resolved the issue. Nevertheless, I think the `MergeBroadcast` layer itself should restore the output shapes of its branches during the backward pass.
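The behavior the workaround relies on can be sketched with a toy version of such a layer (`ToyReshape` is a hypothetical class, not neon's `Reshape`): it records the input shape on the forward pass and restores it for the error on the backward pass, so the BiRNN behind it receives deltas in the shape it produced even if a container flattened them.

```python
import numpy as np

class ToyReshape:
    """Toy sketch of a shape-restoring reshape layer (not neon code)."""

    def __init__(self, out_shape):
        self.out_shape = out_shape

    def fprop(self, x):
        self.in_shape = x.shape           # remember the upstream shape
        return x.reshape(self.out_shape)

    def bprop(self, error):
        # Restore the upstream layer's output shape for backprop.
        return error.reshape(self.in_shape)

reshape = ToyReshape(out_shape=(2, 128))
reshape.fprop(np.zeros((256, 1)))
delta = reshape.bprop(np.ones((2, 128)))
print(delta.shape)  # (256, 1)
```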
Environment: Python 3.5.2, neon 2.6.0 (f9d771b), CUDA 8.0, K40s GPU.