Networks using the Reshape layer fail to back-propagate when certain conditions are met.
Instead, the bottom layer's diff blob contains only zeros, regardless of the contents of the Reshape layer's diff blob.
This issue is not reproducible with a single Reshape layer alone. In the simplest reproducible combination, a Reshape layer followed by a Flatten layer triggers the erroneous behavior.
However, two functionally identical Reshape layers or two Flatten layers do not!
The bug also appears in more complex production networks with various layer combinations and prevents training. All cases observed by the reporter include at least one Reshape layer, although, depending on the underlying cause, this does not necessarily rule out cases without one.
Steps to reproduce
1. Create a network with a Reshape layer followed by a Flatten layer.
2. Make a loss back-propagate through both layers.
3. Observe the gradients of all involved layers.
Expected behavior:
After net.backward(), the values in the Reshape layer's diff blob and in its bottom layer's diff blob should be identical; only their shapes should differ.
Actual behavior:
The bottom layer's diff blob has the correct shape but contains only zeros, regardless of the values in the Reshape layer's diff blob.
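In pycaffe terms, the expectation amounts to the following check (a minimal sketch; net is assumed to be an already loaded caffe.Net with a Reshape layer named output sitting on top of a blob named fc, as in the demo below):

import numpy as np

net.forward()
net.backward()
# The Reshape layer's top diff and its bottom blob's diff should hold the
# same values, only viewed in different shapes.
assert np.array_equal(net.blobs['output'].diff.ravel(),
                      net.blobs['fc'].diff.ravel())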
Demo:
Attached Python script that generates a trivial network triggering the error:
trigger_bug.py.txt
#!/usr/bin/python
import numpy as np
import caffe
from caffe import layers as L, params as P


def gen_network(net_path, trigger_bug=True):
    n = caffe.NetSpec()
    n.label = L.DummyData(data_filler=dict(type="constant", value=0),
                          shape=[dict(dim=[1, 2])])
    n.data = L.DummyData(data_filler=dict(type="constant", value=1),
                         shape=[dict(dim=[1, 1])])
    n.fc = L.InnerProduct(n.data, num_output=2, bias_term=False, axis=1,
                          weight_filler=dict(type='constant', value=1))
    if trigger_bug:
        # Reshape followed by the Flatten layer below triggers the bug.
        n.output = L.Reshape(n.fc, shape=dict(dim=[1, -1]))
    else:
        # A functionally identical Flatten layer avoids it.
        n.output = L.Flatten(n.fc, axis=1)
    n.flat = L.Flatten(n.output, axis=1)
    n.loss = L.EuclideanLoss(n.flat, n.label)
    ns = str(n.to_proto())
    with open(net_path, 'w') as f:
        f.write(ns)


def test_net(net_path):
    print("Testing net: ", net_path)
    testnet = caffe.Net(net_path, caffe.TRAIN)
    testnet.forward()
    testnet.backward()
    print("gradient after Flatten layer: ", testnet.blobs['flat'].diff)
    print("gradient after output layer: ", testnet.blobs['output'].diff)
    print("gradient after FC layer: ", testnet.blobs['fc'].diff)
    if np.all(testnet.blobs['output'].diff == testnet.blobs['fc'].diff):
        print("Network computed correctly.")
    else:
        print("Backpropagation failed!")


net_path_buggy = '/tmp/buggy_net.prototxt'
net_path_notbuggy = '/tmp/notbuggy_net.prototxt'
gen_network(net_path_buggy, trigger_bug=True)
gen_network(net_path_notbuggy, trigger_bug=False)
test_net(net_path_notbuggy)
test_net(net_path_buggy)
Output (output.log):
('Testing net: ', '/tmp/notbuggy_net.prototxt')
('gradient after Flatten layer: ', array([[1., 1.]], dtype=float32))
('gradient after output layer: ', array([[1., 1.]], dtype=float32))
('gradient after FC layer: ', array([[1., 1.]], dtype=float32))
Network computed correctly.
('Testing net: ', '/tmp/buggy_net.prototxt')
('gradient after Flatten layer: ', array([[1., 1.]], dtype=float32))
('gradient after output layer: ', array([[1., 1.]], dtype=float32))
('gradient after FC layer: ', array([[0., 0.]], dtype=float32))
Backpropagation failed!
Notes:
The issue appears both when computing gradients explicitly via net.backward() and when computing them implicitly via solver.step(); a sketch of the solver-based reproduction follows below.
It makes no difference whether the Reshape layer's shape is given explicitly or partially self-computed using shape=[0,...,-1].
Replacing the flat layer with a functionally identical Reshape layer makes the issue disappear, as does replacing the Reshape layer with a second Flatten layer, as demonstrated above.
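To reproduce the solver.step() case mentioned in the first note, a minimal sketch along the following lines can be used; the solver file path and hyper-parameters are illustrative assumptions, not part of the attached demo:

import caffe

# Hypothetical minimal SGD solver definition pointing at the buggy net
# generated by the attached script.
solver_path = '/tmp/buggy_solver.prototxt'
with open(solver_path, 'w') as f:
    f.write('net: "/tmp/buggy_net.prototxt"\n'
            'base_lr: 0.01\n'
            'lr_policy: "fixed"\n')

solver = caffe.SGDSolver(solver_path)
solver.step(1)  # forward + backward + parameter update

# Same symptom as with net.backward(): the diff below the Reshape layer is all zeros.
print(solver.net.blobs['output'].diff)
print(solver.net.blobs['fc'].diff)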
Tried solutions
The bug is very obscure and hard to identify as the cause of a training failure in the first place, especially in deep networks. Once identified, some networks can be redesigned, for example by using a Flatten layer instead of a Reshape layer, to avoid triggering the issue.
In other cases, such as convolutional LSTMs, this workaround is NOT possible, since the specific shape required by the RNN layers cannot be produced with Flatten alone, as sketched below.
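To illustrate why Flatten cannot substitute for Reshape here: Caffe's recurrent layers expect their input shaped as T x N x ..., i.e. with a leading time axis. The fragment below is a hypothetical sketch (blob names and dimensions are made up), not taken from an actual network:

import caffe
from caffe import layers as L

T, N = 10, 4  # assumed number of time steps and batch size

n = caffe.NetSpec()
# Stand-in for convolutional features of shape (T*N) x C x H x W.
n.conv_features = L.DummyData(shape=[dict(dim=[T * N, 8, 3, 3])])
# Fold them into the T x N x (C*H*W) layout expected by the RNN/LSTM layers.
# Flatten cannot do this: it only collapses a contiguous range of axes and
# cannot split the first axis into separate time and batch dimensions.
n.rnn_input = L.Reshape(n.conv_features, shape=dict(dim=[T, N, -1]))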
System configuration
Operating system: most likely any. Issue reproduced with:
Ubuntu Linux 16.04 LTS, Kernel 4.15
Ubuntu Linux 18.04 LTS, Kernel 4.15
~CUDA version (if applicable)~: not applicable; the issue appears in both GPU and CPU-only mode (see the sketch after this section).
~CUDNN version (if applicable)~: None
BLAS: any. Issue reproduced with:
atlas
openblas
Python version (if using pycaffe): any. Issue reproduced with:
Python 2.7
Python 3.6
The issue is expected to also appear when Caffe is used from C++, since affected networks have failed to train with the command-line caffe executable, but this has not been tested in isolation yet.
~MATLAB version (if using matcaffe)~: Not used
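The CPU/GPU point above can be verified with the standard pycaffe mode switches before running the attached demo (test_net is the helper from the script; device id 0 is an assumption):

import caffe

# CPU-only run of the demo network.
caffe.set_mode_cpu()
test_net('/tmp/buggy_net.prototxt')

# GPU run (assumes a single GPU at device id 0).
caffe.set_mode_gpu()
caffe.set_device(0)
test_net('/tmp/buggy_net.prototxt')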
Issue checklist
[X] read the guidelines and removed the first paragraph
[X] written a short summary and detailed steps to reproduce
[X] explained how solutions to related problems failed (tick if found none)
[X] filled system configuration
[X] attached relevant logs/config files (tick if not applicable)
Issue summary
Networks using the Reshape layer fail to back-propagate when certain conditions are met. Instead the bottom layer's diff blob will include only zeros, regardless of the content of the Reshape layer's diff blob.
This issue is not reproducible when using a single Reshape layer alone. In the simplest reproducible combination, a Reshape layer followed by a Flatten layer will trigger the erroneous behavior.
However two functionally identical Reshape Layers or two Flatten layers will not!
The bug also does appear in more complex, production networks in different layer combinations and prevents training. All examples observed by the issue reporter include at least a Reshape layer, although depending on the underlying cause this does not necessarily rule out potential cases without.
Steps to reproduce
Expected behavior:
Actual behavior:
Demo:
Attached python script to generate a trivial network that triggers the error trigger_bug.py.txt
output.log
Notes:
Tried solutions
The bug is very obscure and hard to identify as a cause of training failure in the first place, especially in deep networks. If identified, some networks can be re-designed, for example using a Flatten layer instead of a Reshape layer in order to avoid triggering the issue. In other cases, such as Convolutional LSTMs this workaround is NOT possible, since the specific shape required for RNN layers is not achievable using Flatten alone.
System configuration
Issue checklist