BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Critical Backpropagation Failure across Reshape layer in complex networks #6769

Open CorvusCorax opened 5 years ago

CorvusCorax commented 5 years ago

Issue summary

Networks using the Reshape layer fail to back-propagate when certain conditions are met: the bottom layer's diff blob is filled with only zeros, regardless of the content of the Reshape layer's own diff blob.

The issue is not reproducible with a single Reshape layer alone. The simplest reproducible combination is a Reshape layer followed by a Flatten layer, which triggers the erroneous behavior.

However, two functionally identical Reshape layers in sequence, or two Flatten layers, do not trigger it!

The bug also appears in more complex production networks, in various layer combinations, and prevents training. All examples observed by the issue reporter include at least one Reshape layer, although, depending on the underlying cause, this does not necessarily rule out cases without one.

Steps to reproduce

  1. Create a network with a Reshape layer followed by a Flatten layer.
  2. Back-propagate a loss through both layers.
  3. Observe the gradients of all involved layers (a condensed pycaffe sketch follows this list).
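
In pycaffe, steps 2 and 3 condense to the following sketch; net_path is assumed to point at a prototxt containing the Reshape followed by Flatten combination, and the blob names are the ones used in the full script below:

import caffe

net = caffe.Net(net_path, caffe.TRAIN)
net.forward()
net.backward()
# The gradient above the Reshape layer arrives intact ...
print(net.blobs['output'].diff)
# ... but below it the diff blob is all zeros when the bug triggers.
print(net.blobs['fc'].diff)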

Expected behavior:

The gradient passes through the Reshape and Flatten layers unchanged, so the diff blob below the Reshape layer matches the diff blob above it.

Actual behavior:

The diff blob below the Reshape layer contains only zeros; no gradient reaches the layers beneath it.

Demo:

The attached Python script trigger_bug.py.txt generates a trivial network that triggers the error:

#!/usr/bin/python
# Demo: a Reshape layer followed by a Flatten layer breaks backpropagation.

import numpy as np
import caffe
from caffe import layers as L


def gen_network(net_path, trigger_bug=True):
    # Write a trivial net: data -> fc -> (Reshape | Flatten) -> Flatten -> loss.
    n = caffe.NetSpec()

    # Constant dummy blobs: an all-zero label of shape [1, 2] and an
    # all-one input of shape [1, 1].
    n.label = L.DummyData(data_filler=dict(type="constant", value=0),
                          shape=[dict(dim=[1, 2])])
    n.data = L.DummyData(data_filler=dict(type="constant", value=1),
                         shape=[dict(dim=[1, 1])])

    # Fully connected layer with constant weights, so the expected
    # gradients are known in advance.
    n.fc = L.InnerProduct(n.data, num_output=2, bias_term=False, axis=1,
                          weight_filler=dict(type='constant', value=1))

    # Both branches are functionally identical: each leaves the [1, 2]
    # blob unchanged. Only the Reshape variant triggers the bug.
    if trigger_bug:
        n.output = L.Reshape(n.fc, shape=dict(dim=[1, -1]))
    else:
        n.output = L.Flatten(n.fc, axis=1)

    n.flat = L.Flatten(n.output, axis=1)
    n.loss = L.EuclideanLoss(n.flat, n.label)

    with open(net_path, 'w') as f:
        f.write(str(n.to_proto()))


def test_net(net_path):
    print("Testing net: ", net_path)
    testnet = caffe.Net(net_path, caffe.TRAIN)
    testnet.forward()
    testnet.backward()
    print("gradient after Flatten layer: ", testnet.blobs['flat'].diff)
    print("gradient after output  layer: ", testnet.blobs['output'].diff)
    print("gradient after FC      layer: ", testnet.blobs['fc'].diff)
    # Reshape/Flatten must pass the gradient through unchanged, so the
    # diffs above and below the 'output' layer have to match.
    if np.all(testnet.blobs['output'].diff == testnet.blobs['fc'].diff):
        print("Network computed correctly.")
    else:
        print("Backpropagation failed!")


net_path_buggy = '/tmp/buggy_net.prototxt'
net_path_notbuggy = '/tmp/notbuggy_net.prototxt'
gen_network(net_path_buggy, trigger_bug=True)
gen_network(net_path_notbuggy, trigger_bug=False)

test_net(net_path_notbuggy)
test_net(net_path_buggy)
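
This check is not part of the attached script, but if appended to it, the broken analytic gradient can be confirmed against a finite-difference estimate. The sketch below assumes the blob and layer names defined above and uses pycaffe's forward(start=...) argument to re-run the net from the Reshape layer onward after each perturbation:

def numeric_fc_gradient(net, eps=1e-3):
    # Central-difference estimate of d(loss)/d(fc), for comparison
    # with the analytic gradient stored in net.blobs['fc'].diff.
    grad = np.zeros_like(net.blobs['fc'].data)
    for i in range(net.blobs['fc'].data.size):
        orig = float(net.blobs['fc'].data.flat[i])
        net.blobs['fc'].data.flat[i] = orig + eps
        loss_plus = float(net.forward(start='output')['loss'])
        net.blobs['fc'].data.flat[i] = orig - eps
        loss_minus = float(net.forward(start='output')['loss'])
        net.blobs['fc'].data.flat[i] = orig  # restore original value
        grad.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

net = caffe.Net(net_path_buggy, caffe.TRAIN)
net.forward()
net.backward()
print("analytic: ", net.blobs['fc'].diff)      # all zeros in the buggy net
print("numeric:  ", numeric_fc_gradient(net))  # should be close to [1., 1.]

For this toy net the true gradient is [1., 1.], so the mismatch confirms that only the backward pass, not the forward pass, is broken.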

output.log:

('Testing net: ', '/tmp/notbuggy_net.prototxt')
('gradient after Flatten layer: ', array([[1., 1.]], dtype=float32))
('gradient after output  layer: ', array([[1., 1.]], dtype=float32))
('gradient after FC      layer: ', array([[1., 1.]], dtype=float32))
Network computed correctly.
('Testing net: ', '/tmp/buggy_net.prototxt')
('gradient after Flatten layer: ', array([[1., 1.]], dtype=float32))
('gradient after output  layer: ', array([[1., 1.]], dtype=float32))
('gradient after FC      layer: ', array([[0., 0.]], dtype=float32))
Backpropagation failed!

Notes:

Tried solutions

The bug is very obscure and hard to identify as the cause of a training failure in the first place, especially in deep networks. Once identified, some networks can be redesigned to avoid triggering it, for example by using a Flatten layer instead of a Reshape layer. In other cases, such as convolutional LSTMs, this workaround is NOT possible, since the specific shape required by RNN layers cannot be produced with Flatten alone (see the sketch below).
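
To illustrate why Flatten cannot substitute for Reshape in the recurrent case: caffe's recurrent layers expect a bottom blob of shape T x N x (...), which requires splitting the leading axis of the feature blob, while Flatten can only merge consecutive axes. A hypothetical NetSpec fragment (the names conv_features and clip_markers are illustrative, not from the demo script):

# Hypothetical: conv features of shape [T*N, C, H, W] must become
# [T, N, C*H*W] for the LSTM layer. Flatten could merge C, H and W,
# but only Reshape can split T*N into T and N.
T, N = 10, 4  # time steps and independent streams (illustrative values)
n.lstm_input = L.Reshape(n.conv_features, shape=dict(dim=[T, N, -1]))
n.lstm = L.LSTM(n.lstm_input, n.clip_markers,
                recurrent_param=dict(num_output=256))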

System configuration

Issue checklist

CorvusCorax commented 5 years ago

I forgot to mention: all caffe installations tested report PASSED after make runtest. The branch tested was master.