Networks using the Reshape layer fail to back-propagate when certain conditions are met.
Instead, the bottom layer's diff blob contains only zeros, regardless of the contents of the Reshape layer's diff blob.
This issue is not reproducible with a single Reshape layer alone. In the simplest reproducible combination, a Reshape layer followed by a Flatten layer triggers the erroneous behavior.
However, two functionally identical Reshape layers or two Flatten layers do not!
The bug also appears in more complex production networks with various layer combinations and prevents training. All cases observed by the reporter include at least one Reshape layer, although, depending on the underlying cause, this does not necessarily rule out cases without one.
Steps to reproduce
1. Create a network with a Reshape layer followed by a Flatten layer.
2. Make a loss back-propagate through both layers.
3. Observe the gradients of all involved layers.
Expected behavior:
After net.backward(), the values in the Reshape layer's diff blob and in its bottom layer's diff blob should be identical; only their shapes should differ.
Actual behavior:
The bottom layer's diff blob has the correct shape but contains only zeros, regardless of the values in the Reshape layer's diff blob.
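In pycaffe terms, the expectation amounts to the following check (a minimal sketch; net is assumed to be an already loaded caffe.Net with a Reshape layer named output sitting on top of a blob named fc, as in the demo below):

import numpy as np

net.forward()
net.backward()
# The Reshape layer's top diff and its bottom blob's diff should hold the
# same values, only viewed in different shapes.
assert np.array_equal(net.blobs['output'].diff.ravel(),
                      net.blobs['fc'].diff.ravel())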
Demo:
Attached Python script that generates a trivial network triggering the error:
trigger_bug.py.txt
#!/usr/bin/python
import numpy as np
import caffe
from caffe import layers as L, params as P


def gen_network(net_path, trigger_bug=True):
    n = caffe.NetSpec()
    n.label = L.DummyData(data_filler=dict(type="constant", value=0),
                          shape=[dict(dim=[1, 2])])
    n.data = L.DummyData(data_filler=dict(type="constant", value=1),
                         shape=[dict(dim=[1, 1])])
    n.fc = L.InnerProduct(n.data, num_output=2, bias_term=False, axis=1,
                          weight_filler=dict(type='constant', value=1))
    if trigger_bug:
        # Reshape followed by the Flatten layer below triggers the bug.
        n.output = L.Reshape(n.fc, shape=dict(dim=[1, -1]))
    else:
        # A functionally identical Flatten layer avoids it.
        n.output = L.Flatten(n.fc, axis=1)
    n.flat = L.Flatten(n.output, axis=1)
    n.loss = L.EuclideanLoss(n.flat, n.label)
    ns = str(n.to_proto())
    with open(net_path, 'w') as f:
        f.write(ns)


def test_net(net_path):
    print("Testing net: ", net_path)
    testnet = caffe.Net(net_path, caffe.TRAIN)
    testnet.forward()
    testnet.backward()
    print("gradient after Flatten layer: ", testnet.blobs['flat'].diff)
    print("gradient after output layer: ", testnet.blobs['output'].diff)
    print("gradient after FC layer: ", testnet.blobs['fc'].diff)
    if np.all(testnet.blobs['output'].diff == testnet.blobs['fc'].diff):
        print("Network computed correctly.")
    else:
        print("Backpropagation failed!")


net_path_buggy = '/tmp/buggy_net.prototxt'
net_path_notbuggy = '/tmp/notbuggy_net.prototxt'
gen_network(net_path_buggy, trigger_bug=True)
gen_network(net_path_notbuggy, trigger_bug=False)
test_net(net_path_notbuggy)
test_net(net_path_buggy)
Output (output.log):
('Testing net: ', '/tmp/notbuggy_net.prototxt')
('gradient after Flatten layer: ', array([[1., 1.]], dtype=float32))
('gradient after output layer: ', array([[1., 1.]], dtype=float32))
('gradient after FC layer: ', array([[1., 1.]], dtype=float32))
Network computed correctly.
('Testing net: ', '/tmp/buggy_net.prototxt')
('gradient after Flatten layer: ', array([[1., 1.]], dtype=float32))
('gradient after output layer: ', array([[1., 1.]], dtype=float32))
('gradient after FC layer: ', array([[0., 0.]], dtype=float32))
Backpropagation failed!
Notes:
The issue appears both when computing gradients explicitly via net.backward() and when computing them implicitly via solver.step(); a sketch of the solver-based reproduction follows below.
It makes no difference whether the Reshape layer's shape is given explicitly or partially self-computed using shape=[0,...,-1].
Replacing the flat layer with a functionally identical Reshape layer makes the issue disappear, as does replacing the Reshape layer with a second Flatten layer, as demonstrated above.
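To reproduce the solver.step() case mentioned in the first note, a minimal sketch along the following lines can be used; the solver file path and hyper-parameters are illustrative assumptions, not part of the attached demo:

import caffe

# Hypothetical minimal SGD solver definition pointing at the buggy net
# generated by the attached script.
solver_path = '/tmp/buggy_solver.prototxt'
with open(solver_path, 'w') as f:
    f.write('net: "/tmp/buggy_net.prototxt"\n'
            'base_lr: 0.01\n'
            'lr_policy: "fixed"\n')

solver = caffe.SGDSolver(solver_path)
solver.step(1)  # forward + backward + parameter update

# Same symptom as with net.backward(): the diff below the Reshape layer is all zeros.
print(solver.net.blobs['output'].diff)
print(solver.net.blobs['fc'].diff)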
Tried solutions
The bug is very obscure and hard to identify as the cause of a training failure in the first place, especially in deep networks. Once identified, some networks can be redesigned, for example by using a Flatten layer instead of a Reshape layer, to avoid triggering the issue.
In other cases, such as convolutional LSTMs, this workaround is NOT possible, since the specific shape required by the RNN layers cannot be produced with Flatten alone, as sketched below.
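To illustrate why Flatten cannot substitute for Reshape here: Caffe's recurrent layers expect their input shaped as T x N x ..., i.e. with a leading time axis. The fragment below is a hypothetical sketch (blob names and dimensions are made up), not taken from an actual network:

import caffe
from caffe import layers as L

T, N = 10, 4  # assumed number of time steps and batch size

n = caffe.NetSpec()
# Stand-in for convolutional features of shape (T*N) x C x H x W.
n.conv_features = L.DummyData(shape=[dict(dim=[T * N, 8, 3, 3])])
# Fold them into the T x N x (C*H*W) layout expected by the RNN/LSTM layers.
# Flatten cannot do this: it only collapses a contiguous range of axes and
# cannot split the first axis into separate time and batch dimensions.
n.rnn_input = L.Reshape(n.conv_features, shape=dict(dim=[T, N, -1]))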
System configuration
Operating system: most likely any. Issue reproduced with:
Ubuntu Linux 16.04 LTS, Kernel 4.15
Ubuntu Linux 18.04 LTS, Kernel 4.15
~CUDA version (if applicable)~: not applicable; the issue appears in both GPU and CPU-only mode (see the sketch after this section).
~CUDNN version (if applicable)~: None
BLAS: any. Issue reproduced with:
atlas
openblas
Python version (if using pycaffe): any. Issue reproduced with:
Python 2.7
Python 3.6
The issue is expected to also appear when Caffe is used from C++, since affected networks have failed to train with the command-line caffe executable, but this has not been tested in isolation yet.
~MATLAB version (if using matcaffe)~: Not used
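The CPU/GPU point above can be verified with the standard pycaffe mode switches before running the attached demo (test_net is the helper from the script; device id 0 is an assumption):

import caffe

# CPU-only run of the demo network.
caffe.set_mode_cpu()
test_net('/tmp/buggy_net.prototxt')

# GPU run (assumes a single GPU at device id 0).
caffe.set_mode_gpu()
caffe.set_device(0)
test_net('/tmp/buggy_net.prototxt')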
Issue checklist
[X] read the guidelines and removed the first paragraph
[X] written a short summary and detailed steps to reproduce
[X] explained how solutions to related problems failed (tick if found none)
[X] filled system configuration
[X] attached relevant logs/config files (tick if not applicable)
Issue summary
Networks using the Reshape layer fail to back-propagate when certain conditions are met. Instead the bottom layer's diff blob will include only zeros, regardless of the content of the Reshape layer's diff blob.
This issue is not reproducible when using a single Reshape layer alone. In the simplest reproducible combination, a Reshape layer followed by a Flatten layer will trigger the erroneous behavior.
However two functionally identical Reshape Layers or two Flatten layers will not!
The bug also does appear in more complex, production networks in different layer combinations and prevents training. All examples observed by the issue reporter include at least a Reshape layer, although depending on the underlying cause this does not necessarily rule out potential cases without.
Steps to reproduce
Expected behavior:
Actual behavior:
Demo:
Attached python script to generate a trivial network that triggers the error trigger_bug.py.txt
output.log
Notes:
Tried solutions
The bug is very obscure and hard to identify as a cause of training failure in the first place, especially in deep networks. If identified, some networks can be re-designed, for example using a Flatten layer instead of a Reshape layer in order to avoid triggering the issue. In other cases, such as Convolutional LSTMs this workaround is NOT possible, since the specific shape required for RNN layers is not achievable using Flatten alone.
System configuration
Issue checklist