andravin / wincnn

Winograd minimal convolution algorithm generator for convolutional neural networks.
Apache License 2.0
600 stars 145 forks

Is Max Pooling possible in Winograd domain? If not, why? #19

Closed aquibjamal closed 5 years ago

aquibjamal commented 5 years ago

Is there any reference where max pooling is being done in Winograd domain? If max pooling is not possible then what could be the reason?

andravin commented 5 years ago

Not sure Max Pooling is the best pooling option, but more generally, pooling in the spectral domain was examined in "Spectral Representations for Convolutional Neural Networks" http://papers.nips.cc/paper/5649-spectral-representations-for-convolutional-neural-networks

That paper was published while I was working on the "Fast Algorithms for Convolutional Neural Networks" paper, but I did not see it at the time.

I suspect there are interesting ways to combine the ideas in these two papers, so I look forward to reading about it in your research. :-)

andravin commented 5 years ago

Perhaps something closer to your line of thinking, the paper "Efficient Sparse-Winograd Convolutional Neural Networks" successfully applied the ReLU activation function in the Winograd domain. https://arxiv.org/abs/1802.06367

andravin commented 5 years ago

.. but a possible issue with pooling directly in the Winograd domain will be aliasing. The standard signal processing approach to downsampling is to suppress high frequency components before subsampling. The first paper I cited above takes this approach.

andravin commented 5 years ago

It is interesting to think about the relationship between Winograd transforms and average pooling.

The component of F(2x2,3x3) corresponding to the polynomial root '1' in both spatial dimensions is just the sum of the 2x2 center elements in the 4x4 input tile. There is a factor of 1/4 that is normally associated with the corresponding filter component, but it can be moved to the input component without changing the algorithm. This makes the input component equal to average pooling over a 2x2 tile with stride 2x2.

So truncating all of the F(2x2,3x3) input components except for that one is equivalent to average pooling.
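This identity is easy to check numerically (a sketch; `Bt` is the F(2,3) data transform from the Lavin & Gray paper, everything else is just numpy):

```python
import numpy as np

# Data transform B^T for F(2,3), from Lavin & Gray,
# "Fast Algorithms for Convolutional Neural Networks".
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # one 4x4 input tile

V = Bt @ d @ Bt.T                 # 2D Winograd input transform

# The component that sums the tile in both dimensions picks out
# exactly the sum of the center 2x2 elements.
assert np.allclose(V[1, 1], d[1:3, 1:3].sum())

# Folding in the 1/4 normally carried by the filter transform turns
# that component into 2x2 average pooling over the tile's center.
assert np.isclose(V[1, 1] / 4, d[1:3, 1:3].mean())
```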

The highest frequency component of F(2,3) corresponds to the polynomial root '-1'. It is the Nyquist frequency for the original signal and so must be truncated to avoid aliasing before subsampling.

The remaining 2 components of F(2,3) have a wavelength of 4 pixels (they are just reflected shifts of each other; even and odd subbands of the same filter?). Anyway, it would not seem necessary to truncate these before downsampling.

Note from the output transform that these 2 components affect different output pixels. So it seems you could just keep one of them depending on how you wanted the subsampled output to be shifted, or maybe you could average them to put the subsampled output in the middle.
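This can be read off directly from the F(2,3) inverse transform A^T (a quick sketch; the matrix is the standard one from Lavin & Gray, and components 0 and 3 are the two wavelength-4 components discussed above):

```python
import numpy as np

# F(2,3) inverse (output) transform A^T. Each column shows which
# Winograd components contribute to which of the two output pixels.
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

# The two wavelength-4 components each touch only one output pixel:
assert (At[:, 0] == np.array([1.0, 0.0])).all()   # -> output pixel 0 only
assert (At[:, 3] == np.array([0.0, -1.0])).all()  # -> output pixel 1 only
```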

So you could try just truncating the highest frequency Winograd components and weighting the rest to experiment with different low pass filters.

Hope this rambling essay was interesting to somebody. ;-) There are more connections to be made, and I will document them some day in a more formal setting.

andravin commented 5 years ago

Also, when we force the high frequency component of F(2,3) to equal zero, we can interpret that as forcing the corresponding filter component to equal zero.

Let's see how this affects the Winograd filter transform. First take the F(2,3) filter transform, and move the factors of 1/2 to the data transform:

G = 
⎡ 1    0    0 ⎤
⎢             ⎥
⎢ 1    1    1 ⎥
⎢             ⎥
⎢ 1   -1    1 ⎥
⎢             ⎥
⎣ 0    0    1 ⎦

The third row of the matrix corresponds to the high frequency component from the input transform. If the raw filter (in 1-D) equals (g1, g2, g3), then setting the 3rd filter component equal to zero requires g1 - g2 + g3 = 0, or g2 = g1 + g3.

Therefore we can rewrite the Winograd filter transform in terms of 2 variables, g1 and g3. The constrained Winograd transform is now:

G = 
⎡ 1    0 ⎤
⎢        ⎥
⎢ 2    2 ⎥
⎢        ⎥
⎣ 0    1 ⎦

For the 2D transform, it would be 2x2=4 variables (instead of 3x3=9).

Note 3-tap filters with g2 = g1 + g3 are low-pass filters.
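The constrained transform can be verified numerically (a sketch, assuming the rescaled G above with the factors of 1/2 moved to the data side):

```python
import numpy as np

# Rescaled F(2,3) filter transform (factors of 1/2 moved to the data
# transform), as given above.
G = np.array([[1,  0, 0],
              [1,  1, 1],
              [1, -1, 1],
              [0,  0, 1]], dtype=float)

# Constrained transform on the two free parameters (g1, g3).
G_constrained = np.array([[1, 0],
                          [2, 2],
                          [0, 1]], dtype=float)

g1, g3 = 0.7, -0.3
g = np.array([g1, g1 + g3, g3])   # enforce the low-pass constraint g2 = g1 + g3

full = G @ g                      # all 4 Winograd filter components
assert np.isclose(full[2], 0.0)   # the high-frequency component vanishes

# The remaining 3 components match the constrained 3x2 transform.
assert np.allclose(np.delete(full, 2), G_constrained @ np.array([g1, g3]))
```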

aquibjamal commented 5 years ago

Thank you for your valuable inputs! I would like to know if I can stay in the Winograd domain between multiple convolutional layers. As per my understanding, for convolutional layer 1 with input d and filter g, the output of this layer is Y=A'[[GgG']⊙[B'dB]]A. For the next layer we have to transform Y, transform the next layer's filters g_layer_2, perform element-wise multiplication, and take the inverse transform of the product. Instead of transforming Y, can we stay in the Winograd domain and just multiply [GgG']⊙[B'dB] with the transform of g_layer_2? Is it possible to perform convolutions in the Winograd domain for consecutive convolutional layers and take the inverse transform only after the last convolutional layer?
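For concreteness, the per-tile formula Y=A'[[GgG']⊙[B'dB]]A can be checked against direct convolution (a sketch using the F(2x2,3x3) matrices from Lavin & Gray; numpy only):

```python
import numpy as np

# F(2x2,3x3) transform matrices from Lavin & Gray.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # 4x4 input tile
g = rng.standard_normal((3, 3))   # 3x3 filter

# Y = A^T [ (G g G^T) .* (B^T d B) ] A  ->  2x2 output tile
Y = At @ ((G @ g @ G.T) * (Bt @ d @ Bt.T)) @ At.T

# Direct "valid" cross-correlation (what conv layers compute) for reference.
ref = np.array([[(d[i:i + 3, j:j + 3] * g).sum() for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref)
```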

andravin commented 5 years ago

Well, when using Winograd algorithms, we first break the input tensor into a number of small overlapping tiles. Then we use the algorithms to perform convolution on each of these tiles independently.

So if you stay in Winograd space and perform another set of matrix multiplies on the Winograd components, you are not getting any more overlap from neighboring tiles. If you continue like this, you are basically running the network on each tile independently without mixing information between tiles.

So it seems you would need to add an operation that mixes information from neighboring tiles.

This "mixing" operation would normally be the Winograd data transform of the next convolution. You could in principle fuse an inverse transform, activation function, and data transform together into one operation.

We also should notice that the data transform expands the overlapping input tiles, so that the number of Winograd components is greater than the input size. For F(2x2,3x3), the Winograd components are 4x as many as the input elements. F(4x4,3x3) expands the input by a factor of 2.25. The larger the tile, the less the expansion .. but the more memory you need to store a given number of tiles (i.e., smaller tiles yield more data locality).
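The expansion arithmetic works out as follows (a small sketch; `expansion` is a hypothetical helper name, and the per-tile input count uses the fact that output tiles do not overlap, so each contributes m x m fresh input elements):

```python
# Winograd data-transform expansion for F(m x m, 3x3):
# each output tile has a transformed tile of side (m + 3 - 1), while the
# overlapping input tiles contribute only m x m unique elements per tile.
def expansion(m, r=3):
    tile = m + r - 1                  # transformed tile side
    return tile * tile / (m * m)      # Winograd components per input element

assert expansion(2) == 4.0            # F(2x2,3x3): 16 components / 4 inputs
assert expansion(4) == 2.25           # F(4x4,3x3): 36 components / 16 inputs
```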

Anyway, hopefully I gave you a number of ideas to experiment with.

aquibjamal commented 5 years ago

Thank you for your comments !