Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs
BSD 2-Clause "Simplified" License
1.68k stars 317 forks source link

Winograd F(4x4, 5x5) #12

Open andravin opened 8 years ago

andravin commented 8 years ago

Here are transforms for Winograd F(4x4, 5x5). That means a 5x5 kernel with a 4x4 output tile. I imagine the code should be a relatively simple adaptation of F(6x6, 3x3), because both algorithms use 8x8 input tiles.

It uses 8x8/(4x4)=4 multiplies / output which gives it a max theoretical speedup of 5x5/4 = 6.25.

AT =
⎡1  1  1   1  1   8  8   0⎤
⎢                         ⎥
⎢0  1  -1  2  -2  4  -4  0⎥
⎢                         ⎥
⎢0  1  1   4  4   2  2   0⎥
⎢                         ⎥
⎣0  1  -1  8  -8  1  -1  1⎦

G =
⎡ 1      0     0      0      0  ⎤
⎢                               ⎥
⎢-2/9  -2/9   -2/9  -2/9   -2/9 ⎥
⎢                               ⎥
⎢-2/9   2/9   -2/9   2/9   -2/9 ⎥
⎢                               ⎥
⎢1/90  1/45   2/45  4/45   8/45 ⎥
⎢                               ⎥
⎢1/90  -1/45  2/45  -4/45  8/45 ⎥
⎢                               ⎥
⎢4/45  2/45   1/45  1/90   1/180⎥
⎢                               ⎥
⎢4/45  -2/45  1/45  -1/90  1/180⎥
⎢                               ⎥
⎣ 0      0     0      0      1  ⎦

BT =
⎡1   0    -21/4    0    21/4     0    -1  0⎤
⎢                                          ⎥
⎢0   1      1    -17/4  -17/4    1    1   0⎥
⎢                                          ⎥
⎢0   -1     1    17/4   -17/4   -1    1   0⎥
⎢                                          ⎥
⎢0  1/2    1/4   -5/2   -5/4     2    1   0⎥
⎢                                          ⎥
⎢0  -1/2   1/4    5/2   -5/4    -2    1   0⎥
⎢                                          ⎥
⎢0   2      4    -5/2    -5     1/2   1   0⎥
⎢                                          ⎥
⎢0   -2     4     5/2    -5    -1/2   1   0⎥
⎢                                          ⎥
⎣0   -1     0    21/4     0    -21/4  0   1⎦
andravin commented 8 years ago

I should note the theoretical speedup of the 16x16 tile FFT is much better with 5x5 kernels, but still the Winograd algorithm might be advantageous in situations where the smaller tile size helps with efficiency.