What is the rationale behind using F(6x6, 3x3) for Winograd tile size?

Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs

BSD 2-Clause "Simplified" License


For a kernel of size KxK and a Winograd tile of size TxT, we have to do TxT multiply-adds per channel, but we get only (T-K+1)x(T-K+1) outputs. Thus, the larger the tile, the fewer multiply-adds we do per output. E.g. for F(6x6, 3x3) we do accumulations on 8x8 tiles (64 elements) and then transform them into 6x6 output tiles (36 elements). This reduces the efficiency of Winograd from the theoretical 1 multiply-add per output to 64 / 36 ≈ 1.78 multiply-adds per output. If NNPACK used F(2x2, 3x3) tiles, it would do 4x4 / 2x2 = 4 multiply-adds per output. Increasing the Winograd tile beyond 8x8 would save further computation, but the results would become drastically less accurate.
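The arithmetic above can be checked with a short Python sketch. The function name `multiply_adds_per_output` is hypothetical (it is not part of NNPACK), and the count covers only the element-wise accumulation stage per channel, ignoring the cost of the input, kernel, and output transforms.

```python
def multiply_adds_per_output(output_tile: int, kernel: int) -> float:
    """Multiply-adds per output element, per channel, for F(m x m, K x K).

    Hypothetical helper for illustration; transform costs are ignored.
    """
    # The Winograd tile size is T = m + K - 1, so m = T - K + 1 outputs per side.
    tile = output_tile + kernel - 1
    return (tile * tile) / (output_tile * output_tile)


if __name__ == "__main__":
    kernel = 3
    for m in (2, 4, 6):
        tile = m + kernel - 1
        ratio = multiply_adds_per_output(m, kernel)
        print(f"F({m}x{m}, 3x3): {tile}x{tile} tile, "
              f"{ratio:.2f} multiply-adds per output")
```

For comparison, a direct 3x3 convolution does KxK = 9 multiply-adds per output per channel, so even F(2x2, 3x3) is a saving by this measure, and F(6x6, 3x3) brings the count down further to about 1.78.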


What is the rationale behind using F(6x6, 3x3) for Winograd tile size? #122