Here are transforms for Winograd F(4x4, 5x5). That means a 5x5 kernel with a 4x4 output tile. I imagine the code should be a relatively simple adaptation of F(6x6, 3x3), because both algorithms use 8x8 input tiles (4+5-1 = 6+3-1 = 8).
It uses 8x8/(4x4) = 4 multiplies per output, versus 5x5 = 25 for direct convolution, which gives it a max theoretical speedup of 25/4 = 6.25.
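To double-check the arithmetic above, here is a small sketch that computes the input tile size, multiplies per output, and maximum theoretical speedup for a 2D F(m x m, r x r). The helper name `winograd_stats` is just something I made up for illustration, and it only counts the element-wise multiplies, ignoring the cost of the transforms themselves.

```python
from fractions import Fraction


def winograd_stats(m, r):
    """Stats for 2D Winograd F(m x m, r x r).

    Hypothetical helper: counts only the element-wise multiplies,
    not the input/filter/output transform cost.
    """
    tile = m + r - 1                             # input tile edge length
    mults_per_output = Fraction(tile * tile, m * m)
    direct_mults_per_output = r * r              # direct convolution
    speedup = direct_mults_per_output / mults_per_output
    return tile, mults_per_output, speedup


for m, r in [(4, 5), (6, 3)]:
    tile, mults, speedup = winograd_stats(m, r)
    print(f"F({m}x{m}, {r}x{r}): {tile}x{tile} input tile, "
          f"{float(mults):.4g} multiplies/output, "
          f"max speedup {float(speedup):.4g}")
```

Both cases come out with an 8x8 input tile, which is why I'd expect the F(6x6, 3x3) code to carry over fairly directly.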
I should note the theoretical speedup of the 16x16 tile FFT is much better with 5x5 kernels, but still the Winograd algorithm might be advantageous in situations where the smaller tile size helps with efficiency.