Implement Winograd optimization properly

bjin commented 6 years ago

I tried to implement the 3x3 Winograd convolution algorithm months ago, based on the scripts from the wincnn repo. In theory it could make a 3x3 convolution2d layer (in CNN models) up to 2.25 times faster, since the F(2x2, 3x3) variant needs 16 multiplications per 2x2 output tile instead of 36. It would mostly benefit ResNet-based models, but could also help other CNN models.
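For readers unfamiliar with the technique, here is a minimal sketch of F(2x2, 3x3) in plain Python. It is not the GLSL from the branch: the function names are hypothetical, and the transform matrices are the commonly cited ones for this tile size, which I'm assuming here rather than taking from the wincnn output. It only illustrates where the 36 / 16 = 2.25 figure comes from, for a single channel and a single 4x4 input tile.

```python
# Sketch of Winograd F(2x2, 3x3): one 4x4 input tile -> one 2x2 output tile.
# Names and layout are illustrative assumptions, not code from the branch.

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Commonly cited transform matrices for F(2, 3).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 kernel. Returns the 2x2 output tile."""
    U = matmul(matmul(G, g), transpose(G))     # kernel transform, can be precomputed
    V = matmul(matmul(BT, d), transpose(BT))   # input transform, additions only
    # Element-wise product: the only 16 "real" multiplications per tile.
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(AT, M), transpose(AT))  # output transform, additions only

def direct_2x2_3x3(d, g):
    """Reference: naive 3x3 correlation, 9 multiplications per output (36 total)."""
    return [[sum(d[i + u][j + v] * g[u][v] for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    print(winograd_2x2_3x3(tile, kernel))
    print(direct_2x2_3x3(tile, kernel))
```

Note that the saving is only in multiplications. The input and output transforms add extra additions and intermediate storage, so whether it ends up faster on a GPU depends on how those are scheduled.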

The initial result is not promising: it's slower than the naive implementation. The code can be found in the conv2d-slow branch. There are two approaches that I tried:

commit 9324c5a4e4c4712e6fd9c81606e63f8f1a90d75f uses 2x2 groups and mat4-vec4 multiplication, and is just slightly slower than the naive approach.
commit ea938726e3b8371fd5c38f3cefc20a36e5e39ca4 uses 4x4 groups and mat4-mat4 multiplication, and is about 2 times slower.

A proper implementation requires first finding where the overhead comes from. Some low-level primitives might also be required, probably from vendor-specific extensions.

kkkrackpot commented 6 years ago

Will it be something like RAVU? Sorry, just curious...

bjin commented 6 years ago

@fhlfibh It's just an attempt at a general speedup for all CNN super-resolution algorithms. I was playing with a CNN model back in September, but the model size is kind of large, so I started on this first. Now I don't have much time to work on it, so I hope someone can finish it, or make use of the existing code (if any).

bjin / mpv-prescalers

Implement Winograd optimization properly #27