Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs
BSD 2-Clause "Simplified" License

FFT16x16 kernels #129

Closed ghost closed 6 years ago

ghost commented 6 years ago

First of all, thanks for mentioning the nnpack-windows repo! I'm still trying to resolve the only issue left: getting the FFT16x16 kernels working under Windows as they do under Linux, OS X, etc. The first thing I did was check whether a scalar version works as expected under Windows. It passed all unit tests. Then I checked whether the ported AVX2 version behaves the same under Linux. With the kernels PeachPy generates under Linux, everything passed the unit tests. So basically, the object PeachPy emits from the 2d-fourier-16x16.py script under Windows doesn't behave as it does under Linux. Looking at the unit test results for the FFT16x16 kernels under Windows, I noticed that although most of the tests clearly fail, it was never by a very big margin. Then I disabled the use of denormals in the init.c file:

// Flush denormals to zero (the FTZ flag).
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
// Interpret denormal inputs as zero (the DAZ flag).
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
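To make it easier to check whether a buffer of kernel outputs actually contains denormals, here is a minimal Python sketch. It is only a model of the float32 format, not NNPACK or PeachPy code, and the helper names (`is_denormal_f32`, `flush_denormal_f32`, `count_denormals`) are made up for illustration. A float32 is denormal when its exponent field is zero and its mantissa is non-zero; flushing replaces such a value with a signed zero.

```python
import math
import struct

# Smallest positive normal float32 value (2^-126).
FLT_MIN_NORMAL = 2.0 ** -126

def to_float32_bits(x):
    """Round x to float32 and return its 32-bit pattern as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def is_denormal_f32(x):
    """True if x, rounded to float32, is a denormal (subnormal) value."""
    bits = to_float32_bits(x)
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    mantissa = bits & 0x7FFFFF       # 23-bit mantissa field
    return exponent == 0 and mantissa != 0

def flush_denormal_f32(x):
    """Model of flushing: denormals become a zero with the same sign."""
    return math.copysign(0.0, x) if is_denormal_f32(x) else x

def count_denormals(values):
    """Number of entries in values that are float32 denormals."""
    return sum(1 for v in values if is_denormal_f32(v))

# Hypothetical sample data: two normal values, two denormals, one zero.
samples = [1.0, FLT_MIN_NORMAL, FLT_MIN_NORMAL / 2, -FLT_MIN_NORMAL / 4, 0.0]
print(count_denormals(samples))
print([flush_denormal_f32(v) for v in samples])
```

Running something like `count_denormals` over the inputs and outputs of a failing test would show whether denormals are really involved, independently of the FTZ/DAZ flags.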

After recompiling I noticed that the FFT16x16 kernels still fail, but now mostly by a very large margin. So the same 2d-fourier-16x16.py script generates an object under Windows that uses and/or produces a lot of denormal float values. That is something the FFT8x8 or Winograd kernels don't have to worry about, and something that doesn't happen under Linux, OS X, ... Do you have any idea what's causing this strange behaviour? I rechecked the ported code many times and can't find any lingering bug or typo that could cause it. I tried with all compiler optimizations disabled and with all sorts of floating-point settings. I did find a small bit of redundant code in the fft16x16.py script:

6: from common import butterfly, sqrt2_over_2
9: from common import butterfly, sqrt2_over_2, cos_npi_over_8, interleave

I hardly know Python, but I'm fairly confident this doesn't mean the same functions are imported twice. So no bug here.
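A quick way to convince yourself of this is the sketch below. It uses the standard `math` module as a stand-in for the script's `common` module (an assumption, since `common` is only available inside the NNPACK source tree). A repeated `from ... import` with overlapping names just rebinds those names to the same objects; nothing is loaded or executed a second time.

```python
import math
import sys

# First import, analogous to line 6 of the script.
from math import sqrt, cos
first_sqrt, first_cos = sqrt, cos

# Second, overlapping import, analogous to line 9 of the script.
from math import sqrt, cos, sin, pi

# The names still refer to the very same function objects.
same_objects = sqrt is first_sqrt and cos is first_cos
# The module itself is cached in sys.modules and loaded only once.
same_module = sys.modules["math"] is math

print(same_objects, same_module)
```

So the duplicated import is redundant but harmless, which matches the conclusion above.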

ghost commented 6 years ago

The issue still exists, but I bypass the AVX2 FFT16x16 kernels under Windows with the corresponding psimd implementation. The speed is obviously a bit lower, but in the end it's better to have a functional FFT16x16 kernel.