Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs
BSD 2-Clause "Simplified" License

Help needed #113

Closed ghost closed 7 years ago

ghost commented 7 years ago

As you know, I'm trying to get NNPACK working on Windows. There are still a few bugs left. I'm getting a strange access-violation-on-write exception in the compute_kernel_transform function in convolution_inference. The problem is I don't see any obvious error in the code. The convolution_output code works nicely, so I must have made some mistake in translating the kernel linear address.

```
for (size_t output_channels_subblock_offset = 0ull; output_channels_subblock_offset < output_channels_subblock_size; output_channels_subblock_offset++)
{
    const size_t output_channel = output_channels_subblock_start + output_channels_subblock_offset;
    transform_function(
        /* source: kernel of this output channel, at the current input-channel block */
        kernel + (output_channel * input_channels * kernel_size.width * kernel_size.height) + (input_channels_block_offset * kernel_size.width * kernel_size.height),
        /* destination: interleaved slot in the transformed-kernel buffer */
        kernel_transform + (output_channels_subblock_start * input_channels_block_size + input_channels_block_offset * output_channels_subblock_size + output_channels_subblock_offset) * tuple_size,
        kernel_size.width,
        input_channels_block_size * output_channels * tuple_size,
        uint32_t(kernel_size.height),
        uint32_t(kernel_size.width),
        0u,
        0u);
}
```
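For context, this indexing assumes the kernel tensor is stored contiguously as [output_channels][input_channels][kernel_height][kernel_width]; written out as a helper (my naming, not NNPACK's), the element offset of one kernel slice would be:

```c
#include <stddef.h>

/* Hypothetical helper, just to spell out the layout assumption above:
 * offset (in elements) of the (output_channel, input_channel) kernel slice
 * in a contiguous [O][I][H][W] tensor. */
static inline size_t kernel_slice_offset(
    size_t output_channel, size_t input_channel,
    size_t input_channels, size_t kernel_height, size_t kernel_width)
{
    return (output_channel * input_channels + input_channel) * kernel_height * kernel_width;
}
```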

Thanks, Filip

Maratyszcza commented 7 years ago

Your code looks correct to me. You may compile NNPACK with the scalar backend (which is 100% C code) to check whether the problem is in the PeachPy-generated micro-kernels.
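If your checkout's configure.py exposes a backend option (worth confirming with `python ./configure.py --help`; the flag below is an assumption, not something stated here), the scalar build would look roughly like:

```sh
# Sketch only: --backend is assumed to be accepted by this configure.py.
confu setup
python ./configure.py --backend=scalar
ninja
ninja smoketest
```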

ghost commented 7 years ago

I remember clearly that I did a checkout of NNPACK on a fresh install of Ubuntu 17.10, and additionally installed only cmake and g++. The smoketest unit tests passed, but the more thorough tests failed. Then I installed the latest LLVM 5.0, did a rebuild, and all unit tests were successful. I'm not at home on Linux and don't really understand the cause of this odd behaviour. I still get the same exception when compiling with the latest Intel compiler on Windows, even with all threading disabled. I also switched to a precise floating-point model in the compiler instead of the fast one, because otherwise one of the Fourier unit tests fails. If I implement the scalar backend on Windows correctly, I doubt it will show this behaviour, given that the scalar backend has already proven itself on Linux. But I will gladly implement it to find out.

ghost commented 7 years ago

Hi,

I've also ported the scalar back-end now. All scalar unit tests are passing except the ft8x8 and ft16x16 kernels in the convolution part. They produce results that sometimes come close to passing, except in convolution_inference, where only the gemm and conv1x1 implementations run without causing heap corruption. These results are puzzling to me because I assume this scalar code passes cleanly under Linux, and the scalar backend obviously has nothing to do with the PeachPy-generated kernels. I'm out of ideas about the cause. I'm wondering whether size_t under gcc on Linux is also a 64-bit unsigned integer, as it is on Windows. Concerning the heap corruption in convolution_inference, I've noticed that the fxdiv functionality is the only thing I didn't unit-test myself; all the other code in convolution_inference I've checked numerous times.
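To double-check my own assumption about size_t: on x86-64 it should be a 64-bit unsigned type under both gcc/Linux (LP64) and MSVC/Windows (LLP64); a quick check that builds the same way on both:

```c
#include <stdio.h>
#include <stddef.h>

int main(void) {
    /* Expected output on both x86-64 Linux and x86-64 Windows: 8 and 1. */
    printf("sizeof(size_t) = %zu\n", sizeof(size_t));
    printf("size_t is unsigned: %d\n", (size_t)-1 > 0);
    return 0;
}
```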

Maratyszcza commented 7 years ago

FXdiv might indeed be the culprit. It has several MSVC-specific code paths, and they were only lightly tested. Could you build and run the FXdiv unit tests on Windows? FXdiv now has CMake configs, so it should be relatively easy: `mkdir build && cd build && cmake .. && msbuild`
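Spelled out step by step (the Visual Studio generator name depends on your toolchain, and whether the tests are registered with CTest is an assumption worth checking):

```sh
mkdir build && cd build
cmake .. -G "Visual Studio 15 2017 Win64"   # pick the generator matching your toolchain
cmake --build . --config Release            # drives msbuild under the hood
ctest -C Release --output-on-failure        # assumes the unit tests are registered with CTest
```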

ghost commented 7 years ago

I just ran all the unit tests of the 64-bit Windows build of FXdiv. All tests pass; nothing wrong with FXdiv on MSVC. I've made a public repo of the NNPACK scalar Windows port, if that could be of any help.

ghost commented 7 years ago

Good news! I've found the cause of the heap corruption in convolution_inference: MSVC doesn't allow pointer arithmetic on void pointers. I originally made those pointers of the float type without dealing with the difference between float and void pointer arithmetic. Everything passes the unit tests now except all FT16x16 code paths in the convolution part. Do you have any idea why the FT16x16 kernels don't pass any unit tests?
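For reference, gcc accepts arithmetic on void* as an extension (it behaves like char*), while MSVC rejects it; the portable form casts to a byte-sized pointer first. A simplified illustration, not the actual NNPACK code:

```c
#include <stddef.h>

/* gcc extension: `p + n` on a void* compiles and advances by n bytes.
 * MSVC rejects it, so cast to a byte-sized pointer type before the addition. */
static void* advance_bytes(void* p, size_t n) {
    return (char*)p + n;
}
```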

Thanks

Maratyszcza commented 7 years ago

FT16x16 are almost the same as FT8x8. All code paths except microkernels are the same.

ghost commented 7 years ago

I did some reading in Agner Fog's manual: http://www.agner.org/optimize/objconv-instructions.pdf

> The calling conventions in 64-bit Windows and 64-bit Unix systems are different. Linux functions may modify registers RSI, RDI and XMM6 - XMM15, which must be preserved by Windows functions. You need a call stub to fix this incompatibility. 64-bit Unix systems allow functions to use a "red zone" of 128 bytes above the stack for local storage. Windows does not specify a red zone. This could produce extremely rare and irreproducible errors.

I don't know whether PeachPy handles these kinds of things when used with -mabi=ms -mimage-format=ms-coff. Sadly, I can't read and understand assembly code when debugging; a bit too low-level for me.
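For reference, those are the flags passed on the PeachPy command line when a kernel is translated into an object file; roughly (the file names here are placeholders):

```sh
# Placeholder file names; the ABI/image-format flags are the ones discussed above.
python -m peachpy.x86_64 -mabi=ms -mimage-format=ms-coff -o fft16x16.obj fft16x16.py
```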

Maratyszcza commented 7 years ago

-mabi=ms makes PeachPy use the Microsoft calling convention, so unless there is a bug, it shouldn't be a problem. NNPACK includes unit tests for the FFT functions; did you try to build them?

ghost commented 7 years ago

Yes, the Fourier unit tests all pass when using a precise floating-point model. With the fast floating-point model everything passes except FFT16_WITHIN_ROWS.match_reference, which fails with an error of 6.6e-08 against a limit of 1e-08.
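That kind of gap is plausible under a fast floating-point model, which lets the compiler reassociate and contract floating-point operations; a tiny illustration (not NNPACK code) of how reassociation alone changes rounding:

```c
#include <stdio.h>

int main(void) {
    /* Mathematically equal, numerically different: a fast floating-point
     * model may rewrite one ordering into the other. */
    float a = 1e8f, b = -1e8f, c = 1e-3f;
    printf("(a + b) + c = %g\n", (a + b) + c);  /* 0.001                   */
    printf("a + (b + c) = %g\n", a + (b + c));  /* 0: c is absorbed by b   */
    return 0;
}
```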

ghost commented 7 years ago

I doubt this is an issue with the PeachPy-generated kernels, because the same "precision degradation" also happens in the scalar port of NNPACK. nnp_convolution_input_gradient doesn't give good results on any code path, neither with the scalar backend nor with the PeachPy kernels.

Maratyszcza commented 7 years ago

Maybe an error in setting the transformed block size (e.g. a missed break statement in a switch)? nnp_convolution_output uses all the same low-level functions as nnp_convolution_inference.
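To make the suggestion concrete, this is the kind of bug meant here (hypothetical code, not from NNPACK):

```c
/* A missing `break` lets the 8x8 case fall through, so the 8x8 transform
 * would silently be sized as 16x16. */
enum transform { TRANSFORM_FFT8X8, TRANSFORM_FFT16X16 };

static unsigned transform_tile_size(enum transform t) {
    unsigned size = 0;
    switch (t) {
        case TRANSFORM_FFT8X8:
            size = 8;       /* missing `break` here...                  */
        case TRANSFORM_FFT16X16:
            size = 16;      /* ...makes TRANSFORM_FFT8X8 report 16 too. */
            break;
    }
    return size;
}
```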

ghost commented 7 years ago

I've made some progress. The ported scalar backend passes all unit tests except the convolution_input_gradient ones. I rechecked the source carefully but found no bug. The same happens in the avx2 backend. I'm trying to get the ported code to build on Linux to see whether the same errors pop up there as well, but after modifying a couple of the scripts I still get a compilation error on Linux. It would be nice to have NNPACK working under Windows too.

ghost commented 7 years ago

Some more progress. It's now possible to do a checkout of the port on Linux and run the unit tests:

```sh
git clone https://github.com/zeno40/nnpack-windows.git
cd nnpack-windows
confu setup
python ./configure.py
ninja
ninja smoketest
```

ghost commented 7 years ago

The issue with convolution_input_gradient is resolved. The cause was a typo I had made in the reference code. Now everything works except the FFT16 PeachPy kernels on Windows, for some unknown reason.

Maratyszcza commented 7 years ago

@zeno40 What was the typo? I'll fix it upstream

ghost commented 7 years ago

There is no typo in the original NNPACK reference code. I just made a mistake in the ported version.

ghost commented 7 years ago

It would be nice if NNPACK could support Windows natively, without having to use a port. I don't know whether performance is affected in the ported code versus the original NNPACK on Linux.

Maratyszcza commented 7 years ago

It may be possible. How is C99 support in Visual Studio these days?

ghost commented 7 years ago

There are a few incompatibilities in your code that prevent building straight on Windows:

- pointer arithmetic on void* (not allowed by MSVC)
- designated struct initializers
- hexadecimal floating-point literals

That's about it, I guess. Nothing that's really awkward or completely undoable. It would give all the libraries that make use of NNPACK a welcome acceleration under Windows, which is one more good reason for using NNPACK. Feel free to use the ported code for copy-paste or as an example!

Maratyszcza commented 7 years ago

I could get rid of the void* arithmetic, but designated initializers and hex-float literals are show-stoppers. Manually partitioning memory chunks also has to be supported on Windows; without it, it would be impossible to implement the memory allocator.
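For reference, this is what a designated initializer looks like in C99; the struct here is a made-up example, not an NNPACK type:

```c
#include <stddef.h>

struct conv_options {
    size_t padding;
    size_t stride;
};

/* C99 designated initializers: fields are set by name, the rest are zero-initialized. */
static const struct conv_options options = { .padding = 1, .stride = 2 };
```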

ghost commented 7 years ago

I want to let you know that I was mistaken about designated struct initializers not being supported by MSVC: they are fully supported, at least in VS2017. You also pointed out, correctly, that allocating a chunk of memory and partitioning it freely is mandatory; that is of course fully supported on Windows as well, as you probably already know. The only problems left are a bunch of hexadecimal floating-point literals and the FFT16x16 kernels not passing any unit tests.
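For anyone unfamiliar with them, a hexadecimal floating-point literal spells out the mantissa in hex and the binary exponent after `p`; a tiny example (not taken from NNPACK):

```c
#include <stdio.h>

int main(void) {
    /* 0x1.8p1 means 1.5 * 2^1 = 3.0; the hex form pins down the exact bits. */
    const double x = 0x1.8p1;
    printf("%f\n", x);  /* prints 3.000000 */
    return 0;
}
```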

ghost commented 7 years ago

In the latest MSVC update, hexadecimal floating-point literals are now also supported!