Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs
BSD 2-Clause "Simplified" License
1.67k stars 317 forks source link

Issue with nnp_conv1x1_upto_2x4__scalar #130

Closed kikoland closed 6 years ago

kikoland commented 6 years ago

Hi and thanks for your great library.

I've been playing with Pixelwise (1x1) convolutions.

It seems there is an issue when block size (for output) are not multiple of 4. This happens for example when building logits at the end of an inference (input 512xWxH, output 2xWxH).

Here is an illustration of the problematic call to the function (scalar case, I didn't explore simd or others yet):

nnp_conv1x1_upto_2x4__scalar(input_channels_subblock_size=2,  output_channels_subblock_size=2, ...)

Then, kernel20 (and kernel30) is not initialized but used later in:

        const float input0 = *pinput0++;
        output0 += kernel00 * input0;
        output1 += kernel10 * input0;
        output2 += kernel20 * input0;
        output3 += kernel30 * input0;

Moreover output2 and output3 will access wrong memory addresses.

Would the following fix solve the issue? Not very efficient though...

        const float input0 = *pinput0++;
        output0 += kernel00 * input0;
        if (output_channels_subblock_size > 1)
            output1 += kernel10 * input0;
        if (output_channels_subblock_size > 2)
            output2 += kernel20 * input0;
        if (output_channels_subblock_size > 3)
            output3 += kernel30 * input0;

        if (input_channels_subblock_size > 1) 
        {
            const float input1 = *pinput1++;
            output0 += kernel01 * input1;
            if (output_channels_subblock_size > 1 && input_channels_subblock_size > 1)
                output1 += kernel11 * input1;
            if (output_channels_subblock_size > 2 && input_channels_subblock_size > 2)
                output2 += kernel21 * input1;
            if (output_channels_subblock_size > 3 && input_channels_subblock_size > 3)
                output3 += kernel31 * input1;
        }
Maratyszcza commented 6 years ago

If output_channels_subblock_size <= 2, then kernel20 and kernel21 are never initialized. They are used in accumulating output2, but output2 is not written to the output. This is an optimization to avoid initializing kernel20/kernel21 or having extra conditionals in the code. It relies on undefined behavior, so a compiler could generate wrong code if it tries to be too smart.

kikoland commented 6 years ago

Thanks! not sure to fully understand how a standard compiler reacts to that. It crashes with visual studio internal compiler (but the fix seems to work and that's enough for my own tests).

Maratyszcza commented 6 years ago

Probably less costly fix would be to zero-initialize all kernelXY variables.