Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

Inconsistent results after upgrading ncnn #2879

Open mtamburrano opened 3 years ago

mtamburrano commented 3 years ago

I have been using an MTCNN network for a while to extract faces and landmarks from images. The original model was converted from Caffe without issues using the caffe2ncnn tool.

Now, after updating ncnn, I have two issues:

So I tried to figure out what is happening, and I managed to find the last release that works for my case: release 20200727 returns the values I was used to getting (so my tests are green again), while from release 20200916 onwards I get the different behavior. I then built the library against almost every commit between the two releases and found that the "offending" commit is this one: https://github.com/Tencent/ncnn/commit/d04c05651f1dda38d0330fc6b7334b8e47cd1246

It looks like it is something related to the Interp layer. The interesting part is that my converted model contains no such layer, at least I cannot find it in the ncnn.param file.

So, do you have any idea what is happening here? Thank you in advance

nihui commented 3 years ago

Since version 20200916, fp16 storage and arithmetic have been enabled by default on supported platforms. You can disable them via

net.opt.use_fp16_storage = false;
net.opt.use_fp16_arithmetic = false;

before loading your model param
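
For example, a minimal sketch of where those options go (the net name and model paths here are just placeholders):

#include "net.h"

ncnn::Net mtcnn;

// the fp16 options must be set before load_param/load_model,
// otherwise they have no effect on the already-loaded layers
mtcnn.opt.use_fp16_storage = false;
mtcnn.opt.use_fp16_arithmetic = false;

mtcnn.load_param("det1.param"); // placeholder paths
mtcnn.load_model("det1.bin");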

mtamburrano commented 3 years ago

thank you @nihui, I tried but unfortunately I get the same results. Anyway, as I said, it is the commit https://github.com/Tencent/ncnn/commit/d04c05651f1dda38d0330fc6b7334b8e47cd1246 that breaks things, and I cannot see anything there related to fp16 storage and arithmetic. That commit changes something in the Interp layer, which as far as I know is also used by the resize functions.

So I have probably identified the source of the issue, and it is in my preprocessing code, not in the forward pass. More precisely, my code looks like this:

ncnn::Mat input;
// one blob with room for 5 crops of image.c channels each
input.create(24, 24, image.c * 5, image.elemsize);
...
// channels is a view into the i-th slot of input (shared memory, no copy)
ncnn::Mat channels = input.channel_range(image.c * i, image.c);
// resize the padded crop directly into that view
ncnn::resize_bilinear(pad, channels, 24, 24);
...

Here we do some work to fill the input Mat. With the old release, input was filled correctly, but with the new version this no longer seems to be the case: input is left untouched. I'm still investigating why...
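
One quick check (just a debugging sketch, reusing the variables above) is to compare the view's data pointer before and after the call:

ncnn::Mat channels = input.channel_range(image.c * i, image.c);
const void* before = channels.data;
ncnn::resize_bilinear(pad, channels, 24, 24);
// if the pointers differ, resize_bilinear replaced channels' storage
// and the view no longer aliases input's memory
printf("still aliased: %d\n", channels.data == before);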

mtamburrano commented 3 years ago

I can confirm that the issue is what I reported. This patch:

ncnn::Mat channels = input.channel_range(image.c * i, image.c);
ncnn::resize_bilinear(pad, channels, 24, 24);
// channels no longer aliases input after the call above,
// so copy the resized data back into input by hand
for (int c = 0; c < channels.c; ++c) {
    float* ptr = input.channel(image.c * i + c);
    for (int w = 0; w < 24; ++w) {      // w indexes rows here
        for (int h = 0; h < 24; ++h) {  // h indexes columns
            *(ptr + (24 * w) + h) = channels.channel(c).row(w)[h];
        }
    }
}

fixes everything (sorry, I'm sure there is a more "ncnn-onic" way to fill the Mat, but I'm pretty new to it and not sure how to do it properly).

So, for some reason, resize_bilinear fills the channels Mat, but channels is no longer treated as a reference to a portion of input, as it was with the old library.
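
In case it helps someone, here is a sketch of one possible cleaner workaround (assuming float data, and that each channel's 24x24 block is contiguous in memory, with cstep only padding between channels): resize into a standalone Mat first, then copy each channel into the right slot of input:

#include <cstring> // memcpy

ncnn::Mat resized;
ncnn::resize_bilinear(pad, resized, 24, 24);
for (int c = 0; c < resized.c; ++c) {
    // copy one 24x24 float channel into its slot inside input
    memcpy((float*)input.channel(image.c * i + c),
           (const float*)resized.channel(c),
           24 * 24 * sizeof(float));
}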

mtamburrano commented 3 years ago

here: https://github.com/Tencent/ncnn/blob/34bd5ef16151417da32e1c258f5c55cd8501434e/src/layer/interp.cpp#L447 the data pointer of the Mat is being changed, which is why, when we pass this function a Mat that references a portion of another Mat, the first Mat is not updated. I also see some memory-leak risk here. Is this the expected behavior? It didn't work like that in the past.
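
If I understand Mat semantics correctly, the mechanism would be something like this (a standalone sketch, not the actual interp code): Mat assignment rebinds the left-hand side to the right-hand side's storage instead of copying into it, so a view gets silently detached:

ncnn::Mat big(24, 24, 6);
ncnn::Mat view = big.channel_range(0, 3); // view.data points into big's memory
ncnn::Mat other(24, 24, 3);
view = other; // rebinds view to other's storage; big's memory is never written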