BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Test accuracy changes with test batch size #5621

Open ybch14 opened 7 years ago

ybch14 commented 7 years ago

I've been using Caffe (the newest version) for facial expression recognition recently, with a ResNet-50 v1 model. However, when testing with the trained model, I find that the test accuracy changes when I change the batch size of the image data layer in the TEST net. The train_val and test model definitions are as follows:

Resnet_50_train_val.prototxt

name: "ResNet-50"
layer {
    name: "data"
    type: "ImageData"
    top: "data"
    top: "label"
    include {
        phase: TRAIN
    }
    transform_param {
        scale: 0.00390625
        mean_value: 128
    }
    image_data_param {
        source: "media/finetuning_flip/train.txt"
        root_folder: "media/finetuning_flip/"
        new_height: 224
        new_width: 224
        is_color: false
        batch_size: 10
        shuffle: true
    }
}
layer {
    name: "data"
    type: "ImageData"
    top: "data"
    top: "label"
    include {
        phase: TEST
    }
    transform_param {
        scale: 0.00390625
        mean_value: 128
    }
    image_data_param {
        source: "media/finetuning_flip/val.txt"
        root_folder: "media/finetuning_flip/"
        new_height: 224
        new_width: 224
        is_color: false
        batch_size: 10
        shuffle: true
    }
}

layer {
    bottom: "data"
    top: "conv1"
    name: "conv1"
    type: "Convolution"
    convolution_param {
        num_output: 64
        kernel_size: 7
        pad: 3
        stride: 2
        weight_filler {
            type: "msra"
        }
        bias_term: true
    }
}

layer {
    bottom: "conv1"
    top: "conv1"
    name: "bn_conv1"
    type: "BatchNorm"
    batch_norm_param {
        use_global_stats: false
    }
}
...
layer {
    bottom: "pool5"
    top: "fc8"
    name: "fc8"
    type: "InnerProduct"
    param {
        lr_mult: 1
        decay_mult: 1
    }
    param {
        lr_mult: 2
        decay_mult: 1
    }
    inner_product_param {
        num_output: 8
        weight_filler {
            type: "xavier"
        }
        bias_filler {
            type: "constant"
            value: 0
        }
    }
}

layer {
    bottom: "fc8"
    bottom: "label"
    name: "loss"
    type: "SoftmaxWithLoss"
    top: "loss"
}

layer {
    bottom: "fc8"
    bottom: "label"
    top: "acc"
    name: "acc"
    type: "Accuracy"
    include {
        phase: TEST
    }
}

Resnet_50_test.prototxt

name: "ResNet-50"

layer {
    name: "data"
    type: "ImageData"
    top: "data"
    top: "label"
    include {
        phase: TEST
    }
    transform_param {
        scale: 0.00390625
        mean_value: 128
    }
    image_data_param {
        source: "media/finetuning_flip/test.txt"
        root_folder: "media/finetuning_flip/"
        new_height: 224
        new_width: 224
        is_color: false
        batch_size: 10
        shuffle: false
    }
}

layer {
    bottom: "data"
    top: "conv1"
    name: "conv1"
    type: "Convolution"
    convolution_param {
        num_output: 64
        kernel_size: 7
        pad: 3
        stride: 2
        weight_filler {
            type: "msra"
        }
        bias_term: true
    }
}
...
layer {
    bottom: "pool5"
    top: "fc8"
    name: "fc8"
    type: "InnerProduct"
    param {
        lr_mult: 1
        decay_mult: 1
    }
    param {
        lr_mult: 2
        decay_mult: 1
    }
    inner_product_param {
        num_output: 8
        weight_filler {
            type: "xavier"
        }
        bias_filler {
            type: "constant"
            value: 0
        }
    }
}

layer {
    bottom: "fc8"
    bottom: "label"
    name: "loss"
    type: "SoftmaxWithLoss"
    top: "loss"
}

layer {
    bottom: "fc8"
    bottom: "label"
    top: "acc"
    name: "acc"
    type: "Accuracy"
    include {
        phase: TEST
    }
}

There are 1600 pictures in test.txt. When I test with a batch size of 10 (160 test iterations) in Resnet_50_test.prototxt, I get this result:

:/caffe-master$ ./build/tools/caffe test -model media/Resnet_50_test.prototxt -weights media/resnet_finetuning_flip_iter_43200_0.953499.caffemodel -gpu 2 -iterations 160
......
I0514 20:38:42.929600 591 caffe.cpp:330] acc = 0.9625

Then I change the batch size to 1 (1600 test iterations) and get this result:

:/caffe-master$ ./build/tools/caffe test -model media/Resnet_50_test.prototxt -weights media/resnet_finetuning_flip_iter_43200_0.953499.caffemodel -gpu 2 -iterations 1600
......
I0514 20:41:53.250874 710 caffe.cpp:330] acc = 0.770625

I'm really confused by this. Should the batch size influence the test result? Is there a mistake in my experiment, or is this a bug in Caffe? Could anybody help me or tell me why? I'd really appreciate it!

System configuration

Operating system: Ubuntu 16.04 LTS Server
Compiler: gcc 4.8.4
CUDA version (if applicable): CUDA 8.0
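
For reference, one way to double-check this independently of the Accuracy layer is to count correct predictions directly in pycaffe. A minimal sketch (assuming the paths above and the fc8/label blob names from the posted prototxt) would look like this:

# Minimal pycaffe sanity check (not part of the original runs): count correct
# predictions directly instead of relying on the Accuracy layer's average.
# Assumes the prototxt/weights paths above and the blob names "fc8" and "label".
import numpy as np
import caffe

caffe.set_mode_gpu()
caffe.set_device(2)

net = caffe.Net('media/Resnet_50_test.prototxt',
                'media/resnet_finetuning_flip_iter_43200_0.953499.caffemodel',
                caffe.TEST)

num_correct = 0
num_total = 0
for _ in range(160):  # 160 iterations x batch_size 10 = 1600 test images
    net.forward()
    preds = net.blobs['fc8'].data.argmax(axis=1)          # predicted class per image
    labels = net.blobs['label'].data.astype(int).ravel()  # ground-truth labels
    num_correct += int((preds == labels).sum())
    num_total += labels.size

print('accuracy = %.6f' % (float(num_correct) / num_total))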

shanyucha commented 7 years ago

I have the same issue and would like to know why.

MichaelYSC commented 7 years ago

I have the same issue. Does anyone know why?

PENGUINLIONG commented 7 years ago

It might not be the fault of the batch size but rather of the test iteration count. That said, if this is floating-point error, it has reached a ridiculous extent.

See if this PR could help?

shanyucha commented 7 years ago

Hi @PENGUINLIONG. What do you mean by saying the test iteration count is at fault? The PR you mentioned seems to be about the efficiency of calculating accuracy; what is the connection?

PENGUINLIONG commented 7 years ago

See this line. The result of each iteration is summed up and then divided by the iteration count. Floating-point numbers are only dense between -1 and 1, so the more iterations (and hence floating-point operations) are performed, the more error accumulates.

Any change to the code might lead to a different outcome, especially in numeric computation, so you could try it and see if it helps.

shanyucha commented 7 years ago

Hi @PENGUINLIONG. So it's a problem with how the test accuracy is calculated? And when I change batch_size from 1 to 3, the number of iterations needed to go through all the test samples drops to a third, which correspondingly increases the test accuracy? In that case it seems the test accuracy calculated during training is not trustworthy and we shouldn't rely on it. Does the train phase calculate accuracy in a different way?

PENGUINLIONG commented 7 years ago

I can't tell whether the result would go up or down. What I can assert is that the more iterations/operations are performed, the more error accumulates. Caffe isn't doing anything wrong; on the contrary, it does the right thing: adding up all the (accuracy divided by iteration count) terms needs 2N operations, whereas Caffe sums the accuracies first and then divides by the iteration count, which takes only N + 1 floating-point operations. That is obviously less error-prone.

When the division happens probably isn't important, I suppose. But considering that the accuracy computation runs on the CPU, and AFAIK the Intel FPU uses 80 bits to represent a floating-point number internally (so precision is higher there), the two methods may differ slightly. Which way is better? I don't know. Either way, the error should never be significant.

Numeric error is an inherent characteristic of floating-point numbers and is unavoidable. But, again, that amount of error would be absurd.
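
For example, here is a quick numpy sketch (my own, with synthetic per-iteration accuracies, not data from this issue) comparing the two accumulation orders in single precision; the rounding error of either order is orders of magnitude smaller than the roughly 0.19 gap reported above:

# Synthetic comparison of the two accumulation orders in float32.
import numpy as np

rng = np.random.RandomState(0)
acc = rng.rand(1600).astype(np.float32)  # pretend per-iteration accuracies

# divide each term first, then add: 2N floating-point operations
mean_div_first = np.float32(0)
for a in acc:
    mean_div_first += a / np.float32(len(acc))

# sum first, divide once: N + 1 operations (the order Caffe uses)
total = np.float32(0)
for a in acc:
    total += a
mean_sum_first = total / np.float32(len(acc))

reference = acc.astype(np.float64).mean()  # float64 reference value
print(mean_div_first, mean_sum_first, reference)
# Both float32 results agree with the reference far more closely than the
# 0.96-vs-0.77 discrepancy in question.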

PENGUINLIONG commented 7 years ago

I think Caffe should use different strategies for small and large iteration counts. I will check soon whether this is actually where the error comes from; if it is, I will try to fix it and open a PR.

I have to wait for the previously mentioned PR to be merged or closed, so it may take some time; for now you may need to modify the code yourself.

shanyucha commented 7 years ago

Hi @PENGUINLIONG. According to your last comment, if Caffe's accuracy calculation is correct, then the effect of floating-point error over a large number of iterations can't be avoided. How can I modify the code to work around it? And even if floating-point numbers lose some precision, as you said, it shouldn't have this much of an effect, right? Not something like going from 50% to 80%.

PENGUINLIONG commented 7 years ago

Yep, Caffe's accuracy layer has been in use for quite a long time (it has been two years since its last change). If there were critical issues, they would certainly have been reported already. I honestly have no idea what is causing such a large error.

Just a few lines need to be changed.

Change this line to:

const float score = result_vec[k] / FLAGS_iterations;

And this line to:

const float mean_score = test_score[i];

Why might this work? Because the values floating-point numbers can represent are discrete and limited, and most representable values lie between -1 and 1. Although more FP operations are involved, the numbers stored in memory can be more precise.

I have to say, however, that I'm pretty unsure whether it will work, to what extent, or after how many iterations it would start to matter. You could try it yourself for now.

In case you're wondering why it was solver.cpp in my last comment but caffe.cpp in this one: they work pretty much the same way, but caffe.cpp implements the command-line interface, i.e. the caffe binary you have been using.

shanyucha commented 7 years ago

Hi @PENGUINLIONG. I changed the code and recompiled Caffe. (I use a TEST phase in train.prototxt instead of the 'caffe test' command, so it is the modification in solver.cpp that applies.) The test accuracy is still very low with batch_size 1; nothing has changed. :(

PENGUINLIONG commented 7 years ago

Geez, sorry, I have no clue about this now... I'll try to figure it out later.

shanyucha commented 7 years ago

Hey @PENGUINLIONG, I think I might have found a clue. When I use an older version of Caffe with 'libcaffe.so', the test accuracy is good with batch_size=1, but with 'libcaffe.so.1.0.0-rc3' the test-phase batch size affects the test accuracy. Not sure whether this gives you a hint.

PENGUINLIONG commented 7 years ago

@shanyucha Okay, I'll check it out later! Could you tell me the version number, or the approximate release date, of the old Caffe you used?

shanyucha commented 7 years ago

Actually, I used caffe-segnet (a modified version of Caffe for SegNet); its GitHub repository is https://github.com/alexgkendall/caffe-segnet. I tried to figure out which Caffe version it is based on but failed; Caffe seems to use git tags as versions. The last commit of caffe-segnet is about two years old, so the underlying Caffe version is probably from around that time.

PENGUINLIONG commented 7 years ago

The naming changed in https://github.com/BVLC/caffe/pull/3311. The accuracy-calculation code was changed twice before that (https://github.com/BVLC/caffe/pull/531 and https://github.com/BVLC/caffe/pull/615), but neither change looks suspicious. So I suppose the accuracy computation itself is functioning normally.

shanyucha commented 7 years ago

Is it possible that the official Caffe has always had this accuracy problem, while caffe-segnet modified the code and somehow changed the way accuracy is calculated?

PENGUINLIONG commented 7 years ago

I checked that repo: accuracy.cpp uses the same mechanism as what we have currently, and caffe.cpp is the same as well. Maybe the problem is not in the testing and accuracy-calculation process but somewhere in the network components. That sounds terrible.

shanyucha commented 7 years ago

I used the same network with a different Caffe version and the accuracy problem appeared, which is why I assumed the Caffe version was the problem. But if, as you said, they use the same mechanism, then I have no idea why the problem arises. :(

shanyucha commented 7 years ago

I reproduced the problem with deeplab-caffe and with caffe-segnet-cudnn5 (https://github.com/TimoSaemann/caffe-segnet-cudnn5), the latter being a newer version of caffe-segnet.

VictorXunS commented 6 years ago

If use_global_stats is set to false during testing (I can't tell whether your test.prototxt does this), the output of the batch-normalization layers will depend on the images in each batch (and thus on the number of images per batch). Maybe the result is due to this.
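
To illustrate with a toy numpy sketch (made-up shapes and statistics, not Caffe code): with per-batch statistics the same sample is normalized differently depending on its batch mates, and with batch size 1 the normalization collapses entirely, whereas stored global statistics give batch-independent outputs:

# Toy illustration of per-batch vs. global batch-norm statistics (numpy only).
import numpy as np

rng = np.random.RandomState(0)
eps = 1e-5

def bn_batch_stats(x):
    # normalize over the batch axis with the current batch's statistics
    # (what use_global_stats: false does during TEST)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def bn_global_stats(x, mean, var):
    # normalize with stored running statistics (use_global_stats: true / INFERENCE)
    return (x - mean) / np.sqrt(var + eps)

global_mean, global_var = 0.5, 4.0           # pretend running stats from training
sample = rng.normal(0.5, 2.0, size=(1, 16))  # one test feature vector

batch_of_10 = np.vstack([sample, rng.normal(0.5, 2.0, size=(9, 16))])
batch_of_1 = sample

print(bn_batch_stats(batch_of_10)[0][:4])  # depends on the other 9 images
print(bn_batch_stats(batch_of_1)[0][:4])   # batch of 1: every feature collapses to 0
print(bn_global_stats(sample, global_mean, global_var)[0][:4])  # batch-independent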

tkrahn108 commented 6 years ago

I have the same problem using Amulet, which is based on Caffe SegNet. The same issue also arises with the newer version caffe-segnet-cudnn5. I get different results for different batch sizes. @shanyucha Did you solve this problem in the meantime?

shanyucha commented 6 years ago

@tkrahn108 nope. Still got the problem.

shanyucha commented 6 years ago

@VictorXunS Train and test share the same prototxt; the test settings are under 'phase: TEST', nothing specific:

layer {
    name: "data"
    type: "DenseImageData"
    top: "data"
    top: "label"
    include {
        phase: TRAIN
    }
    dense_image_data_param {
        source: "train.list"  # Change this to the absolute path to your data file
        batch_size: 9         # Change this number to a batch size that will fit on your GPU
        shuffle: true
    }
}
layer {
    name: "data"
    type: "DenseImageData"
    top: "data"
    top: "label"
    include {
        phase: TEST
    }
    dense_image_data_param {
        source: "test.list"   # Change this to the absolute path to your data file
        batch_size: 1         # Change this number to a batch size that will fit on your GPU
        shuffle: true
    }
}

VictorXunS commented 6 years ago

@shanyucha Can you try replacing every "use_global_stats: false" in the test prototxt with "use_global_stats: true" and see whether your results still depend on the batch size?

shanyucha commented 6 years ago

@VictorXunS caffe-segnet modified the batch-normalization parameter definition, and the 'use_global_stats' parameter does not exist there. As the comment in caffe.proto says, 'use_global_stats' should be set to true during testing so that the accumulated global statistics are used instead of per-batch ones. So I looked for the corresponding parameter, i.e. 'BNParameter', in caffe-segnet's caffe.proto: https://github.com/alexgkendall/caffe-segnet/blob/segnet-cleaned/src/caffe/proto/caffe.proto#L426

It seems that 'bn_mode' serves the same function as 'use_global_stats'. I think I can try setting 'bn_mode' to 'INFERENCE' in the test phase, and it might solve the problem.

// Message that stores parameters used by BN (Batch Normalization) layer
message BNParameter {
    enum BNMode {
        LEARN = 0;
        INFERENCE = 1;
    }
    optional BNMode bn_mode = 3 [default = LEARN];
    optional FillerParameter scale_filler = 1;  // The filler for the scale
    optional FillerParameter shift_filler = 2;  // The filler for the shift
}

shanyucha commented 6 years ago

@VictorXunS My bad. I think I am talking about 'phase: TEST' during training, i.e. validation rather than test. Were you talking about testing after training is finished and the model is obtained?

Noiredd commented 6 years ago

We received a rewrite of the Accuracy layer after this issue was posted (#5836). Does the issue still persist with the new layer?
If so, it might be related to BatchNorm, if your network has it.

VictorXunS commented 6 years ago

@shanyucha Yes, I meant test and not validation. For testing, I think the batch-norm parameter has to be changed to INFERENCE mode to get good results, especially if the test batch size is small. Otherwise the net normalizes over too small a batch sample and the outputs of the batch-norm layers get shifted. I don't know whether this can solve your problem, though.

shanyucha commented 6 years ago

@VictorXunS I checked my inference.prototxt and bn_mode is set to INFERENCE there. However, my problem happens during the validation phase, so I suppose it's not the same case as the one you describe.

ndcuong91 commented 6 years ago

@shanyucha Any update on this issue, bro?

FantDing commented 5 years ago

I think the accuracy may be calculated incorrectly. It should be avg_accuracy = total_correct_samples / total_samples, not avg_accuracy = (acc1 + acc2 + ... + accN) / N.
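
For what it's worth, the two formulas coincide whenever every batch has the same size (as in the runs above, since 1600 is divisible by both 10 and 1); they only diverge when batches are unequal, e.g. a smaller final batch. A small synthetic sketch:

# Synthetic counts (not real results) showing when the two aggregation rules differ.
import numpy as np

batch_correct = np.array([9, 8, 10, 4])    # correct predictions per batch
batch_sizes   = np.array([10, 10, 10, 4])  # note the smaller final batch

per_sample_acc    = batch_correct.sum() / batch_sizes.sum()  # 31 / 34, about 0.912
mean_of_batch_acc = np.mean(batch_correct / batch_sizes)     # about 0.925

print(per_sample_acc, mean_of_batch_acc)  # equal only when all batches have the same size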