ybch14 opened this issue 7 years ago
I have got the same issue and want to know why.
I have got the same issue. Who knows why?
It might not be the fault of the batch size but rather of the test iteration number. Still, if this is floating-point error, it has reached a ridiculous extent.
See if this PR could help?
Hi @PENGUINLIONG. What do you mean by saying the test iteration number is at fault? Also, the PR you mentioned seems to be about the efficiency of calculating accuracy; what is the relationship?
See this line. The result from each iteration is summed up and then divided by the iteration number. Floating point numbers are only dense between -1 and 1, so the more iterations and the more floating point operations are done, the more error is introduced.
Any change to the code might lead to a different outcome, especially for numeric computation, so you could try it and see if it helps.
Hi PENGUINLIONG, so it is basically a problem of calculating test accuracy? And when I changed batch_size from 1 to 3, the number of iterations needed to go through all test samples dropped to a third, which increased the test accuracy correspondingly? In that case it seems that test accuracy calculated during training is not trustworthy and we shouldn't rely on it. Does the train phase calculate accuracy in a different way?
I can't tell whether the result would go up or down. What I can assert is that the more iterations/operations are done, the more error is introduced. Caffe didn't do anything wrong; on the contrary, it is doing the right thing: adding up all of (accuracy divided by iteration number) needs 2N operations, but what Caffe does is sum the accuracies first and then divide by the iteration number, which takes only N + 1 floating point operations. That is obviously less error-prone.
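If you want to see the difference between the two orders yourself, here is a toy comparison (standalone code with made-up per-iteration accuracies, not anything taken from Caffe); it prints both means next to a double-precision reference:

```cpp
// Toy comparison of the two averaging orders discussed above.
#include <cstdio>

int main() {
  const int kIterations = 1000000;   // deliberately large iteration count
  float sum_then_divide = 0.0f;      // what Caffe does: sum, then divide once (N + 1 ops)
  float divide_then_sum = 0.0f;      // the alternative: divide every term (2N ops)
  double reference = 0.0;            // double-precision reference

  for (int i = 0; i < kIterations; ++i) {
    // Fake per-iteration accuracy in [0, 1); any bounded sequence works here.
    const float acc = (i % 100) / 100.0f;
    sum_then_divide += acc;
    divide_then_sum += acc / kIterations;
    reference += acc;
  }
  sum_then_divide /= kIterations;
  reference /= kIterations;

  std::printf("sum then divide: %.9f (error %.3g)\n",
              sum_then_divide, sum_then_divide - reference);
  std::printf("divide then sum: %.9f (error %.3g)\n",
              divide_then_sum, divide_then_sum - reference);
  return 0;
}
```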
When exactly that division happens is not important, I suppose. But considering that the accuracy computation runs on the CPU, and AFAIK the Intel FPU uses 80 bits internally to represent a floating point number (so intermediate precision is higher), the two methods may still give slightly different results. Which way is better? I don't know. Either way, the error should never be this significant.
Numeric error is simply a characteristic of floating point numbers and is inevitable. But, again, the amount of error seen here would be absurd.
I think Caffe should use different strategies for small and large iteration numbers. I will check soon whether this is actually where the error comes from; if it is, I will try to fix it and open a PR.
I have to wait for the previously mentioned PR to be merged or closed, so it may take some time; for now you may need to modify the code yourself.
Hi PENGUINLIONG. According to your last comment, if the accuracy calculation in Caffe is correct, it seems the effect of floating point error over a large number of iterations can't be avoided. How can I modify the code to work around it? And even if floating point numbers lose some precision, as you said, it shouldn't have that much of an effect, right? Certainly not from 50% to 80%.
Yep, Caffe's accuracy layer has been in use for a rather long time (it's been two years since its last change). If there were critical issues, they would certainly have been reported already. I literally have no idea what has introduced such an error.
Just a few lines need to be changed.
Change this line to:

```cpp
const float score = result_vec[k] / FLAGS_iterations;
```

And this line to:

```cpp
const float mean_score = test_score[i];
```
Why would it work? Because the values FP numbers can represent are discrete and limited, and most representable FP numbers lie between -1 and 1. Although more FP operations are involved, the number we keep in memory stays small and can therefore be stored more precisely.
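To make the density point concrete, here is a standalone check (nothing Caffe-specific, just the standard library) of how far apart adjacent float values are near 1 compared to near a large running sum:

```cpp
// Spacing between adjacent float values at two different magnitudes.
#include <cstdio>
#include <cmath>

int main() {
  const float near_one   = 1.0f;
  const float near_large = 100000.0f;  // roughly where a summed score could end up
  std::printf("gap after 1:      %g\n", std::nextafter(near_one, 2.0f) - near_one);
  std::printf("gap after 100000: %g\n", std::nextafter(near_large, 2.0e5f) - near_large);
  return 0;
}
```

The gap near 1 is far smaller, which is why a running value that stays small can, in principle, be stored more precisely.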
I have to say, however, that I'm pretty unsure whether it will work, to what extent, or at how many iterations it would start to matter. You could try it yourself for now.
In case you wonder why it was solver.cpp in my last comment but caffe.cpp in this one: they work pretty much the same, but caffe.cpp is for the command line interface, i.e. caffe.exe, which is what you have used.
Hi PENGUINLIONG. I tried the change and recompiled Caffe. (I use a 'test phase' in train.prototxt instead of the 'caffe test' command, so it is the modification in solver.cpp that took effect.) The test accuracy is still very low with batch_size 1; nothing has changed. :(
Geez, sorry, I have no clue about this right now... I will try to figure it out later.
Hey PENGUINLIONG, I think I might have found a clue. When I use an older version of Caffe with 'libcaffe.so', the test accuracy is good with batch_size=1, while with 'libcaffe.so.1.0.0-rc3' the batch_size of the test phase affects the test accuracy. Not sure whether this gives you a hint.
@shanyucha Okay, I'll check it out later! Could you tell me the version number or the approximate release date of the old Caffe you used?
Actually, I used caffe-segnet (a modified version of Caffe for SegNet). Its GitHub repository is https://github.com/alexgkendall/caffe-segnet. I tried to figure out which Caffe version it is based on but failed; Caffe seems to use git tags as versions. The last commit of caffe-segnet is about two years old, so the underlying Caffe version is probably from around that time.
The naming was changed in https://github.com/BVLC/caffe/pull/3311. The accuracy calculation code was changed twice before that (https://github.com/BVLC/caffe/pull/531 and https://github.com/BVLC/caffe/pull/615), but neither change looks suspicious. So, I suppose, the accuracy computation itself is functioning normally.
Is it possible that the official Caffe version has always had this accuracy problem, while caffe-segnet modified the code and somehow changed the way accuracy is calculated?
I checked that repo: accuracy.cpp uses the same mechanism as what we currently have, and caffe.cpp is also the same. Maybe the problem is not in the testing and accuracy calculation at all but in the network components themselves. That sounds terrible.
I used the same network with different Caffe versions and the accuracy problem appeared, which is why I assumed the Caffe version was the problem. But if, as you said, they use the same mechanism, then I have no idea why the problem arises. :(
I reproduced the problem with deeplab-caffe and with caffe-segnet-cudnn5 (https://github.com/TimoSaemann/caffe-segnet-cudnn5), the latter being a newer version of caffe-segnet.
If you set use_global_stats: false during test (I can't see whether your test.prototxt has this), the output of the batch normalization layers will differ depending on the images in each batch (and thus on the number of images per batch). Maybe the result is due to this.
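To illustrate with made-up numbers (a toy sketch only, not Caffe code; the input, running statistics and epsilon are all invented):

```cpp
// Toy sketch: how batch-statistics normalization depends on what is in the batch.
#include <cstdio>
#include <cmath>
#include <vector>

// Normalize x with a given mean/variance (scale = 1, shift = 0 for simplicity).
float bn(float x, float mean, float var, float eps = 1e-5f) {
  return (x - mean) / std::sqrt(var + eps);
}

int main() {
  const float x = 2.0f;            // one sample's feature value
  const float global_mean = 3.0f;  // pretend accumulated (global) statistics
  const float global_var  = 4.0f;

  // use_global_stats: true (bn_mode: INFERENCE): the output for a sample does
  // not depend on whatever else happens to be in the batch.
  std::printf("global stats:               %.3f\n", bn(x, global_mean, global_var));

  // use_global_stats: false with batch_size = 1: the batch mean is the sample
  // itself and the batch variance is 0, so the normalized value collapses to 0.
  std::printf("batch stats, batch={2}:     %.3f\n", bn(x, x, 0.0f));

  // use_global_stats: false with batch_size = 3: statistics now come from the
  // whole mini-batch, so the very same sample is normalized differently again.
  const std::vector<float> batch = {2.0f, 4.0f, 6.0f};
  float m = 0.0f, v = 0.0f;
  for (float b : batch) m += b;
  m /= batch.size();
  for (float b : batch) v += (b - m) * (b - m);
  v /= batch.size();
  std::printf("batch stats, batch={2,4,6}: %.3f\n", bn(x, m, v));
  return 0;
}
```

The same input gets three different outputs, which is exactly why changing the test batch size can change the predictions when batch statistics are used.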
I have the same problem using Amulet, which is based on Caffe SegNet. The same issue also arises with the newer version caffe-segnet-cudnn5. I get different results for different batch sizes. @shanyucha Did you solve this problem in the meantime?
@tkrahn108 nope. Still got the problem.
@VictorXunS Train and test share the same prototxt; I just set the test settings in the 'test phase'. Nothing specific:

```
layer {
  name: "data"
  type: "DenseImageData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  dense_image_data_param {
    source: "train.list"  # Change this to the absolute path to your data file
    batch_size: 9         # Change this number to a batch size that will fit on your GPU
    shuffle: true
  }
}
layer {
  name: "data"
  type: "DenseImageData"
  top: "data"
  top: "label"
  include { phase: TEST }
  dense_image_data_param {
    source: "test.list"   # Change this to the absolute path to your data file
    batch_size: 1         # Change this number to a batch size that will fit on your GPU
    shuffle: true
  }
}
```
@shanyucha Can you try replacing every "use_global_stats: false" in the test prototxt with "use_global_stats: true" and see if your results still depend on batch size?
@VictorXunS caffe-segnet modified the definition of 'BatchNormParameter', and the parameter 'use_global_stats' does not exist there. As the comment says, 'use_global_stats' is normally set to true during testing so that the accumulated BN statistics are used. I found the corresponding parameter, 'BNParameter', defined in caffe.proto of caffe-segnet: https://github.com/alexgkendall/caffe-segnet/blob/segnet-cleaned/src/caffe/proto/caffe.proto#L426
It seems that 'bn_mode' serves the same function as 'use_global_stats'. I think I can try setting 'bn_mode' to 'INFERENCE' in the test phase; it might solve the problem.
```
// Message that stores parameters used by BN (Batch Normalization) layer
message BNParameter {
  enum BNMode {
    LEARN = 0;
    INFERENCE = 1;
  }
  optional BNMode bn_mode = 3 [default = LEARN];
  optional FillerParameter scale_filler = 1;  // The filler for the scale
  optional FillerParameter shift_filler = 2;  // The filler for the shift
}
```
@VictorXunS My bad. I think I was talking about 'phase: TEST' during training, i.e. validation, rather than testing. Were you talking about testing after training is finished, with the final model?
We received a rewrite of the Accuracy layer after this issue was posted (#5836). Does the issue still persist with the new layer?
If so it might be related to BatchNorm, if your network has it.
@shanyucha Yes, I meant test and not validation. For testing, I think the batch norm parameter has to be changed to INFERENCE mode to get good results, especially if the test batch size is small. Otherwise the net normalizes over too small a batch sample and the output of the batch norm layers gets shifted. I don't know if this solves your problem, though.
@VictorXunS I checked my inference.prototxt and bn_mode is set to INFERENCE there. However, my problem happens during the validation phase, so I suppose it's not the same case as what you described.
@shanyucha any update on this issue, bro?
I think you may be calculating the accuracy incorrectly. It should be avg_accuracy = total_correct_samples / total_samples, not avg_accuracy = (acc1 + acc2 + ... + accN) / N.
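For what it's worth, the two formulas only diverge when the per-batch accuracies being averaged come from batches of unequal size. A tiny made-up example (two batches, 10 and 5 images):

```cpp
// Made-up example: batch 1 has 8/10 correct, batch 2 has 5/5 correct.
#include <cstdio>

int main() {
  const int correct[2] = {8, 5};
  const int total[2]   = {10, 5};

  // total_correct_samples / total_samples
  const double per_sample = double(correct[0] + correct[1]) / double(total[0] + total[1]);

  // (acc1 + acc2) / N
  const double per_batch = (double(correct[0]) / total[0] +
                            double(correct[1]) / total[1]) / 2.0;

  std::printf("per-sample average: %.4f\n", per_sample);  // 13/15 = 0.8667
  std::printf("per-batch average:  %.4f\n", per_batch);   // (0.8 + 1.0) / 2 = 0.9000
  return 0;
}
```

If every test batch has exactly the same size, the two averages coincide (up to floating point error), so this alone would not explain a large accuracy gap between different batch sizes unless the last batch is partially filled or wraps around.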
I'm using Caffe (the newest version) for facial expression recognition, with a ResNet-50 v1 model. When testing with the trained model, I find that the test accuracy changes when I change the batch size of the image data layer in the TEST net. The train_val and test model definitions are as follows:
Resnet_50_train_val.prototxt
Resnet_50_test.prototxt
In test.txt there are 1600 pictures. When I test with batch size 10 (test iterations = 160) in Resnet_50_test.prototxt I get one accuracy; when I change the batch size to 1 (test iterations = 1600) I get a different accuracy (results were attached as screenshots).
I'm really confused about this. Should batch size influence the test result? Is there a mistake in my experiment, or is it a bug in Caffe? Could anybody help me or tell me why? I'd really appreciate it!
System configuration
Operating system: Ubuntu 16.04 LTS Server
Compiler: gcc-4.8.4
CUDA version (if applicable): CUDA 8.0