NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

caffe-0.16 converges slower and produces lower accuracy (compared to caffe-0.15) #347

Closed: mathmanu closed this 7 years ago

mathmanu commented 7 years ago

The loss comes down more slowly and the final accuracy is also lower. Has anyone else observed a similar issue? A friend of mine also observed that the loss is more prone to exploding to NaN in caffe-0.16.

The same issue exists even if I don't use cuDNN. What could be the reason?

Thanks for your help.

CFAndy commented 7 years ago

I see the same trend on my side.

drnikolaev commented 7 years ago

Hi @mathmanu @ChenFengAndy, which particular nets and datasets are these? Do you use Python layers?

CFAndy commented 7 years ago

Mine is ResNet-50, no Python layers.

drnikolaev commented 7 years ago

@ChenFengAndy do you observe the issue using a multi-GPU setup? If so, do you use NVLink or straight PCIe?

CFAndy commented 7 years ago

Yes, NVLink.

mathmanu commented 7 years ago

I don't use NVLink, only PCIe, with two GTX 1080 cards. I observed this on both image classification and segmentation networks.

When I saw the problem I was curious whether it's related to multi-GPU, so I ran training with a single GPU. If I recall correctly, the trend was similar there as well, but I am not completely sure now.

@ChenFengAndy, can you start the training with one GPU and see if the trend is similar there?

drnikolaev commented 7 years ago

@mathmanu @ChenFengAndy thank you. I'll need some time to verify this. So far, a quick AlexNet + ImageNet + cuDNN v6 + DGX-1 comparison between 0.15 and 0.16 shows that 0.16 trains almost two times faster. We also observe a performance boost on other nets. May I bother you to paste NVCaffe logs here (both 0.15 and 0.16)? That would help a lot.

mathmanu commented 7 years ago

Maybe there is a miscommunication: I was talking about the loss and accuracy, not about speed.

drnikolaev commented 7 years ago

@mathmanu yeah, thanks for pointing this out! We actually have some accuracy and determinism improvements in the pipeline; you can give them a try here: https://github.com/drnikolaev/caffe/tree/caffe-0.16 If it's still not satisfactory, please attach logs to this issue.

mathmanu commented 7 years ago

Thanks. I am working on it.

mathmanu commented 7 years ago

I have attached training logs that illustrate this issue: nvidia-caffe-issue-347-v1.zip Please see the train.log files. I tried both classification and segmentation scenarios.

Following are the results:

imagenet classification - top-1 accuracy (see the attached train.log files for the numbers):

Conclusion: caffe-0.16 achieves lower classification accuracy.

cityscapes segmentation - pixel accuracy trend after 2000 iterations (see the attached train.log files for the numbers):

I also have (but have not attached) the full training logs for some (but not all) of the above segmentation scenarios, which show a lower final accuracy in caffe-0.16.

Conclusion: the training loss drops much more slowly in caffe-0.16, and the final segmentation accuracy achieved is also lower.

(For segmentation, I used a custom ImageLabelData layer. It was especially needed in caffe-0.15, whose DataLayer did not honor a fixed random seed; the source code for the new layer is also included in the attached zip file.)
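For context, a minimal sketch of the solver-level seeding this works around. The net path and values are illustrative; `random_seed` is the standard Caffe solver field, which 0.15's stock DataLayer ignored, hence the custom layer:

```
# solver.prototxt (sketch): pin the RNGs so 0.15 and 0.16 runs are comparable.
net: "train_val.prototxt"      # illustrative path
random_seed: 42                # seeds Caffe's RNGs for repeatable runs;
                               # caffe-0.15's DataLayer did not respect this
base_lr: 0.01
lr_policy: "fixed"
max_iter: 100000
snapshot: 10000
snapshot_prefix: "snapshots/seg"
solver_mode: GPU
```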

Let me know if you need any other information.

By the way, thank you for all the great work you are doing. I get about a 25% speedup when using caffe-0.16.

drnikolaev commented 7 years ago

Hi @mathmanu, thank you very much for the detailed report. You are right, accuracy comes first, and we do test it; it seems we missed something here. Marked as a bug, work in progress...

mathmanu commented 7 years ago

Thanks. Kindly review my ImageLabelData layer as well and let me know if I missed anything.

mathmanu commented 7 years ago

I just noticed that the BatchNorm parameters used for the logs I shared are not correct for caffe-0.16, which needs slightly different parameters.

I will correct these and do a run, but training takes a long time for me since I have just two GTX 1080s. If you could try it on your DGX-1 after correcting the BN params, that would be great.

Note that I have noticed the issue even when I use the correct BN parameters.
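For reference, the parameter difference I mean is roughly the following. This is only a sketch: the 0.15 form is the usual BatchNorm + Scale pair, and for 0.16 I am assuming NVCaffe's fused `scale_bias` option, so check the field names against your build:

```
# caffe-0.15 style: BatchNorm keeps only statistics; gamma/beta live in Scale.
layer {
  name: "bn1"  type: "BatchNorm"  bottom: "conv1"  top: "conv1"
  param { lr_mult: 0 }  # running mean (not learned)
  param { lr_mult: 0 }  # running variance (not learned)
  param { lr_mult: 0 }  # moving-average factor (not learned)
}
layer {
  name: "scale1"  type: "Scale"  bottom: "conv1"  top: "conv1"
  scale_param { bias_term: true }  # learnable gamma and beta
}

# caffe-0.16 style (assumed): gamma/beta fused into BatchNorm itself.
layer {
  name: "bn1"  type: "BatchNorm"  bottom: "conv1"  top: "conv1"
  batch_norm_param {
    scale_bias: true              # learn gamma/beta inside the layer
    moving_average_fraction: 0.9  # illustrative value
    eps: 1e-5
  }
}
```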

drnikolaev commented 7 years ago

Is this similar: https://github.com/NVIDIA/caffe/issues/276#issuecomment-289220197 ? @borisgin, could you have a look please? @mathmanu sure, I'll run it tomorrow.

mathmanu commented 7 years ago

Hold on - I will update the results with the corrected params tomorrow.

mathmanu commented 7 years ago

I have re-run the training after correcting the params for the new BN. The issue is very much there and the conclusions remain unchanged.

imagenet classification - top-1 accuracy:

nvidia/caffe (caffe-0.15), 2-gpu: 60.89%
drnikolaev/caffe (caffe-0.16), 2-gpu: 57.56%

Conclusion: caffe-0.16 achieves lower classification accuracy.

cityscapes segmentation - pixel accuracy trend after 2000 iterations:

nvidia/caffe (caffe-0.15), 2-gpu: 90.54%
drnikolaev/caffe (caffe-0.16), 2-gpu: 88.43%

Conclusion: the training loss drops much more slowly in caffe-0.16, and the final segmentation accuracy achieved is also lower.

The logs are in the train.log files in the following attachment: nvidia-caffe-issue-347-v2.zip

Looking forward to a solution. Thanks.

cliffwoolley commented 7 years ago

Thanks for the report, @mathmanu . We're looking into this.

Best, Cliff

/cc @thatguymike @slayton58

drnikolaev commented 7 years ago

@mathmanu @ChenFengAndy - we have reproduced and fixed the issue. Thanks again for reporting it. We are working on a new release now, but if you want early access to the fix, please clone https://github.com/drnikolaev/caffe/tree/caffe-0.16 - it's still under construction, but it does produce the same accuracy as 0.15 (at least on those nets we have tested so far), like this one:

[attached plot: "0 16 fixed" - accuracy curve of the fixed 0.16 build]

mathmanu commented 7 years ago

Great! I'll wait for the release.

mathmanu commented 7 years ago

As far as I understand the fix (in BN), it only changes the output of test/validation. So if I run a test with my previous model (trained in caffe-0.16, which had this bug) using the bug-fixed version, I should get the expected correct accuracy - is that right?

borisgin commented 7 years ago

No. The bug was in the code where the local learning rate was set for the scale and bias in the BN layers. You have to retrain the model.
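Roughly speaking (a hedged sketch, not the actual patch; the param ordering shown is illustrative and depends on the build): with the fused BN, the learnable scale and bias take their local rates from the layer's `param` blocks, and the bug meant those effective rates were wrong during training, so weights produced by the buggy build cannot be repaired at test time:

```
layer {
  name: "bn1"  type: "BatchNorm"  bottom: "conv1"  top: "conv1"
  # Effective lr for each learnable blob is base_lr * lr_mult.
  # The bug set these local rates incorrectly for gamma and beta;
  # the ordering below is illustrative, check your build's blob layout.
  param { lr_mult: 1  decay_mult: 0 }  # scale (gamma)
  param { lr_mult: 1  decay_mult: 0 }  # bias (beta)
  batch_norm_param { scale_bias: true }
}
```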

mathmanu commented 7 years ago

Thank you. I hope the cuDNN BN will get integrated into BVLC/caffe soon.