Closed: mathmanu closed this issue 6 years ago
I also fail to fine-tune ResNet-50 (CUDA 9.1, cuDNN 7).
A friend of mine confirmed that if the issue I reported regarding scaling based on blobs_[2] in the BatchNorm layer is fixed, then NVIDIA/caffe works fine for this kind of network. The scaling that I mentioned is: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/batch_norm_layer.cu#L24
@drnikolaev could you ask an expert to look into this?
Can you attach the nvcaffe log which has 0 accuracy please?
First of all, nvidia/caffe has BatchNorm and ScaleBias fused in one layer (see for example https://github.com/NVIDIA/caffe/blob/caffe-0.17/models/resnet18/train_val.prototxt).
So instead of the 2 layers in BVLC:

    layer {
      name: "data_bn"
      type: "BatchNorm"
      bottom: "data"
      top: "data_bn"
      param { lr_mult: 0.0 }
      param { lr_mult: 0.0 }
      param { lr_mult: 0.0 }
    }
    layer {
      name: "data_scale"
      type: "Scale"
      bottom: "data_bn"
      top: "data_bn"
      param { lr_mult: 1.0 decay_mult: 1.0 }
      param { lr_mult: 2.0 decay_mult: 1.0 }
      scale_param { bias_term: true }
    }

you can use one layer that does both BN and scale/bias:

    layer {
      name: "conv1/bn"
      type: "BatchNorm"
      bottom: "conv1"
      top: "conv1/bn"
      batch_norm_param {
        moving_average_fraction: 0.9
        eps: 0.0001
        scale_bias: true
      }
    }
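To illustrate why the two-layer and one-layer forms are interchangeable in principle, here is a minimal Python sketch of the arithmetic only. It is not code from either Caffe fork. The function names are hypothetical, the statistics are passed in as plain numbers, and `gamma` and `beta` stand for the learned scale and bias.

```python
import math

def batch_norm(x, mean, var, eps=1e-4):
    """Normalize values with fixed statistics (the BatchNorm step alone)."""
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def scale_bias(x, gamma, beta):
    """Learned affine transform (the separate Scale layer with bias_term)."""
    return [gamma * v + beta for v in x]

def fused_batch_norm(x, mean, var, gamma, beta, eps=1e-4):
    """Normalization and affine transform in one pass (illustrative only)."""
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (v - mean) * inv_std + beta for v in x]

x = [0.5, 1.5, 2.5]
two_layers = scale_bias(batch_norm(x, mean=1.5, var=1.0), gamma=2.0, beta=0.1)
one_layer = fused_batch_norm(x, mean=1.5, var=1.0, gamma=2.0, beta=0.1)

# Both paths produce the same values up to floating-point rounding.
assert all(math.isclose(a, b) for a, b in zip(two_layers, one_layer))
```

The sketch covers only the forward math. How each fork stores and loads the corresponding weight blobs is a separate question.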
Hi @borisgin, I understand the new BatchNorm (with scale_bias) in NVIDIA/caffe and have used it several times to speed up my training. But this issue is different.
These are the log files:
@mathmanu we managed to reproduce and fix this but it looks like your dataset has different labeling. Therefore, we kindly ask you to verify the fix here: https://github.com/drnikolaev/caffe/tree/caffe-0.17 Release is coming soon.
The fix works. I have tested and confirmed it. Looking forward to the release. Thank you.
@drnikolaev waiting for the release. Hoping that it will come soon.
@mathmanu could you verify https://github.com/drnikolaev/caffe/tree/caffe-0.17 release candidate?
I have already verified the fix as I commented above. It works.
Thank you.
The following model works (gives good accuracy in BVLC caffe): https://github.com/cvjena/cnn-models/tree/master/ResNet_preact/ResNet10_cvgj
However, it doesn't work on NVIDIA/caffe (top-1 accuracy is zero).
Is the blobs_[2]-based scaling that is part of BVLC caffe implemented in NVIDIA/caffe? https://github.com/BVLC/caffe/blob/master/src/caffe/layers/batch_norm_layer.cu#L24 I couldn't find it in NVIDIA/caffe.
I suspect that this may not be the only reason for this mismatch.
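For readers unfamiliar with the scaling in question, here is a rough Python sketch of what the linked BVLC code does when it uses stored statistics: blobs_[0] and blobs_[1] hold accumulated mean and variance, and both are multiplied by the reciprocal of the single value in blobs_[2] (or by 0 if that value is 0). The function name and the sample numbers below are made up for illustration. The sketch shows why a loader that skips this step would see different statistics; it does not claim to describe what NVIDIA/caffe does.

```python
def effective_stats(mean_acc, var_acc, scale_acc):
    """Recover per-channel mean and variance from BVLC-style BatchNorm blobs.

    mean_acc  ~ blobs_[0], var_acc ~ blobs_[1], scale_acc ~ blobs_[2][0].
    """
    # Mirrors the guard in the BVLC layer: a zero accumulator yields zero stats.
    scale_factor = 0.0 if scale_acc == 0 else 1.0 / scale_acc
    mean = [m * scale_factor for m in mean_acc]
    var = [v * scale_factor for v in var_acc]
    return mean, var

# Hypothetical blob contents for a 2-channel layer.
mean_acc, var_acc, scale_acc = [4.0, 8.0], [16.0, 36.0], 4.0

scaled_mean, scaled_var = effective_stats(mean_acc, var_acc, scale_acc)
# What a reader of the blobs would get by ignoring blobs_[2] (factor of 1).
raw_mean, raw_var = effective_stats(mean_acc, var_acc, 1.0)

print(scaled_mean, scaled_var)
print(raw_mean, raw_var)
```

If a pretrained model was saved with a blobs_[2] value other than 1, the raw and scaled statistics differ by that factor, which would be enough to shift every normalized activation.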