NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Difference with BVLC caffe #512

Closed. mathmanu closed this issue 6 years ago.

mathmanu commented 6 years ago

The following model works (gives good accuracy in BVLC caffe) https://github.com/cvjena/cnn-models/tree/master/ResNet_preact/ResNet10_cvgj

However, it does not work with NVIDIA/caffe (top-1 accuracy is zero).

Is the blobs_[2]-based scaling that is part of BVLC Caffe's BatchNorm layer implemented in NVIDIA/caffe? https://github.com/BVLC/caffe/blob/master/src/caffe/layers/batch_norm_layer.cu#L24 I couldn't find it in NVIDIA/caffe.

I suspect that this may not be the only reason for this mismatch.
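
For context, the BVLC BatchNorm layer keeps three blobs: an accumulated mean sum, an accumulated variance sum, and a single scalar recording the accumulated moving-average weight. At inference the stored sums are divided by that scalar (blobs_[2]) to recover the actual mean and variance; that division is the scaling the link above points to. Below is a minimal C++ sketch of the idea, simplified from the linked CUDA source; the function name and plain-vector types are illustrative, not the actual Caffe API.

```cpp
#include <cstddef>
#include <vector>

// Sketch of BVLC BatchNorm's use_global_stats path: blobs_[0]/blobs_[1]
// hold accumulated mean/variance sums, blobs_[2] holds a single scalar
// with the total moving-average weight accumulated so far.
void recover_global_stats(const std::vector<float>& mean_sum,   // blobs_[0]
                          const std::vector<float>& var_sum,    // blobs_[1]
                          float accumulated_weight,             // blobs_[2][0]
                          std::vector<float>* mean,
                          std::vector<float>* variance) {
  // If nothing has been accumulated yet the factor is 0; otherwise the
  // sums are rescaled by 1 / accumulated_weight (the step at
  // batch_norm_layer.cu#L24 referenced above).
  const float scale_factor =
      accumulated_weight == 0.f ? 0.f : 1.f / accumulated_weight;
  for (std::size_t c = 0; c < mean_sum.size(); ++c) {
    (*mean)[c] = scale_factor * mean_sum[c];
    (*variance)[c] = scale_factor * var_sum[c];
  }
}
```

If an implementation skips this division and uses the raw accumulated sums directly, the normalization statistics are wrong and accuracy can collapse, which would match the symptom above.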

whria78 commented 6 years ago

I also failed to fine-tune ResNet-50 (CUDA 9.1, cuDNN 7).

mathmanu commented 6 years ago

A friend of mine confirmed that if the issue I reported regarding blobs_[2]-based scaling in the BatchNorm layer is fixed, then NVIDIA/caffe works fine for this kind of network. The scaling I mentioned is: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/batch_norm_layer.cu#L24

mathmanu commented 6 years ago

@drnikolaev could you ask an expert to look into this?

borisgin commented 6 years ago

Can you attach the NVCaffe log that shows 0 accuracy, please?

First of all, NVIDIA/caffe has BatchNorm and Scale/Bias fused into one layer (see for example https://github.com/NVIDIA/caffe/blob/caffe-0.17/models/resnet18/train_val.prototxt).

So instead of the two layers used in BVLC:

```
layer {
  name: "data_bn"
  type: "BatchNorm"
  bottom: "data"
  top: "data_bn"
  param { lr_mult: 0.0 }
  param { lr_mult: 0.0 }
  param { lr_mult: 0.0 }
}
layer {
  name: "data_scale"
  type: "Scale"
  bottom: "data_bn"
  top: "data_bn"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 1.0 }
  scale_param { bias_term: true }
}
```

you can use one layer which does both BN and scale/bias (and runs through cuDNN):

```
layer {
  name: "conv1/bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1/bn"
  batch_norm_param {
    moving_average_fraction: 0.9
    eps: 0.0001
    scale_bias: true
  }
}
```

mathmanu commented 6 years ago

Hi @borisgin, I understand the new BatchNorm (with scale_bias) in NVIDIA/caffe and have used it several times to speed up my training. But this issue is different.

These are the log files:

run_bvlccaffe.log run_nvcaffe.log

drnikolaev commented 6 years ago

@mathmanu we managed to reproduce and fix this, but it looks like your dataset has different labeling. Therefore, we kindly ask you to verify the fix here: https://github.com/drnikolaev/caffe/tree/caffe-0.17 The release is coming soon.

mathmanu commented 6 years ago

The fix works. I have tested it and confirmed it. Looking forward to the release. Thank you.

mathmanu commented 6 years ago

@drnikolaev still waiting for the release. Hoping it will come soon.

drnikolaev commented 6 years ago

@mathmanu could you verify the release candidate at https://github.com/drnikolaev/caffe/tree/caffe-0.17?

mathmanu commented 6 years ago

I have already verified the fix as I commented above. It works.

drnikolaev commented 6 years ago

thank you