facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

Caffe Translator Error: BatchNorm #472

Open KeyKy opened 7 years ago

KeyKy commented 7 years ago

I want to translate ResNet-152 into caffe2. However, I get this error:

KeyError: 'No translator registered for layer: name: "bn_conv1"\ntype: "BatchNorm"\nbottom: "conv1"\ntop: "conv1"\nbatch_norm_param{\n  use_global_stats: true\n}\n yet.'

Here is how I run caffe_translator.py:

python caffe_translator.py ResNet-152-deploy.prototxt ResNet-152-model.caffemodel \
    --init_net resnet152_init.pb --predict_net resnet152_pred.pb
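
(For context: the error just means caffe_translator.py has no translator registered for the "BatchNorm" layer type. A translator for it would look roughly like the sketch below, following the registration pattern of the other translators in that file. The helper names (BaseTranslate, utils.MakeArgument, utils.NumpyArrayToCaffe2Tensor) and the translator signature are assumptions based on the caffe_translator.py of that era, so treat this as an untested sketch rather than a drop-in fix.)

import numpy as np
from caffe2.python import utils
from caffe2.python.caffe_translator import TranslatorRegistry, BaseTranslate

@TranslatorRegistry.Register("BatchNorm")
def TranslateBatchNorm(layer, pretrained_blobs, is_test):
    # Map Caffe's BatchNorm to Caffe2's SpatialBN (inference mode only here).
    caffe_op = BaseTranslate(layer, "SpatialBN")
    output = caffe_op.output[0]
    caffe_op.arg.extend([
        utils.MakeArgument("is_test", 1),
        utils.MakeArgument("epsilon", layer.batch_norm_param.eps),
        utils.MakeArgument("order", "NCHW"),
    ])
    # Caffe stores mean, variance and a scale factor blob; undo the factor.
    factor = pretrained_blobs[2].flatten()[0]
    factor = 1.0 / factor if factor != 0 else 0.0
    mean = (pretrained_blobs[0].flatten() * factor).astype(np.float32)
    var = (pretrained_blobs[1].flatten() * factor).astype(np.float32)
    channels = mean.size
    # SpatialBN also expects scale/bias inputs; use identity scale and zero
    # bias (the learned affine usually lives in a separate Caffe "Scale" layer).
    caffe_op.input.extend(
        [output + "_scale", output + "_bias", output + "_mean", output + "_var"])
    params = [
        utils.NumpyArrayToCaffe2Tensor(np.ones(channels, dtype=np.float32), output + "_scale"),
        utils.NumpyArrayToCaffe2Tensor(np.zeros(channels, dtype=np.float32), output + "_bias"),
        utils.NumpyArrayToCaffe2Tensor(mean, output + "_mean"),
        utils.NumpyArrayToCaffe2Tensor(var, output + "_var"),
    ]
    return caffe_op, params
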
littleowl commented 7 years ago

I am also receiving an error when trying to convert a ResNet-50:

KeyError: 'No translator registered for layer: name: "bn_1"\ntype: "BatchNorm"\nbottom: "conv_1"\ntop: "conv_1"\nparam {\n lr_mult: 0.0\n decay_mult: 0.0\n}\nparam {\n lr_mult: 0.0\n decay_mult: 0.0\n}\nparam {\n lr_mult: 0.0\n decay_mult: 0.0\n}\nbatch_norm_param {\n use_global_stats: true\n}\n yet.'

littleowl commented 7 years ago

I found a PR that seems to address this issue: https://github.com/caffe2/caffe2/pull/430

With that patch applied, the translation finishes successfully. Hope it works!

KeyKy commented 7 years ago

@littleowl have you successfully run it? I get the following error:

Input index 0 and output idx 0 (conv1) are set to be in-place but this is actually not supported by op SpatialBN. [enforce fail at operator.cc:69] schema->Verify(operator_def). Operator def did not pass schema checking: input: "conv1" input: "conv1_scale" input: "conv1_bias" input: "conv1_mean" input: "conv1_var" output: "conv1" type: "SpatialBN" arg { name: "is_test" i: 1 } arg { name: "epsilon" f: 1e-05 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 3 }

KeyKy commented 7 years ago

I fixed it by editing my prototxt, since SpatialBN in Caffe2 is not an in-place operator. However, when I run it, I get this warning:

W0502 17:55:53.081574 13078 conv_pool_op_base.h:554] You are hitting a case where Caffe's legacy padding calculation is hit. This leads to inefficient and sometimes incorrect results. We are keeping this behavior for backward compatibility, but you are strongly recommended to move away from it.
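
In case it helps anyone reading this later: as far as I understand, that warning comes from Caffe computing pooling output sizes with ceil() while Caffe2's default is floor(), so the translator keeps Caffe's legacy rule to reproduce the original shapes. A tiny standalone illustration in Python (the ResNet pool1 numbers are just an example):

import math

def caffe_pool_out(in_size, kernel, stride, pad):
    # Caffe pooling: ceil-based output size
    return int(math.ceil((in_size + 2 * pad - kernel) / float(stride))) + 1

def caffe2_pool_out(in_size, kernel, stride, pad):
    # Caffe2 default: floor-based output size
    return (in_size + 2 * pad - kernel) // stride + 1

# e.g. ResNet's 3x3, stride-2 max pool on a 112x112 input:
print(caffe_pool_out(112, 3, 2, 0))   # 56
print(caffe2_pool_out(112, 3, 2, 0))  # 55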

littleowl commented 7 years ago

@KeyKy I do actually get the same [enforce fail at operator.cc:69] error as you. I'll check out your prototxt. Thanks!

littleowl commented 7 years ago

Looking at your prototxt file, it seems that for every BatchNorm layer you set the top to match the layer name, so it is no longer in-place. I have done the same and everything seems OK so far. I'm not able to try running it just yet, as I'm trying to get this working on iOS.

KleinYuan commented 7 years ago

@littleowl Very helpful. I also tried that PR and it works for my translation (confirmed).

Primus-zhao commented 7 years ago

@KeyKy, thanks for the exploration. However, there are a lot of BatchNorm layers in ResNet; did you change the prototxt by hand or with a script? If the latter, could you share it? Thanks!

KeyKy commented 7 years ago

@Primus-zhao I changed the prototxt by hand, using Netscope.
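
@Primus-zhao if editing by hand gets tedious, something like the following should do the same edit in one pass: give every in-place BatchNorm layer a top named after the layer and rewire downstream bottoms. This is a rough, untested sketch using Caffe's generated caffe_pb2 and protobuf's text_format; the file names are just placeholders.

from google.protobuf import text_format
from caffe.proto import caffe_pb2  # assumes Caffe's python dir is on PYTHONPATH

net = caffe_pb2.NetParameter()
with open('ResNet-152-deploy.prototxt') as f:
    text_format.Merge(f.read(), net)

rename = {}  # original blob name -> renamed blob name currently in effect
for layer in net.layer:
    # Point bottoms at the latest name of each blob.
    for i, b in enumerate(layer.bottom):
        if b in rename:
            layer.bottom[i] = rename[b]
    if layer.type == 'BatchNorm' and list(layer.bottom) == list(layer.top):
        # Make the layer non-in-place: top becomes the layer name (e.g. bn_conv1).
        old_top = layer.top[0]
        layer.top[0] = layer.name
        rename[old_top] = layer.name
    else:
        # Any later layer that re-defines the original blob name (e.g. the
        # following Scale layer) makes that name current again.
        for t in layer.top:
            rename.pop(t, None)

with open('ResNet-152-deploy-noinplace.prototxt', 'w') as f:
    f.write(text_format.MessageToString(net))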

shahromil16 commented 7 years ago

@KeyKy I used your method to convert Caffe to Caffe2 for SSD (ref: https://github.com/KeyKy/caffe2/blob/master/caffe2/python/examples/ssd/). However, during detection I am hitting the same warning: "You are hitting a case where Caffe's legacy padding calculation is hit. This leads to inefficient and sometimes incorrect results. We are keeping this behavior for backward compatibility, but you are strongly recommended to move away from it."

This gives me false detection bounding boxes. Could you help me with that if you were able to solve the issue?

Thanks!

KeyKy commented 7 years ago

@rams16592 Could you send me an image that produces false detection bounding boxes? Did you try the same image in the original Caffe SSD and compare the results? My email is 370846270@qq.com.

KeyKy commented 7 years ago

@rams16592 I have received your email. I found that detection_out_op is slow because I implemented it on the CPU, while SSD Caffe has a GPU implementation. I will work on it in a few days; hopefully it will be an improvement.

nyyznyyz1991 commented 7 years ago

@KleinYuan Based on the script in https://github.com/caffe2/caffe2/pull/430/files, I can translate my model with no errors, but when I test the new Caffe2 model I find the feature output is wrong: the values are NaN or zero. Have you tested the model you transferred? Is it correct?

KleinYuan commented 7 years ago

@nyyznyyz1991 Yes, I have the same issue and am looking into it.
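
One way to narrow it down might be to check whether the bad values are already in the translated init net weights or only show up during the forward pass. A minimal, untested sketch with the Caffe2 Python workspace (init net file name taken from the command at the top of the thread; adjust for your model):

import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import workspace

init_net = caffe2_pb2.NetDef()
with open('resnet152_init.pb', 'rb') as f:
    init_net.ParseFromString(f.read())

# Run the init net once to materialize all translated parameter blobs.
workspace.RunNetOnce(init_net)
for name in workspace.Blobs():
    blob = workspace.FetchBlob(name)
    if isinstance(blob, np.ndarray) and not np.all(np.isfinite(blob)):
        print('non-finite values in translated blob:', name)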

KeyKy commented 7 years ago

Hi @rams16592, after some hard work I implemented the GPU detection_out_op and a benchmark. See the latest commit. It should be faster than before!

shahromil16 commented 7 years ago

@KeyKy Thank you! I just saw that and tried it. The cost of the detection output has decreased on the Jetson TX1 too. However, while benchmarking I saw that the difference in conv op time between Caffe and Caffe2 is not much. Is that still the case for you? I saw you faced the same problem (ref: https://github.com/caffe2/caffe2/issues/534).

KeyKy commented 7 years ago

@rams16592 Yes. This kind of difference also exists between [MXNet and Caffe](https://github.com/msracver/Deformable-ConvNets), and analogously between Caffe2 and Caffe. Now, what's your detection speed on the Jetson TX1?

shahromil16 commented 7 years ago

@KeyKy I see. Thanks for the update; I understand the reason now.

littleowl commented 7 years ago

@nyyznyyz1991 @KleinYuan - I too have this issue with NaN after patching the caffe_translator.py file and adapting the net structure with Netscope. One thing I noticed was that some of the layers contain really small numbers, like 0.0122e-6 or so. I have no idea if that matters, but it got me thinking that maybe something is wrong with protobuf.

Looking at my setup: I was using protobuf 3.2 to do the translation and 3.1 for the implementation, and the original files were probably made with 2.x.

I'm not sure how to properly update protobuf binaries from 2.x to 3.x, or even whether it's a big deal. Does anyone know if they are compatible?

So I wondered if maybe there were some incompatibilities going on. I then did the translation using protobuf 2.6.1 from Docker and tried that on iOS. Surprisingly, I no longer get NaN at all. That's good news, but not by much, since instead I just get incorrect values of 1.0 and 0.0, no matter what, in my final layer (which only has a length of 2).

Obviously I'm assuming there are going to be problems going from 2.x to 3.x, which means there are a couple of things I can try.

Totally not sure if I'm going down the right rabbit hole or not, but thought I would share my insights.
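
One small thing that may help with the protobuf question above: printing which runtime and which generated module each environment actually loads, on both the translation machine and the device-side build, to rule out a mismatch. Just a quick check, not an answer on 2.x/3.x compatibility:

import google.protobuf
from caffe2.proto import caffe2_pb2

# Report the protobuf runtime version and where the Caffe2 proto module lives.
print('protobuf runtime:', google.protobuf.__version__)
print('caffe2_pb2 loaded from:', caffe2_pb2.__file__)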