Xilinx / CHaiDNN

HLS based Deep Neural Network Accelerator Library for Xilinx Ultrascale+ MPSoCs
317 stars · 151 forks

XportDNN tool throws an error when converting a model with the Xilinx quantization method #95

Closed DanielsJusts closed 5 years ago

DanielsJusts commented 5 years ago

Hello!

I'm having trouble with the XportDNN tool when using the "Xilinx" quantization method. I also posted on the Xilinx Forums, but got no response there, so I thought I might try to get help here.

Here is a link to my Google Drive folder, where you can access the .prototxt and .caffemodel files of my model, if anyone needs them.

The Python script throws an error:

Traceback (most recent call last):
  File "./tools_binaries/quantize.py", line 1307, in <module>
    print sys.argv[0], "--deploy_model", args.deploy_model, "--weights", args.weights, "--quantized_deploy_model", args.quantized_deploy_model, "--quantized_weights", args.quantized_weights, "--calibration_directory", args.calibration_directory, "--calibration_size", args.calibration_size, "--calibration_indices", calibration_indices_str, "--bitwidths", args.bitwidths, "--dims", args.dims, "--transpose", args.transpose, "--channel_swap", args.channel_swap, "--raw_scale", args.raw_scale, "--mean_value", args.mean_value, "--input_scale", args.input_scale, "--gpu "
  File "./tools_binaries/quantize.py", line 1299, in main

  File "./tools_binaries/quantize.py", line 1059, in execute_calibration
    net, bw_layer_in, th_layer_in, bw_layer_out, th_layer_out = quantize_flatten_reshape(name, net, net_parameter, bw_layer_in, th_layer_in, bw_layer_out, th_layer_out)
  File "./tools_binaries/quantize.py", line 249, in quantize_batchnorm
    data = net.blobs[net.top_names[name][0]].data[...]
  File "./tools_binaries/quantize.py", line 102, in ThresholdLayerOutputs_cpu

  File "./tools_binaries/quantize.py", line 40, in compute_threshold
    mn = 0
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/numpy/lib/histograms.py", line 710, in histogram
    bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/numpy/lib/histograms.py", line 333, in _get_bin_edges
    first_edge, last_edge = _get_outer_edges(a, range)
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/numpy/lib/histograms.py", line 253, in _get_outer_edges
    "supplied range of [{}, {}] is not finite".format(first_edge, last_edge))
ValueError: supplied range of [0, inf] is not finite
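For reference, the final failure is simply np.histogram being handed a non-finite range. A minimal standalone reproduction (plain NumPy, independent of quantize.py; the bin count here is arbitrary):

```python
import numpy as np

# compute_threshold builds a histogram over the activation range; once
# a layer's absolute maximum has overflowed to inf, np.histogram
# rejects the range with the ValueError from the traceback.
activations = np.zeros(16, dtype=np.float32)
try:
    np.histogram(activations, bins=2048, range=(0, np.inf))
except ValueError as err:
    print(err)  # the "supplied range ... is not finite" error seen above
```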

I have no clue where to look for the issue. The only "weird" thing I noticed is that the threshold values tend to grow bigger from layer to layer, which led me to believe the issue might come from there, but I don't know how to debug that. An example:

--------------------------------------------------------------------------------
0 95
data, Input
[], ['data']
bw_layer_out:  8
th_layer_out:  150.8812280483544
--------------------------------------------------------------------------------
1 95
conv1, Convolution
['data'], ['conv1']
bw_layer_in:  8
th_layer_in:  150.8812280483544
bw_layer_out:  8
th_layer_out:  265.59699538722634
--------------------------------------------------------------------------------
2 95
bn1, BatchNorm
['conv1'], ['bn1']
bw_layer_in:  8
th_layer_in:  265.59699538722634
bw_layer_out:  8
th_layer_out:  62792.91298723221
--------------------------------------------------------------------------------
3 95
scale1, Scale
['bn1'], ['scale1']
bw_layer_in:  8
th_layer_in:  62792.91298723221
bw_layer_out:  8
th_layer_out:  326706.5521259308
--------------------------------------------------------------------------------
4 95
relu1, ReLU
['scale1'], ['scale1']
--------------------------------------------------------------------------------
5 95
pool1, Pooling
['scale1'], ['pool1']
bw_layer_in:  8
th_layer_in:  326706.5521259308
bw_layer_out:  8
th_layer_out:  326706.5521259308
--------------------------------------------------------------------------------
6 95
conv2, Convolution
['pool1'], ['conv2']
bw_layer_in:  8
th_layer_in:  326706.5521259308
bw_layer_out:  8
th_layer_out:  645658.5820999146
--------------------------------------------------------------------------------
7 95
bn2, BatchNorm
['conv2'], ['bn2']
bw_layer_in:  8
th_layer_in:  645658.5820999146
bw_layer_out:  8
th_layer_out:  80895660.97851562
--------------------------------------------------------------------------------

And right before the error:

64 95
bn15, BatchNorm
['conv15'], ['bn15']
bw_layer_in:  8
th_layer_in:  2.271687002092953e+35
bw_layer_out:  8
th_layer_out:  2.7895644236416093e+37
--------------------------------------------------------------------------------
65 95
scale15, Scale
['bn15'], ['scale15']
bw_layer_in:  8
th_layer_in:  2.7895644236416093e+37
bw_layer_out:  8
th_layer_out:  4.709880642132973e+37
--------------------------------------------------------------------------------
66 95
relu15, ReLU
['scale15'], ['scale15']
--------------------------------------------------------------------------------
67 95
conv16, Convolution
['scale15'], ['conv16']
bw_layer_in:  8
th_layer_in:  4.709880642132973e+37
bw_layer_out:  8
th_layer_out:  3.603168268599782e+37
--------------------------------------------------------------------------------
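Notably, these magnitudes are already close to the float32 maximum (~3.4e38), so it seems plausible that one more BatchNorm/Scale pair pushes a threshold to inf, which would explain the non-finite histogram range. A quick check (plain NumPy; 4.7e37 is the last finite threshold shown above):

```python
import numpy as np

# float32 saturates to inf just past np.finfo(np.float32).max (~3.4e38).
th = np.float32(4.7e37)           # th_layer_out of scale15 above
with np.errstate(over="ignore"):  # silence the overflow warning
    th = th * np.float32(10.0)    # one more order of magnitude
print(np.isinf(th))               # True: the threshold is now inf
```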

Does anyone have any ideas as to what to do now and how to diagnose this issue? Thanks in advance!

bennihoffmann commented 5 years ago

In the error log I read something about "flatten", but in your model file I could not find any flatten layer. Did you remove it?

We ran into some trouble with this type of layer during a conversion from TensorFlow to Caffe and finally to CHaiDNN. In our case it turned out that we did not need any flatten layer at all in Caffe...

Another thing I remember: we had some trouble with the last layer. In our case it was

layer {
  name: "linear22"
  type: "ReLU"
  bottom: "conv22"
  top: "conv22"
  relu_param { negative_slope: 1 }
}

While experimenting, we manually edited it according to the VGG prototxt file from the Model Zoo (https://github.com/Xilinx/CHaiDNN/blob/master/docs/MODELZOO.md) to

layer {
  name: "linear22"
  type: "ReLU"
  bottom: "conv22"
  top: "linear22"
  relu_param { negative_slope: 1 }
}

Those are just the things I remember doing while experimenting with the tools. I do not have any deeper information about the XportDNN tool, so I do not know whether this could fix your problem as well.

DanielsJusts commented 5 years ago

Hello, @bennihoffmann! Thank you for your reply.

I do not have a "flatten" layer anywhere in the network. I do have a "reshape" layer, but the error occurs before execution reaches it.

Thank you for sharing your experience about the problem you had with the last layer. Will keep this in mind in further work.

A somewhat off-topic question: I am also working on TF-to-Caffe model conversion, and this model was in fact converted from TensorFlow. How did you go about converting your model?

adu81020799 commented 5 years ago

I think the problem here is your input file to XportDNN: the tool accepts the format given in the quantization guide, which is the {} format rather than the [] format. You can try the solution I used: https://github.com/Xilinx/CHaiDNN/issues/90#issuecomment-433443329. Let me know if that worked.

DanielsJusts commented 5 years ago

Hello, @adu81020799! Thank you for your reply.

If I understood you correctly, then you think that I should change from this syntax:

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 32
    kernel_size: 3
    pad: 1
    bias_term: true
  }
}

to this:

layer [
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param [
    num_output: 32
    kernel_size: 3
    pad: 1
    bias_term: true
  ]
]

Did I understand you correctly? This doesn't work, XportDNN tool throws an error:

Traceback (most recent call last):
  File "XportDNN.py", line 400, in <module>
  File "XportDNN.py", line 355, in main
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 536, in Merge
    descriptor_pool=descriptor_pool)
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 590, in MergeLines
    return parser.MergeLines(lines, message)
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 623, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 638, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 757, in _MergeField
    merger(tokenizer, message, field)
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 805, in _MergeMessageField
    tokenizer.Consume('{')
  File "/home/edi/Workspaces/SDx/CHaiDNN_repo/tools/chaidnn_tools_ENV/local/lib/python2.7/site-packages/google/protobuf/text_format.py", line 1101, in Consume
    raise self.ParseError('Expected "%s".' % token)
google.protobuf.text_format.ParseError: 3:2 : Expected "{".

Also, I cannot import a model with this syntax in PyCaffe either.

> You can try using the solution which I used for it: #90 (comment)

It seems that you had to convert a Darknet model to Caffe, but I have a TensorFlow checkpoint file, which I converted to Caffe, so I cannot use the method you used (to my knowledge).

adu81020799 commented 5 years ago

Hello, I think it's the other way round. Quoting from the CHaiDNN repo: XportDNN expects the input/data to be defined as a layer:

layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param {
    shape { # As required
      dim: 1
      dim: 3
      dim: 224
      dim: 224
    }
  }
}

I remember that my Darknet model file was in the [] format and I was getting a similar error. When I converted it into the {} syntax, the tool worked for me. I might be wrong (as I am a newbie to the deep learning field).

Regards,
Adarsh

DanielsJusts commented 5 years ago

@adu81020799, I think you are talking about neural network model descriptions in different frameworks. Is this the "[]" format you were talking about (for example):

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

This is an example of my network's first layer in the Darknet model format. The XportDNN tool expects models in the format used by Caffe (the one you showed, or the "{}" format as you call it). I hope this clarifies things for you :)
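Incidentally, those [convolutional]/[maxpool] blocks are INI-style section headers, which is presumably where the "[] format" naming comes from. As a rough illustration (a plain-Python sketch, not part of any conversion tool), Python's configparser reads them directly, whereas Caffe prototxt is protobuf text format and needs braces:

```python
import configparser

# A fragment of a Darknet .cfg file: INI-style "[section]" headers.
cfg_text = """
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2
"""

# strict=False because real Darknet cfgs repeat section names.
parser = configparser.ConfigParser(strict=False)
parser.read_string(cfg_text)
print(parser.sections())                   # ['convolutional', 'maxpool']
print(parser["convolutional"]["filters"])  # '32'
```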

adu81020799 commented 5 years ago

Yes, CHaiDNN expects the Caffe model format. If you give CHaiDNN the Darknet model file, it will throw errors.

DanielsJusts commented 5 years ago

@adu81020799, I do have the model in the Caffe format, and the XportDNN tool is still having issues with it.

anilmartha commented 5 years ago

Hi @DanielsJusts,

The model you are using has BatchNorm and Scale layers. Currently, XportDNN does not support BatchNorm and Scale layers individually. You can fuse the BatchNorm and Scale layers into the convolution, create the fused prototxt and caffemodel, and then run XportDNN on those. Also, CHaiDNN does not support ReLU with a negative slope (Leaky ReLU); you could retrain your model with a regular ReLU and then use it on CHaiDNN.
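For anyone landing here later, the fusion described above folds the BatchNorm statistics and the Scale parameters into the preceding convolution's weights and bias. A rough NumPy sketch of the math (a hypothetical helper for illustration, not a CHaiDNN API; note that Caffe's BatchNorm additionally stores a moving-average factor in blobs[2], by which the saved mean and variance must first be divided):

```python
import numpy as np

def fuse_bn_scale_into_conv(W, b, mean, var, gamma, beta, eps=1e-5):
    """Fold BatchNorm (mean, var) and Scale (gamma, beta) into a
    preceding convolution. W has shape [out_ch, in_ch, kh, kw];
    b, mean, var, gamma, beta each have shape [out_ch].
    Hypothetical helper for illustration only."""
    factor = gamma / np.sqrt(var + eps)        # per-output-channel scale
    W_fused = W * factor[:, None, None, None]  # scale each output filter
    b_fused = (b - mean) * factor + beta       # shift the bias to match
    return W_fused, b_fused
```

The fused convolution computes gamma * (conv(x) - mean) / sqrt(var + eps) + beta per output channel, i.e. exactly the output of the original Conv → BatchNorm → Scale chain, so the quantizer only ever sees one convolution layer.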