KhronosGroup / NNEF-Tools

The NNEF Tools repository contains tools to generate and consume NNEF documents.
https://www.khronos.org/nnef

export of quantized parameters by tensorflow #40

Closed jnorwood closed 5 years ago

jnorwood commented 6 years ago

The binary file support for quantized values, described in the spec, looks pretty good, and I see handling of quantization in the nnef tensorflow exporter.

https://github.com/KhronosGroup/NNEF-Tools/blob/master/converter/tensorflow/src/tf2nnef.py

However, the sample doesn't cover the case for quantized data. https://github.com/KhronosGroup/NNEF-Tools/blob/master/converter/tensorflow/src/sample_export.py

I'm wondering if there are any additional steps required to specify the export format for quantized data, or if there are any built-in limitations for that type of export, since I will need to be using this soon. Thanks.

gyenesvi commented 6 years ago

The export of quantized data from TensorFlow is not yet well developed. Storage of quantization data has been reworked in the final version of NNEF since the provisional one. Furthermore, publicly available pretrained TensorFlow models with quantization are still scarce, so actual test networks would be helpful. If you are aware of such publicly accessible networks, it would be great to know!

How would you train your network in TF such that it is quantized? We are experimenting with tf.contrib.quantize.create_training_graph, and we'll try to add functionality to the exporter so that the resulting graph can be exported in a quantized format. Let us know if you have something similar in mind.

jnorwood commented 6 years ago

The tensorflow documentation below shows a quantize_graph tool.
https://www.tensorflow.org/versions/r1.0/performance/quantization

There are a couple of related papers below. https://arxiv.org/pdf/1806.08342.pdf https://arxiv.org/pdf/1712.05877.pdf

Tensorflow has links to MobileNet models with quantization at this page. https://www.tensorflow.org/hub/modules/image

jnorwood commented 6 years ago

This blog has some info on use of the quantize_graph tool.
https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/

The quantize_graph tool is in this folder. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize/python

jnorwood commented 6 years ago

Another link that might be useful is this from the tensorflow site: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize. I'm reading through their documentation of the quantization provided there, along with the paper from one of my prior posts: https://arxiv.org/pdf/1712.05877.pdf

I'm curious, since you mentioned a new spec, if its quantization persistence format will include some documentation to relate it to the tensorflow lite format's scaling, zero offset and bias parameters, described in the pdf document. They specify an exact integer offset to the zero value, and the range is implied by the number of bits and the scaling factor. They state this is also used in the gemmlowp library.

jnorwood commented 6 years ago

This example for gemmlowp is very simple, and provides generation and quantization of uint8 test values. It seems to me it could be modified into a test program that reads and writes binary data using your NNEF format, to make sure it is compatible with gemmlowp's needs. I made a .sln file for it to use in MSVC; the only mods were adding a preprocessor search path to the source folder and defining NOMINMAX=1, which keeps MSVC from using its own min and max macros, which interfere with the std::min and std::max used in the example. https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

gyenesvi commented 6 years ago

Thanks for the references, we will look into it as soon as possible.

jnorwood commented 6 years ago

This article describes tests of per-channel quantization: https://arxiv.org/pdf/1806.08342.pdf

Both per-layer and per-channel quantization allow for efficient dot product and convolution implementation as the quantizer parameters are fixed per kernel in both cases.

The gemmlowp code handles per-channel quantization in GemmWithOutputPipelinePC, which is documented in https://github.com/google/gemmlowp/blob/master/doc/public.md

So, for gemmlowp per-channel quantization, operations on input tensors with n channels would require filter parameters with n scaling values and n offset values.

Any thoughts on how support for per-channel quantization parameters might be implemented? The spec examples don't mention it.
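
For illustration, here is a minimal NumPy sketch of per-channel affine quantization of a filter tensor with one scale and one zero offset per output channel. The layout assumption (output channel first) and the helper name are my own, not anything defined by the spec or gemmlowp.

import numpy as np

def quantize_per_channel(filter_f32, bits=8):
    # One (scale, zero_point) pair per output channel; filter layout assumed [out_channels, ...].
    qmax = (1 << bits) - 1
    out_channels = filter_f32.shape[0]
    flat = filter_f32.reshape(out_channels, -1)
    mins = np.minimum(flat.min(axis=1), 0.0)    # include 0 so that real 0.0 maps to an exact integer
    maxs = np.maximum(flat.max(axis=1), 0.0)
    scales = np.maximum((maxs - mins) / qmax, 1e-8)            # n scaling values
    zero_points = np.round(-mins / scales).astype(np.int32)    # n offset values
    q = np.round(flat / scales[:, None]) + zero_points[:, None]
    q = np.clip(q, 0, qmax).astype(np.uint8).reshape(filter_f32.shape)
    return q, scales, zero_points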

jnorwood commented 6 years ago

I noticed that PyTorch developers have a couple of projects that attempt to provide quantization support: https://github.com/aaron-xichen/pytorch-playground and https://github.com/eladhoffer/quantized.pytorch. The article associated with the second link, https://arxiv.org/pdf/1805.11046.pdf, mentions use of the gemmlowp quantization format.

jnorwood commented 6 years ago

This C++ code in TensorFlow also provides support for quantize operations. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms

The controls for training are handled using extended options and operations in the graph. It isn't clear to me whether the NNEF Tools are intended to provide support for these types of training and retraining options. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms#quantize_weights

To ensure that outliers don't increase the range too much, and so decrease the accuracy by putting too many bits into rare extreme values, the min_percentile and max_percentile arguments control how the overall min and max are chosen.

jnorwood commented 6 years ago

This Intel Distiller program is open source now, and provides quantization and pruning optimizations. They show export to ONNX, but as you can see by the issue, the ONNX support is not well implemented. So, if you get something working well with NNEF, you might consider posting a note on their site. https://nervanasystems.github.io/distiller/usage/index.html https://github.com/NervanaSystems/distiller/issues/23

jnorwood commented 5 years ago

This Netron viewer works very well for viewing the quantized data in the tensorflow flatbuffer files: https://github.com/lutzroeder/netron. Also, I don't know if these quantized files were available last time I posted, but this link has very complete info: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md. You can use the Netron viewer to see the scaling and offset values for the uint8 weights, as well as the downscale multiplier for the accumulated convolution results. The only thing that isn't evident to me is where to get the values to precondition the input image data.

jnorwood commented 5 years ago

I've generated a graph.nnef and graph.quant file from the MobileNet_v1_1.0_224_quant file at https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

I've run the parser with --quant

A couple of things I've encountered.

  1. The .quant file must have the same base name as the .nnef graph file.
  2. The min and max values require a decimal point; otherwise the parser rejects them as integers where floats are expected.
  3. Once those changes were made, the parser succeeds, but I don't see any difference in the flattened output. Should I see some difference as a result of the insertion of the linear_quantize fragment, or are these quantization parameters somehow associated with the variables, activations, and weights in the internal parser data?

gyenesvi commented 5 years ago

Section 5.1 of the spec (https://www.khronos.org/registry/NNEF/specs/1.0/nnef-1.0.html#container-structure) says that the quant file must be named graph.quant and the main structure file must be graph.nnef, so naturally they have the same base name; that's why the tool expects it that way.

In NNEF, everywhere in the syntax (not just in the quant file), integers are not automatically cast to floats, and arguments must match the types of parameters; no implicit casting is allowed. So yes, a decimal point is always required.

The parser (and the tool) does not do anything with the quantization info except that it parses it and checks it for validity, and returns the quantization info to the caller of the parser. It's the job of the user code to do something with it, for example map operations to actual quantized operations in the backend. The info file only contains quantization algorithm parameters that tell you how to do it, for example quantization ranges. The quant file associates ranges with the activations, and from that you can decide how to parameterise your quantised operations.

jnorwood commented 5 years ago

The valid input for the quantization file is confusing. In one place it says the identifiers must be exported activation tensor names (not variable names). In another place it shows the example below, with identifiers filter1 and bias1 that I would normally associate with variable assignments.

"input": linear_quantize(min = 0.0, max = 1.0, bits = 8); "filter1": linear_quantize(min = -1.0, max = 1.0, bits = 8); "bias1": linear_quantize(min = -1.0, max = 1.0, bits = 8); "conv1": linear_quantize(min = 0.0, max = 1.0, bits = 8); "filter2": linear_quantize(min = -1.0, max = 1.0, bits = 8); "bias2": linear_quantize(min = -1.0, max = 1.0, bits = 8); "conv2": linear_quantize(min = 0.0, max = 1.0, bits = 8);

I used both activation and weight variable names in the .quant file, as in the above lines, and the parser completed successfully.

gyenesvi commented 5 years ago

The spec says identifiers must be tensor names/identifiers, not variable labels (here: https://www.khronos.org/registry/NNEF/specs/1.0/nnef-1.0.html#quantization-format). This section explains the difference between the two: https://www.khronos.org/registry/NNEF/specs/1.0/nnef-1.0.html#exported-ids. Furthermore, it explains that variables can be referenced by two mechanisms (both tensor names and variable labels), and that the quant file must contain the tensor names. So it's okay to use the variable tensor identifier in the quant file.

jnorwood commented 5 years ago

There are two quantization operations defined here: https://www.khronos.org/registry/NNEF/specs/1.0/nnef-1.0.html#quantization-operations, linear_quantize and logarithmic_quantize. Are these the only two fragments that can be used in the graph.quant file, or is it possible to define new types of quantization operations as user-defined fragments and use those in the graph.quant file? That would be useful in my case, since I'm getting a bunch of quantization parameters from tensorflow lite in the form of a float scaling constant and a zero offset, and it would be extra work to derive them again.

gyenesvi commented 5 years ago

Those are the only two predefined ones, but as you say, you can write your own as compound operations and use them in the quant file. Again, the parser only checks validity; the user code has to map it to quantized ops.

A note on parsing custom defined quantization ops: the parsing of the quant file must happen after parsing the custom fragments, but before parsing the graph. In the current tool (C++), it happens in the beginGraph method of the parser callback, where the custom fragments are already known. The Python interface is simpler; it hides these details and you just get back the whole graph along with the quantization info.
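
As a rough illustration of the Python route, something along these lines should work, assuming the nnef module's load_graph entry point and per-tensor quantization dictionaries behave as described above (the exact function and attribute names may differ between versions, and the model path is hypothetical):

import nnef

# Load the model (graph.nnef plus graph.quant and the binary tensor data, if present).
graph = nnef.load_graph('mobilenet_v1_quant.nnef')

# The quantization info comes back with the graph; mapping it onto actual quantized
# backend operations is left to the user code, as described above.
for name, tensor in graph.tensors.items():
    if tensor.quantization:
        print(name, tensor.quantization)   # parameters parsed from graph.quant for this tensor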

jnorwood commented 5 years ago

As an example of the use of the tflite scaling values, see GetQuantizedConvolutionMultipler in this code: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/kernels/kernel_util.cc

They are using the input, output, and filter scaling values to compute the downscale multiplier for the layer. Later they further convert that float value into integer multiply and right-shift-round values. I'd probably want to compute the integer downscale values either at parse time or while creating the .quant files. In either case it would be easier to start with the scaling values rather than having to derive them again. That's where it would be handy to support some user-defined fragments for additional quantization parameters. I know we can add these parameters in the binary tensor files, but it becomes a bit messy to process those during parsing.

jnorwood commented 5 years ago

OK, thanks. The spec could use an update to make it clear that you can reference user-defined fragments in the quant file. That solves my immediate problem.

gyenesvi commented 5 years ago

Yes, it seems like a good idea to create a new quantization fragment for this with all the required parameters and store those in the quant file. Hopefully, the variables can be stored in a quantised format in the binary file using the existing linear quantization algo, while the quantization info will carry further parameters.

Sure, I'll add that to the spec.

jnorwood commented 5 years ago

I wrote some code to dump a quantized MobileNet V1 tflite file to NNEF format, including export of the quantization data to NNEF and a quant file with all the quantization parameters. I have permission to upload it, if you want it. Some of the processing is from the armnn code, but I cut out the boost dependencies, so it is fairly light. It just requires FlatBuffers 1.8, the tflite generated schema, and one source file.

jnorwood commented 5 years ago

I'm attaching the generated text nnef and quant files. These were generated from the tflite MobileNet_v1_1.0_224 at https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

nnef_mobilenetv1.zip


gyenesvi commented 5 years ago

Thanks for the work on tflite conversion. From mentioning boost dependencies, I suspect that the code is in C++. If I understand it right, the binary data itself is not yet converted, that's why the zip does not yet contain it.

We are currently working on reorganizing the code that we have already. Most converter code is in Python, so that it's easier to distribute and use on all platforms. We do plan to write a tflite converter, but all help is welcome. For all converters, we want to provide some base code which makes it easy to add a new converter, and to let all converters have a similar interface.

However, we have to be careful with dependencies, to avoid any licensing issues, so dependency-free code is preferred. In any case, the way to provide contributions would be to fork the repo, add new functionality and then create merge requests that can be tested and reviewed before merging in to the Khronos repo. Let me know if this could work for you.

jnorwood commented 5 years ago

I started with some armnn code from GitHub, but removed the boost dependencies and any use of the armnn descriptors. I just used their code to understand how they were using the tflite data. The only outside dependency now is FlatBuffers 1.8, since their binary is based on a Google flatbuffer schema.

I'm not modifying any of your code. I'm generating nnef code from the tflite flatbuffer network description.

Yes, it is all C++ code. It generates a graph.nnef, a graph.quant, and the NNEF-format binaries for weights and biases, in the directory layout and tensor shapes used by NNEF.

I did generate some quantization data for the weight and bias files, and inserted it there, but I ended up only using the .quant file, where I duplicated the same info, since I needed it for code generation.

I only implemented processing for the tflite operations that were needed for MobileNet V1. The other operations are defined, but just return an unimplemented code. I think you'd find this useful for understanding the tflite conversion. I've verified the MobileNet V1 model generation and operation.

I have permission to upload this converter.

jnorwood commented 5 years ago

To answer your question more specifically: yes, I do generate the binaries. I didn't upload them because they are relatively large. I can upload them if you need them, but you can also just run the converter on the MobileNet V1 tflite file to extract them.

jnorwood commented 5 years ago

Binaries are attached. I see you had a recent fix to the binary format; I don't know if I included that fix in this generation, so these may not be up to date. MobilenetV1.zip

jnorwood commented 5 years ago

The tensorflow lite code for GetQuantizedConvolutionMultipler shows the use of doubles to calculate the downscale constant. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/kernels/kernel_util.cc

They later use it to calculate equivalent multiplier and shift values used in int64 operations in SaturatingRoundingDoublingHighMul and RoundingDivideByPOT in this file https://github.com/amutu/tensorflow_third_party/blob/master/gemmlowp/fixedpoint/fixedpoint.h

My own code currently multiplies by an fp32 downscale constant, but I believe the next version will have to compute the int32 multiplier and right-shift constants, as in the code above, but at code generation time rather than at runtime. That's where I'll want to use the double scalars in the NNEF expressions.
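
For reference, a compile-time decomposition of the double downscale constant into an int32 multiplier and a right shift could look roughly like the sketch below; it mirrors the idea in the TfLite/gemmlowp code linked above, but is my own simplified version rather than the library's implementation, and the scale values in the example are made up.

import math

def quantize_multiplier(downscale):
    # Decompose a real downscale in (0, 1) so that downscale ~= multiplier / 2**31 / 2**shift.
    assert 0.0 < downscale < 1.0
    significand, exponent = math.frexp(downscale)      # downscale = significand * 2**exponent, 0.5 <= significand < 1
    multiplier = int(round(significand * (1 << 31)))   # Q31 fixed-point representation of the significand
    if multiplier == (1 << 31):                        # rounding can push it up to 2**31; renormalize
        multiplier //= 2
        exponent += 1
    return multiplier, -exponent                       # right-shift amount

# Downscale computed at code-generation time from the per-tensor scales (hypothetical values).
input_scale, filter_scale, output_scale = 0.0078125, 0.02, 0.05
m, shift = quantize_multiplier(input_scale * filter_scale / output_scale)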

gyenesvi commented 5 years ago

See my notes on the other thread about the calculations requiring doubles.

About the converter code, it would be helpful to be able to review and build on your code. If I understand it right, it builds on the C++ parser only. Would it be possible to upload that to a separate repo under your name? It can be either a fork of the Khronos repo, or not, it does not matter.

However, to post it along with the other officially supported converters, we would prefer to round it out:

We are working on a conversion pipeline that would make any conversion a simple 3-step process:

read in the source format to an in-memory graph representation
do the conversion on the in-memory graph
write out the resulting graph to the target format

Khronos would provide a common frame for converters to ease the writing of new converters. Reading and writing the NNEF format would be common code to all converters, so a new converter would only require writing the read/write module of the other format (TfLite in this case) and the actual conversion step in memory. Furthermore, our hope is that the actual conversion step is almost the same as for regular TF, so that step does not need to be written again specifically for TfLite.

Let me know if you would be interested in contributing to such a more elaborate version of your converter.

jnorwood commented 5 years ago

This tflite post-training quantization option is relatively new; I ran across it today. It looks like it appeared around Sept 13. It claims to quantize the weights of a trained model without requiring separate quantization-aware training. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/tutorials/post_training_quant.ipynb

gyenesvi commented 5 years ago

Having a more thorough look at the NNEF files you attached, the parameterization of the tflite_quantize fragment is not clear to me. First, the scale parameter together with the min and max parameters seems to be redundant, since as far as I understand from gemmlowp quantization

scale = (max - min) / (2^bits - 1)

Furthermore, what is the purpose of storing the downscale parameter? I thought that's also something that can be calculated from the scale values (at compile time).

I had the impression that you want to reparameterize the quantization op in terms of scale and zero_point as in gemmlowp (and as in TfLite) so that you can store those params directly in NNEF.

The zero_point can be calculated roughly as (some further clamping is required):

zero_point = round(- min / scale)

See ChooseQuantizationParams here: https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

So if you have this parameterization in TfLite, then you can store these parameters in the NNEF quant files, and even store the quantized weights in the binary using min and max values recalculated from the scale and zero_point params. This way the data in the binary and in the quant file would be consistent, and you would have the gemmlowp parameterization readily available from the quant file.
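
For concreteness, a minimal sketch of that parameter conversion in both directions, following the two formulas above (the function names are just for illustration; the clamping follows the ChooseQuantizationParams logic in the linked gemmlowp example):

def range_to_scale_zero_point(min_val, max_val, bits=8):
    # Derive gemmlowp-style (scale, zero_point) from a (min, max) range.
    qmax = (1 << bits) - 1
    scale = (max_val - min_val) / qmax
    zero_point = int(round(-min_val / scale))
    return scale, max(0, min(qmax, zero_point))      # clamp to the representable range

def scale_zero_point_to_range(scale, zero_point, bits=8):
    # Recover the (min, max) range to store with linear_quantize in the binary and quant file.
    qmax = (1 << bits) - 1
    return -zero_point * scale, (qmax - zero_point) * scale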

gyenesvi commented 5 years ago

If I am right, your downscale is the calculated value (input_scale * filter_scale / output_scale) in the gemmlowp implementation. However, I think it would be misleading to store that for the output tensor, since it corresponds to the operation (matmul or conv) and not to the tensor itself. So it does not play a role in storing/interpreting the value of a tensor, but in how its value is calculated, and since the downscale value can be derived from the corresponding scale params of the 3 tensors involved in a matmul or conv (input, filter, output) as above, there is no need to store it.

jnorwood commented 5 years ago

Yes, that's right; I'm currently using the fp32 downscale constant. Each layer converts the data back from an int32 accumulator to a quantized uint8 at a scale determined during training. The downscale operation can be a single fp32 multiply, or an integer multiply and right-shift-round. The tflite code computes the downscale value in double, then uses that double to compute the int32 multiplier and right-shift values for the integer version of the operation.

Yes, I stored the quantization info in the binaries with the tensors initially, until I understood that the quant file does everything I needed. I'll upload the source code today. I'm trying to decide whether I want to pick up a recent change in tflite: they recently updated their generated schema to be compatible with FlatBuffers 1.9, while I've been working with FlatBuffers 1.8. I'll go ahead and check this into a contrib folder in my NNEF branch on GitHub and pick up the flatbuffer update when I've had more time to check it. That means my generated schema will be the one compatible with FlatBuffers 1.8.

jnorwood commented 5 years ago

The code has a cmake build, so it shouldn't be hard to convert. This particular code doesn't build on the parser; it generates input for the parser: the graph.nnef and graph.quant files and the binary weight and bias files, extracted from a quantized tflite file.

Yes, currently only the operations from a single quantized MobilenetV1 have been implemented. The other tflite operations go through a switch to a single NotImplemented function.

Yes, this is only a single direction. Going back to tflite might be a good idea, though. Tensorflow comments indicate they like the flatbuffer format, and other formats seem to be lagging in support for quantized data. I found the flatbuffers very easy to work with; they loaded without a problem in both Python and a C++ tflite development environment.

jnorwood commented 5 years ago

gemmlowp fudges the zero point so that it lands on an exact integer. I want that computed zero point as a constant rather than computing it at runtime. The zero points are used outside the inner loop of the convolution, so that the inner loop is all uint8 multiplies, accumulated into an int32 accumulator.

They also provide the bias scale in the tflite file, but it is equal to input_scale * kernel_scale, and I don't believe it is used anywhere other than as a sanity check.

The downscale multiplier is definitely used at runtime to convert from int32 accumulator to quantized uint8 data with the scale expected by the input of the following layer.
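
For reference, the algebra that lets the zero points move outside the inner loop is just the expansion of the quantized dot product; a small illustrative sketch of that decomposition (my own illustration, not the actual gemmlowp kernel):

def quantized_dot(a_q, b_q, a_zero, b_zero):
    # sum((a - za) * (b - zb)) = sum(a*b) - zb*sum(a) - za*sum(b) + n*za*zb
    acc = 0        # int32 accumulator in a real implementation
    sum_a = 0
    sum_b = 0
    for a, b in zip(a_q, b_q):
        acc += a * b           # inner loop: uint8 * uint8 products only
        sum_a += a
        sum_b += b
    n = len(a_q)
    return acc - b_zero * sum_a - a_zero * sum_b + n * a_zero * b_zero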

gyenesvi commented 5 years ago

I am not saying you should calculate the scale and zero point from the min and max (rather the other way round), so you can use what has been calculated in TF. What I am proposing is that you should export the params as they are from TfLite to NNEF by defining the fragment:

fragment gemmlowp_quantize( input: tensor<scalar>, scale: scalar, zero_point: integer, bits: integer ) 
-> ( output: tensor<scalar> )
{
    min = scalar(-zero_point) * scale;
    max = scalar((2 ^ bits - 1) - zero_point) * scale;
    output = linear_quantize(input, min, max, bits=bits);
}

Then you can store the params exactly as they come from the TfLite file into the quant file, and in the binary, you just set linear quantization with min and max calculated as above. This establishes an exact correspondence between gemmlowp params and the parameterisation of NNEF.

When you process such an NNEF file, you can calculate the downscale value by multiplying the scale params of the input and filter tensors and dividing by that of the output tensor: downscale = (input_scale * filter_scale / output_scale). You can do this calculation at compile time in double, so you reproduce the TfLite calculation exactly.

jnorwood commented 5 years ago

OK, I uploaded the tflite_to_nnef conversion source code. It only has the ops from quantized-inference MobileNet V1. It is currently only tested with MSVC, but I'll pull it down now and try it on Ubuntu 18.04. https://github.com/jnorwood/NNEF-Tools/tree/master/contrib/converters/tflite_converters/tflite_to_nnef

jnorwood commented 5 years ago

I unfortunately used <filesystem> in the MSVC implementation, since I was trying to get away from the boost dependency in the armnn code. It looks like GCC may have some partial support...

gyenesvi commented 5 years ago

Thanks for the code, as you see, it can be cumbersome to develop even simple tools like this in C++ in a platform independent manner. Looking at your code, it seems quite monolithic, without separation of flatbuffer input reading, actual conversion and NNEF output writing. When we integrate TfLite support to our tools, we would prefer to have those phases clearly separated. Nevertheless, your code can be helpful to see the requirements for TfLite conversion.

What is your opinion about the quantization fragment I proposed above? Is it clear how it would avoid the problem of using doubles in the NNEF parser?

jnorwood commented 5 years ago

The quantization fragment you show looks fine. The C++17 <filesystem> seemed like a lighter-weight option than boost::filesystem, and that appears to be the only snag. I see comments online that <filesystem> is available as experimental in GCC 7.x, with more support announced for GCC 8.x, so it might still be preferable if you don't have an alternative. I believe I was just using it to create the subdirectories for the binary files from the path strings.

jnorwood commented 5 years ago

The tflite generated schema on the tensorflow site was recently updated to use the most recent flatbuffer version. I haven't picked up that change yet, so I download the compatible flatbuffer version in the script.

jnorwood commented 5 years ago

I made some mods to main.cpp so that g++-8 can build tflite_to_nnef on Ubuntu. The command line to build it is in the readme. I haven't updated the cmake configuration file for Linux. The changes for using <filesystem> on Linux are from this example, and the build line is also similar to the one given there. https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/1792570 https://github.com/jnorwood/NNEF-Tools/tree/master/contrib/converters/tflite_converters/tflite_to_nnef

jnorwood commented 5 years ago

Furthermore, our hope is that the actual conversion step is almost the same as for regular TF, so that step does not need to be written again specifically for TfLite.

In response to that: tflite is a bit odd. I question whether TF and tflite had a common origin; it looks to me more like they are in the middle of resolving the differences. I do like flatbuffers vs. protobuf, though. https://www.tensorflow.org/lite/tf_ops_compatibility

Also, the C++ code loads and exports that model very fast, while just loading that tflite model in Python takes several seconds.

Much of the converter is hacked up from the armnn tflite parser. I was basically just stopping in the operations and looking at the data structures in the debugger to understand what I needed to extract to get to the nnef code. I removed the boost dependencies, which included all their error checks, and removed the code that was associated with their graph definitions. You might take a look at their code if you have some way to better replace that error handling. https://github.com/ARM-software/armnn/blob/master/src/armnnTfLiteParser/TfLiteParser.cpp

jnorwood commented 5 years ago

Let me know if you would be interested in contributing to such a more elaborate version of your converter.

I can probably help validate some of the generated models from your list, since I also need to implement the same ones.

How do you validate your NNEF graphs now?

jnorwood commented 5 years ago
We are working on a conversion pipeline that would make any conversion a simple 3-step process:

read in the source format to an in-memory graph representation
do the conversion on the in-memory graph
write out the resulting graph to the target format

The tflite flatbuffer struct can be accessed pretty directly from C++; it isn't packed or serialized in any unusual way. https://google.github.io/flatbuffers/ It seems very light for input.
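
For what it's worth, reading the same structures from Python via a flatc-generated module is also fairly direct. This sketch assumes the tflite schema was compiled with flatc --python into a package named tflite, and that the model path is the quantized MobileNet file discussed above; the generated accessor names may differ slightly between schema versions.

from tflite.Model import Model   # generated by: flatc --python schema.fbs

with open('mobilenet_v1_1.0_224_quant.tflite', 'rb') as f:
    buf = f.read()

model = Model.GetRootAsModel(buf, 0)
subgraph = model.Subgraphs(0)
for i in range(subgraph.TensorsLength()):
    tensor = subgraph.Tensors(i)
    quant = tensor.Quantization()
    if quant is not None and quant.ScaleLength() > 0:
        print(tensor.Name(), quant.Scale(0), quant.ZeroPoint(0))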

The ONNX protobuf format is supported by Facebook's Glow and by Intel's nGraph and Distiller programs. If you want to support their kind of transformations/optimizations, a conversion between NNEF and ONNX seems like it would be useful. I may be able to help on the ONNX conversion, since it would help us in evaluating these graph optimization tools.

The Intel MKL-DNN library associates a tag with its tensors (mkldnn_memory_format_t in https://github.com/intel/mkl-dnn/blob/master/include/mkldnn_types.h), which seems to me very helpful for keeping track of what you are processing. I found it very tedious to dig up the required info, so whatever you come up with, it would help if you could somehow tag the tensors with something similar.

gyenesvi commented 5 years ago

Thanks for the link about TFLite and TF compatibility, it is a useful one. For the purpose of offline conversion, speed is not an issue, so it is not a problem that the Python code is slower than the C++ for flatbuffers; portability is of more value to us.

We typically validate our converters by converting to and back from NNEF, and comparing the two models. So, we would start from a TFLite model, convert to NNEF, then back to TFLite and then check if the two TFLite models do the same calculations by executing them in TFLite. I will let you know if we have something to be validated.

A converter between ONNX and NNEF would probably be quite useful, and we'd like that to be implemented as part of the Python toolset of converters that we are working on.

jnorwood commented 5 years ago

The PyTorch Glow app supports quantization. The ONNX persistence format did not yet support it, but they (Facebook) provided persistence support for quantization using Caffe2. There is a Caffe2 quantized ResNet50 example that I don't believe is yet available with tflite. https://github.com/pytorch/glow/blob/master/docs/Quantization.md https://github.com/pytorch/glow/commit/9f706f3298dbe46e2d212df81523b5522b4bd44c https://github.com/caffe2/models/tree/master/resnet50_quantized

jnorwood commented 5 years ago

I'm attaching a quantized MobileNet V2 in NNEF format that I converted from tflite, similar to what I did with MobileNet V1. It includes the uint8 quantization parameters.

It is converted from the tflite Mobilenet_V2_1.0_224_quant from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/models.md#image-classification-quantized-models

Their add operation, used to join the residual bypass, is the only thing really new vs. V1. It has to handle different input scaling for the two inputs; it is in their add.cc file. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/add.cc
MobilenetV2_224_quant.zip
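
As background, the real-valued identity behind the quantized add is s_out * (q_out - z_out) = s_a * (q_a - z_a) + s_b * (q_b - z_b). A float-reference sketch of the per-element rescaling that implies (the actual tflite kernel does this with fixed-point multipliers and shifts rather than floats):

def quantized_add_reference(q_a, z_a, s_a, q_b, z_b, s_b, z_out, s_out, qmin=0, qmax=255):
    # Dequantize each uint8 input with its own scale, add, then requantize to the output scale.
    real_sum = s_a * (q_a - z_a) + s_b * (q_b - z_b)
    q_out = int(round(real_sum / s_out)) + z_out
    return max(qmin, min(qmax, q_out))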

jnorwood commented 5 years ago

I pushed the changes to the tflite_to_nnef conversion to add the MobilenetV2 conversion support. It simply adds the ParseAdd. https://github.com/jnorwood/NNEF-Tools/tree/master/contrib/converters/tflite_converters/tflite_to_nnef

jnorwood commented 5 years ago

I removed the reliance on std::filesystem. I also updated the cmake file so that it downloads the dependencies, builds on Ubuntu, and runs tests of the conversion for MobileNet V1 and V2.

https://github.com/jnorwood/NNEF-Tools/blob/master/contrib/converters/tflite_converters/tflite_to_nnef/CMakeLists.txt

jnorwood commented 5 years ago

I updated the tflite_to_nnef program to convert the four quantized Inception networks from tflite. I've tested that the resulting graphs pass the parser, but haven't yet checked that the networks operate correctly. I'll be checking soon.