KhronosGroup / NNEF-Tools

The NNEF Tools repository contains tools to generate and consume NNEF documents
https://www.khronos.org/nnef

use of float vs double for scalar values #51

Closed jnorwood closed 6 years ago

jnorwood commented 6 years ago

value.h has typedef float scalar_t;

Wouldn't it be preferable to keep everything double while expressions are being evaluated, and let the user round to float as needed?

The reason I'm bringing it up is that I noticed that the tflite quantized downscale integer operations are 64 bit. They compute a 32 bit multiplier and downshift values from double values that are scaling constants for input, output and kernel in GetQuantizedConvolutionMultipler.

For example, one of the mpy and downshift pairs from mobilenetv1 is: mpy=2136047634,shft=7

So, it appears to me that I would need to use double expressions to duplicate their calculations and provide integer downscale multiplier values.
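As a quick check of what that example pair encodes: under the 31-bit fixed-point scheme described later in this thread, the effective real-valued multiplier is mpy / 2^31 / 2^shft. A minimal sketch under that assumption (not TFLite code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Example pair quoted above, from mobilenetv1.
    const int32_t mpy = 2136047634;
    const int shft = 7;

    // Effective real-valued downscale factor encoded by the pair,
    // assuming a 31-bit fixed-point multiplier plus a right shift.
    const double real_multiplier =
        static_cast<double>(mpy) / (int64_t{1} << 31) / (1 << shft);
    std::printf("effective multiplier = %.17g\n", real_multiplier);
    // prints roughly 0.0077709
}
```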

gyenesvi commented 6 years ago

Well, the reason scalar_t is float is that in the deep learning world, typically everything is float32 or even float16, so there seemed to be no reason for doubles. The range of numbers is typically quite small compared to the full range of floats, on the order of a few hundred at most, and it's quite surprising to me that high precision would be required for quantization, which is all about losing precision. I am not yet familiar with the parameters involved in tflite; could you elaborate further on what those numbers mean, and why such big numbers are involved? Is the value mpy=2136047634 an integer? It seems to be quite close to int max.

In any case, if you really need it, you are free to replace float with double in your copy of the code, but I'd prefer to understand whether there is really a need for it.

jnorwood commented 6 years ago

Those numbers are the integer constants associated with a downscale conversion from the int32 convolution accumulator to the uint8 quantized volume data that moves between layers. The multiplier scales the 32-bit signed (int32) accumulator up into an int64, and the shift value then right-shifts that value so that the lower 8 bits of its upper 32 bits hold the uint8 value. The result is further saturated to the 0..0xff range if it is still out of range.
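A simplified sketch of that downscale step (assumed helper name; the actual TFLite/gemmlowp code implements this as a saturating rounding doubling high-multiply followed by a rounding power-of-two divide, with extra care for overflow and negative-value rounding):

```cpp
#include <algorithm>
#include <cstdint>

// Simplified sketch: widen the int32 accumulator to int64, take the rounded
// high 32 bits of the product with the fixed-point multiplier, apply the
// rounded right shift, then saturate to 0..255. The zero-point re-centering
// is an assumption about the surrounding uint8 quantization scheme.
uint8_t DownscaleToUint8(int32_t acc, int32_t mpy, int shft,
                         int32_t output_zero_point) {
    const int64_t prod = static_cast<int64_t>(acc) * mpy;
    const int32_t high = static_cast<int32_t>((prod + (int64_t{1} << 30)) >> 31);
    const int32_t scaled = shft > 0 ? (high + (1 << (shft - 1))) >> shft : high;
    const int32_t result = scaled + output_zero_point;
    return static_cast<uint8_t>(std::min(255, std::max(0, result)));
}
```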

So, in order to create those integer operations, they use a ratio of three double-precision values: the layer input scaling constant, the layer output scaling constant, and the kernel scaling constant. By scaling constant, I'm referring to the quantization scaling constant used by tflite and gemmlowp for their uint8 quantization.

In my case, I'm using a float32 downscaling operation in the target, but I want to be able to match the precision they used as closely as possible so I can compare against their reference code. In order to duplicate their values, I would need to use their double-precision ratio of the scaling constants to compute the downscale constant.
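To illustrate the precision concern, a small sketch with hypothetical scale values (the real ones come from the tflite file):

```cpp
#include <cstdio>

int main() {
    // Hypothetical per-layer quantization scales, stored as float32.
    const float input_scale  = 0.023528f;
    const float kernel_scale = 0.033075f;
    const float output_scale = 0.078431f;

    // TFLite's GetQuantizedConvolutionMultipler computes this ratio in double.
    const double m_double =
        static_cast<double>(input_scale) * kernel_scale / output_scale;

    // The same ratio evaluated entirely in float32, as the parser's
    // scalar_t = float expression evaluator would do.
    const double m_float =
        static_cast<double>(input_scale * kernel_scale / output_scale);

    // The two typically differ past the ~7th significant digit, which is
    // enough to flip low bits of a 30-31 bit integer multiplier.
    std::printf("double: %.17g\nfloat:  %.17g\n", m_double, m_float);
}
```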


gyenesvi commented 6 years ago

Thanks for the more detailed description. From what you describe, those int64 and double values seem to be internal affairs of the quantizer implementation. What is still not clear to me is which values you want to store as doubles in the parser implementation, for which floats are not enough. The ones in your example above are integers, and I guess the actual implementation of tensors on most devices is float, so even if your intermediate calculation is double somewhere, in the end it may get converted to float. Or does tflite use doubles for tensors? That would be surprising to me given that it's designed for mobile.

Have you done any experiments to see whether this actually has an effect on precision? It would be good to know that.

jnorwood commented 6 years ago

I'm currently passing the three float scaling constants through the .quant file to the parser. In the parser I read in those three values, which were all doubles in the tflite GetQuantizedConvolutionMultipler computation, and in the parser's expression evaluator I compute a float32 downscale constant from the three float32 scaling constants, instead of all of these being double values. In the tflite code they further convert the single double downscale constant to the int32 downscale multiplier and the int32 right-shift power-of-two divider.

Currently I'm using the fp32 value for the downscale, and I have done exact comparisons with the tflite output from each layer, by modifying the tflite C++ operations to printf their output in each layer before and after the accumulate, bias-add, and downscale operations. I do see minor differences in the results, but the category selections pass for the test cases I've used. I haven't converted to their int32 downscale operations, but I'm fairly certain the fp32 values cannot match, since their int32 downscale multiplier constants are 30- to 31-bit values.

I have only tested a couple of images from the imagenet dataset so far. I believe they are in the image subdirectory that I uploaded yesterday. I can upload the per-layer comparison data if you need it.

gyenesvi commented 6 years ago

Okay, I have looked at your posts on the other thread, and things are getting clearer. Two notes:

- The actual scale data is stored in floats, so it seems sufficient to do so in the NNEF parser as well.

- Only the function GetQuantizedConvolutionMultipler uses double for the resulting scale, which is the product of the input and weight scales divided by the output scale; maybe more precision is required for that calculation (but I suspect the final result gets converted to float at some point, because TfLite represents the scalar and all tensors as float anyway).

However, that calculation is not really the task of the parser. By default, the parser just reads in the operations for you and you continue with them however you want. It's true that you can use the evaluation capability of the parser to do that calculation while it decomposes the operation, but it's not designed for that, and if it's insufficient for your purposes, you could handle that operation as an atomic one and do the calculation yourself in C code, using doubles. So in your converter, you could handle your custom op as atomic and write the translation code starting from there. Does that seem like a viable path?

jnorwood commented 6 years ago

I can probably work around any float32 issues I have, or else just change the scalar definition to double. I think that is probably useful if I'm trying to exactly match the tflite convolution data, but it may not make any meaningful difference in the image category results.

jnorwood commented 6 years ago

The downscale multiplier and shift constants are converted to int32 values that are used in an int64 downscale operation in the tflite code. I believe I've already sent the relevant links. I don't know of anywhere that they store an fp32 downscale constant; it isn't precomputed in their tflite file. I'm storing it in the quant file.

gyenesvi commented 6 years ago

I did not say they store this value in float. They store the 3 scale values in float, from which they calculate this value in double, and you should do the same instead of storing the precomputed value in the quant file. See the other thread for more details on how I believe the params should be stored in NNEF.
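Putting that recommendation together, a sketch of the flow (assumed names; the frexp-based decomposition approximates TFLite's QuantizeMultiplier rather than copying it verbatim): keep the three per-layer scales as float32 in the quant file, promote to double only for the ratio, then decompose into the integer pair.

```cpp
#include <cmath>
#include <cstdint>

struct DownscaleParams {
    int32_t multiplier;  // 31-bit fixed-point multiplier, e.g. 2136047634
    int right_shift;     // power-of-two divider exponent, e.g. 7
};

// Sketch of the recommended flow: the three scales arrive as float32 (as
// stored in the .quant file), and only the derived downscale constant is
// computed in double, mirroring GetQuantizedConvolutionMultipler.
DownscaleParams ComputeDownscale(float input_scale, float kernel_scale,
                                 float output_scale) {
    const double real_multiplier =
        static_cast<double>(input_scale) * kernel_scale / output_scale;

    // frexp-style decomposition into a 31-bit multiplier and a right shift,
    // assuming the multiplier is below 1 as in the convolution case above.
    int exp = 0;
    const double q = std::frexp(real_multiplier, &exp);  // q in [0.5, 1)
    int64_t q_fixed = static_cast<int64_t>(std::round(q * (int64_t{1} << 31)));
    if (q_fixed == (int64_t{1} << 31)) {  // rounding pushed q up to 1.0
        q_fixed /= 2;
        ++exp;
    }
    return { static_cast<int32_t>(q_fixed), -exp };
}
```

With scales whose double ratio is about 0.0077709, this reproduces the pair quoted earlier in the thread (multiplier 2136047634, right shift 7).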