Closed arvoelke closed 4 years ago
@arvoelke Thanks for the question. Not sure I quite follow the fixed-point notation you have used. Representing A/2^8 with A in [-128, 127] requires 8 (fractional) + 1 (sign) bits => 9 bits. So, how would you represent that in q7? https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/converting-a-neural-network-for-arm-cortex-m-with-cmsis-nn/single-page is the closest how-to page that I could find for those legacy APIs (https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN#legacy-vs-tfl-micro-compliant-apis)
Thanks for getting back to me.
> Representing A/2^8 with A in [-128, 127] requires 8 (fractional) + 1 (sign) bits => 9 bits. So, how would you represent that in q7?
Basically the same as usual, except you're missing the LSB (the 8th fractional bit), since you can only store 7 of the 8 fractional bits. This is what's done here: https://github.com/ARM-software/ML-KWS-for-MCU/blob/master/Deployment/Quant_guide.md (see "and the quantized biases have 8 bits for fractional point (i.e. range -0.5,0.5)", where they are using q7 to quantize the biases). This is from one of the papers in the first link you sent.
Everything works out the same as long as you shift things properly. The problem occurs when a negative shift is needed by the API, such as in my original post. From what I understand, the work-around I suggested is correct, but I'd like to check that I'm not overlooking some other possibility (i.e., an approach that is faster, more accurate, or more direct).
> for those legacy APIs (https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN#legacy-vs-tfl-micro-compliant-apis)
I'm surprised; this is the first place I've seen that has said anything to the effect of the `*_q7` and `*_q15` CMSIS-NN kernels being a legacy API. None of the documentation or various examples I've found hosted by ARM have said this (not even the first link you sent). Supposing I am using q7, is there an advantage in switching to the new API when the legacy API can shift and saturate in one instruction?
Will the legacy API soon be deprecated? Is there any more information on the new API (e.g., benchmarking against the legacy API with x4/opt)?
@arvoelke Thanks for the clarification. The workaround that you suggested, using the q15 API, sounds reasonable. Having looked at the implementation of arm_shift_q7(), my guess is that converting q7 to q15 and using arm_nn_activations_direct_q15() would be faster. Neither is direct, though; I can't think of a more direct one.
All of our current development (post May 2019) is based on TensorFlow Lite's symmetric quantization specification. The examples that you have seen were quite likely made before that. The link I sent was from around the April 2019 time frame, though. The main advantage is that you can directly use a TFL model with the CMSIS-NN APIs. Despite the additional cycles that you pointed out for requantization, improvements have been made in other areas, both in terms of cycles and memory used. You'll notice that most of the common operators (DW conv, 1x1 conv, fully connected) do not use an additional scratch buffer for optimization in the TFLu-compatible versions.
There aren't any plans to deprecate the legacy APIs; only bug fixes are expected. As for benchmarking, because of the differences in what the APIs handle, and the subsequent differences in the models used, it isn't a fair comparison to make. So, unfortunately, there isn't a blog post that I can direct you to for that.
What framework does the model you are using come from?
Here is information about using the new APIs
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/kernels/cmsis-nn
Great, thanks for all the details.
> What framework does the model you are using come from?
We're using Nengo (https://nengo.ai). You can find information on an early iteration targeting ARM here: https://ieeexplore.ieee.org/document/7280390. Since then, several advancements have been made to Nengo to enable optimizing both spiking and non-spiking neural networks (and hybrids) for energy efficiency on ARM cores using backpropagation (among other techniques).
Documentation: https://arm-software.github.io/CMSIS_5/NN/html/group__Acti.html
Suppose my `q7` data represents a Qm.n number of the form Q(-1).8, i.e., representing A/(2^8) where A is the integer value of the data. This would be representing values in the range [-128/256, 127/256] = [-0.5, 0.49609375]. Is there a way to use `arm_nn_activations_direct_q7` on such data, since the `int_width = -1` is supposed to be `unsigned`? Could I call `arm_shift_q7` to right-shift by 1, call `arm_nn_activations_direct_q7` with `int_width = 0`, and then left-shift the result by 1 to get back to Q(-1).8 format? Or would it be faster or more accurate to call `arm_nn_activations_direct_q15` with `int_width = 7` by interpreting the data as `q15` in the form of Q7.8? Is there some example code for doing this? Thanks in advance!