ARM-software / CMSIS_5

CMSIS Version 5 Development Repository
http://arm-software.github.io/CMSIS_5/index.html
Apache License 2.0

Is it possible to use arm_nn_activations_direct_q7 with signed int_width? #892

Closed arvoelke closed 4 years ago

arvoelke commented 4 years ago

Documentation: https://arm-software.github.io/CMSIS_5/NN/html/group__Acti.html

Suppose my q7 data represents a Qm.n number of the form Q(-1).8, i.e., representing A/2^8 where A is the raw integer value of the data. This represents values in the range [-128/256, 127/256] = [-0.5, 0.49609375]. Is there a way to use arm_nn_activations_direct_q7 on such data, given that it would require int_width = -1 while the parameter is supposed to be unsigned?

Could I call arm_shift_q7 to right-shift by 1, call arm_nn_activations_direct_q7 with int_width = 0, and then left-shift the result by 1 to get back to Q(-1).8 format? Or would it be faster or more accurate to call arm_nn_activations_direct_q15 with int_width = 7, interpreting the data as q15 in Q7.8 format? Is there some example code for doing this? Thanks in advance!
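For concreteness, here is a sketch of the first option as I understand it (the helper name is mine, and I'm assuming the activation's table output is in Q0.7 format):

```c
#include "arm_math.h"        // CMSIS-DSP: arm_shift_q7(), q7_t
#include "arm_nnfunctions.h" // CMSIS-NN: arm_nn_activations_direct_q7()

/* Hypothetical helper: in-place sigmoid on q7 data in Q(-1).8 format
 * (real value = raw / 2^8). */
void sigmoid_qm1_8(q7_t *buf, uint16_t size)
{
    /* Right-shift by 1: the same real values reinterpreted as Q0.7,
     * at the cost of the least-significant fractional bit. */
    arm_shift_q7(buf, -1, buf, size);

    /* Table look-up on Q0.7 input (int_width = 0); output assumed Q0.7. */
    arm_nn_activations_direct_q7(buf, size, 0, ARM_SIGMOID);

    /* Saturating left-shift by 1: Q0.7 -> Q(-1).8 (outputs >= 0.5 clip). */
    arm_shift_q7(buf, 1, buf, size);
}
```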

felix-johnny commented 4 years ago

@arvoelke Thanks for the question. I'm not sure I quite follow the fixed-point notation you have used. Representing A/2^8 with A in [-128, 127] requires 8 (fractional) + 1 (sign) bits => 9 bits. So how would you represent that in q7? https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/converting-a-neural-network-for-arm-cortex-m-with-cmsis-nn/single-page is the closest how-to page that I could find for those legacy APIs (https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN#legacy-vs-tfl-micro-compliant-apis)

arvoelke commented 4 years ago

Thanks for getting back to me.

Representing A/2^8 with A in [-128, 127] requires 8 (fractional) + 1 (sign) bits => 9 bits. So how would you represent that in q7?

Basically the same as usual, except you're missing the LSB (the 8th fractional bit), since you can only store 7 of the 8 fractional bits. This is what's done in https://github.com/ARM-software/ML-KWS-for-MCU/blob/master/Deployment/Quant_guide.md (see "and the quantized biases have 8 bits for fractional point (i.e. range -0.5,0.5)", where q7 is used to quantize the biases). That guide is from one of the papers in the first link you sent.
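To make the reinterpretation concrete (illustrative values only):

```c
#include <stdio.h>
#include "arm_math.h" /* q7_t */

int main(void)
{
    q7_t raw = 32; /* 0x20: the same stored byte under two interpretations */
    printf("Q0.7:    %f\n", raw / 128.0); /* 32 / 2^7 = 0.25  */
    printf("Q(-1).8: %f\n", raw / 256.0); /* 32 / 2^8 = 0.125 */
    return 0;
}
```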

Everything works out the same as long as you shift things properly. The problem occurs when the API would need a negative shift, as in my original post. From what I understand, the workaround I suggested is correct, but I'd like to check that I'm not overlooking some other possibility (i.e., an approach that is faster, more accurate, or more direct).

for those legacy APIs (https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN#legacy-vs-tfl-micro-compliant-apis)

I'm surprised; this is the first place I've seen anything to the effect of the *_q7 and *_q15 CMSIS-NN kernels being a legacy API. None of the documentation or the various examples I've found hosted by ARM say this (not even the first link you sent). Supposing I am using q7, is there an advantage in switching to the new API when the legacy API can shift and saturate in one instruction?

Will the legacy API soon be deprecated? Is there any more information on the new API (e.g., benchmarking against the legacy API with x4/opt)?

felix-johnny commented 4 years ago

@arvoelke Thanks for the clarification. The workaround you suggested using the q15 API sounds reasonable. Having looked at the implementation of arm_shift_q7(), my guess is that converting q7 to q15 and using arm_nn_activations_direct_q15() would be faster. Neither is direct, though; I can't think of a more direct approach.
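Something along these lines is what I have in mind (a sketch only; the helper name and scratch handling are hypothetical, and it assumes the q15 activation's output is in Q0.15 format):

```c
#include "arm_math.h"        // CMSIS-DSP: q7_t, q15_t, __SSAT()
#include "arm_nnfunctions.h" // CMSIS-NN: arm_nn_activations_direct_q15()

/* Hypothetical helper: sigmoid on q7 data in Q(-1).8 format via the q15 API.
 * Sign-extending each q7 raw value into q15 keeps the raw integer unchanged,
 * so the same integers can be read as Q7.8 (value = raw / 2^8). */
void sigmoid_qm1_8_via_q15(const q7_t *in, q7_t *out, uint16_t size,
                           q15_t *scratch /* size elements */)
{
    for (uint16_t i = 0; i < size; i++)
    {
        /* Plain sign extension -- not arm_q7_to_q15(), which shifts left by 8. */
        scratch[i] = (q15_t)in[i];
    }

    /* int_width = 7 tells the kernel the input is in Q7.8 format. */
    arm_nn_activations_direct_q15(scratch, size, 7, ARM_SIGMOID);

    /* Assuming Q0.15 output: rescale to Q(-1).8 and saturate to 8 bits. */
    for (uint16_t i = 0; i < size; i++)
    {
        out[i] = (q7_t)__SSAT(scratch[i] >> 7, 8);
    }
}
```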

All of our current development (post May 2019) is based on TensorFlow Lite's symmetric quantization specification. The examples that you have seen were quite likely made before that; the link I sent is from around the April 2019 time frame. The main advantage is that you can directly use a TFL model with the CMSIS-NN APIs. Despite the additional cycles you pointed out for requantization, improvements have been made in other areas, both in cycles and in memory used. You'll notice that most of the common operators (DW conv, 1x1 conv, fully connected) do not use an additional scratch buffer for optimization in the TFLu-compatible versions.

There aren't any plans to deprecate the legacy APIs; only bug fixes are expected. As for benchmarking, because of the differences in what the APIs handle and the subsequent differences in the models used, it isn't a fair comparison to make. So, unfortunately, there isn't a blog post that I can direct you to for that.

What framework does the model you are using come from?

Here is information about using the new APIs:

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/kernels/cmsis-nn

arvoelke commented 4 years ago

Great, thanks for all the details.

What framework does the model you are using come from?

We're using Nengo (https://nengo.ai). You can find information on an early iteration targeting ARM here: https://ieeexplore.ieee.org/document/7280390. Since then, several advancements have been made to Nengo that enable optimizing both spiking and non-spiking neural networks (and hybrids) for energy efficiency on ARM cores using backpropagation (among other techniques).