Closed: reporider closed this issue 5 years ago
Hi reporider, I don't know the answer yet, but I am interested in your platform and test environments. How much slower is the 8-bit model? Do both models have exactly the same structure?
@reporider .. On top of the information requested above, is it the case that you use CMSIS-NN APIs for the 8-bit model and some reference (non-CMSIS-NN) functions for the 16-bit model?
Both models have the same structure and only use CMSIS-NN fully connected and ReLU functions. The model consists of 10 layers (5 fully connected and 5 ReLU layers). The 8-bit model has a runtime of 34 ms and the 16-bit model has a runtime of 23 ms. Runtime is measured as explained here: http://www.keil.com/support/docs/971.htm
@reporider I'll assume for my answer that your target has the DSP extension and that you are using functions from the subset arm_fully_connected_q7/15() or arm_fully_connected_q7/15_opt(). Please correct me if that is not the case.
What the q7 and q15 functions have in common is that both use the __SMLAD intrinsic as the core of the optimization; in terms of bit widths the operation is 32 = 32 + (16 x 16) + (16 x 16), i.e. two 16x16 multiplies accumulated into a 32-bit sum. The 8-bit version requires an additional step to expand the input vector and the weights from 8 bits to 16 bits: for the input vector it uses the additional buffer passed as an argument, and for the weights it rearranges them on the fly. This extra step costs the 8-bit version more cycles than the 16-bit version.
Hope that clarifies.
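To make the cost difference concrete, here is a minimal, portable C sketch of the idea. It emulates the dual 16x16 MAC semantics of __SMLAD in plain C (the real intrinsic operates on packed 32-bit registers) and shows the extra 8-to-16-bit expansion step that only the q7 path needs. The function names and scratch-buffer handling are illustrative, not the actual CMSIS-NN implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef int8_t  q7_t;
typedef int16_t q15_t;

/* Portable stand-in for the Arm __SMLAD intrinsic: two 16x16 multiplies
 * accumulated into a 32-bit sum, i.e. 32 = 32 + (16 x 16) + (16 x 16). */
static int32_t smlad(int16_t a0, int16_t a1, int16_t b0, int16_t b1, int32_t acc)
{
    return acc + (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* q15 dot product (n even): operands are already 16-bit, so each pair
 * feeds straight into the dual MAC. */
static int32_t dot_q15(const q15_t *x, const q15_t *w, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2)
        acc = smlad(x[i], x[i + 1], w[i], w[i + 1], acc);
    return acc;
}

/* q7 dot product (n even): every 8-bit operand must first be sign-extended
 * to 16 bits -- the input vector via the scratch buffer, the weights on
 * the fly. This expansion is the extra work the 16-bit path avoids. */
static int32_t dot_q7(const q7_t *x, const q7_t *w, int n, q15_t *scratch)
{
    for (int i = 0; i < n; i++)      /* 8 -> 16 bit expansion of the input */
        scratch[i] = (q15_t)x[i];
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2) {
        q15_t w0 = (q15_t)w[i];      /* expand weights on the fly */
        q15_t w1 = (q15_t)w[i + 1];
        acc = smlad(scratch[i], scratch[i + 1], w0, w1, acc);
    }
    return acc;
}
```

Both paths compute the same dot product; the q7 path simply spends additional cycles on the widening before the MACs can run.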
@felix-johnny I am using functions from the subset arm_fully_connected_q7/15() or arm_fully_connected_q7/15_opt(), as you said.
Also, in the CNN model (4 convolution layers followed by 2 fully connected layers), the 8-bit CNN model has a runtime of 297 ms and the 16-bit CNN model has a runtime of 655 ms. Can you clarify why, unlike the previous model with only fully connected layers, the 8-bit model now runs faster than the 16-bit model? I use the following functions in the CNN model:
- arm_convolve_HWC_q7/q15_fast_nonsquare()
- arm_fully_connected_q7/15()
@reporider I did a quick check on the cycles for arm_convolve_HWC_q7/q15_fast_nonsquare() using an input tensor of size 16x16x32, a kernel of size 4x4x32x32, an output of size 13x13x32, and Arm Compiler 6 at the -O2 optimization level on a Cortex-M7 target (ST Nucleo board). The q7 version was about 30% slower than the q15 version. The two use differing variants of the im2col optimization technique, so it is fair to say they won't match in cycles consumed.
With the details provided, I am not sure why you see lower performance for the 16-bit version. You could profile the functions one at a time and check. But if you would like us to check, please provide all of the information below.
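For readers unfamiliar with the im2col technique mentioned above, here is a simplified, portable C sketch of the idea for a single input channel with stride 1 and no padding. It is an illustration only, not the CMSIS-NN implementation (which operates on HWC multi-channel data and interleaves the copy with the MACs); all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef int8_t  q7_t;
typedef int16_t q15_t;

/* im2col: copy each output pixel's kh x kw receptive field out as one
 * contiguous column, widening q7 data to q15 on the way (as the q7
 * convolution kernels must). The convolution then reduces to plain
 * dot products / matrix multiplies against the flattened weights. */
static void im2col_q7_to_q15(const q7_t *in, int h, int w,
                             int kh, int kw, q15_t *col)
{
    int oh = h - kh + 1, ow = w - kw + 1;
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++)
            for (int i = 0; i < kh; i++)
                for (int j = 0; j < kw; j++)
                    *col++ = (q15_t)in[(y + i) * w + (x + j)];
}

/* One output pixel = dot product of its column with the flattened kernel. */
static int32_t conv_pixel(const q15_t *col, const q15_t *kernel, int klen)
{
    int32_t acc = 0;
    for (int i = 0; i < klen; i++)
        acc += (int32_t)col[i] * kernel[i];
    return acc;
}
```

The q7 and q15 variants differ in how this gather-and-widen step is arranged, which is why their cycle counts are not expected to match.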
@felix-johnny Please find the information below and help me understand the runtimes of the 8-bit and 16-bit CNN models.
1. API calls and arguments

CNN model architecture (API arguments are similar to the cifar10 example):
- arm_convolve_HWC_q7/q15_fast_nonsquare
- arm_relu_q7/q15
- arm_convolve_HWC_q7/q15_fast_nonsquare
- arm_relu_q7/q15
- arm_convolve_HWC_q7/q15_fast_nonsquare
- arm_relu_q7/q15
- arm_convolve_HWC_q7/q15_fast_nonsquare
- arm_relu_q7/q15
- arm_fully_connected_q7/q15
- arm_relu_q7/q15
- arm_fully_connected_q7/q15
- arm_softmax_q7/q15

2. parameters.h (8-bit/16-bit)
@reporider Is it this example that you are trying to work out? https://github.com/ARM-software/ML-examples/blob/master/cmsisnn-cifar10/code/m4/nn.cpp
@felix-johnny yes, but the application is not image classification. Please let me know if you need more information to check the runtimes of the 8-bit/16-bit CNN models.
The question is about the runtimes of the 8-bit (using 8-bit library functions) and 16-bit (using 16-bit library functions) Cifar-10 examples. Why does the 8-bit model take more time than the 16-bit model? Can anyone explain the reason behind this?