ARM-software / CMSIS_5

CMSIS Version 5 Development Repository
http://arm-software.github.io/CMSIS_5/index.html
Apache License 2.0

Run-time for 8-bit and 16-bit library functions #597

Closed. reporider closed this issue 5 years ago.

reporider commented 5 years ago

The question is about the run time of the 8-bit (using 8-bit library functions) and 16-bit (using 16-bit library functions) CIFAR-10 examples. The 8-bit model takes more time than the 16-bit model. Can anyone explain the reason behind this?

majianjia commented 5 years ago

Hi reporider, I don't know the answer yet, but I am interested in your platform and test environment. How much slower is the 8-bit version? Do both models have exactly the same structure?

felix-johnny commented 5 years ago

@reporider .. On top of the information requested by @majianjia, is it the case that you use the CMSIS-NN APIs for the 8-bit model and some reference (non-CMSIS-NN) functions for the 16-bit one?

reporider commented 5 years ago

Both models have the same structure and only use the CMSIS-NN fully connected and ReLU functions. The model consists of 10 layers (5 fully connected and 5 ReLU layers). The 8-bit model has a run time of 34 ms and the 16-bit model has a run time of 23 ms. Run time is calculated as explained in the link below: http://www.keil.com/support/docs/971.htm
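(For reference, a minimal sketch of one common way to take such measurements on Cortex-M, using the DWT cycle counter; this is an illustrative alternative to the debugger-based method in the linked note, not necessarily the setup used here, and the device header include is a placeholder.)

```c
#include <stdint.h>
#include "ARMCM4.h"   /* placeholder: use your device's CMSIS header */

/* Time a code section with the DWT cycle counter (Cortex-M3/M4/M7).
 * Elapsed ms = cycles / (SystemCoreClock / 1000). */
static uint32_t time_cycles(void (*code_under_test)(void))
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* enable trace/DWT  */
    DWT->CYCCNT = 0U;                               /* reset the counter */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;           /* start counting    */
    code_under_test();                              /* code under test   */
    return DWT->CYCCNT;                             /* elapsed cycles    */
}
```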

felix-johnny commented 5 years ago

@reporider I'll assume for my answer that your target has the DSP extension and that you are using one of the arm_fully_connected_q7/15() or arm_fully_connected_q7/15_opt() functions. Please correct me if that is not the case.

The commonality between the q7 and q15 functions is that they both use the __SMLAD intrinsic as the core of the optimization, which performs, in bits, 32 = 32 + (16 x 16) + (16 x 16). The 8-bit version requires an additional step to rearrange the input vector and the weights from 8 bits into 16 bits. For the input vector it uses the additional buffer provided in the argument, and for the weights it rearranges on the fly. This operation costs more cycles than the 16-bit version needs.
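To illustrate, a sketch of the idea (not the actual library source; __SMLAD, __SXTB16 and __ROR are CMSIS Core intrinsics available on cores with the DSP extension):

```c
#include <stdint.h>
#include <string.h>
#include "cmsis_compiler.h"   /* __SMLAD, __SXTB16, __ROR */

/* q15 path: both operands already sit in 16-bit lanes, so one word load
 * per side feeds __SMLAD directly: acc += a[0]*b[0] + a[1]*b[1]. */
static int32_t mac2_q15(const int16_t *a, const int16_t *b, int32_t acc)
{
    uint32_t wa, wb;
    memcpy(&wa, a, 4);
    memcpy(&wb, b, 4);
    return (int32_t)__SMLAD(wa, wb, (uint32_t)acc);
}

/* q7 path: __SMLAD still wants 16-bit lanes, so each word of four 8-bit
 * values is first sign-extended (__SXTB16 picks bytes 0/2; rotating by 8
 * exposes bytes 1/3).  This unpacking is the extra work that costs the
 * q7 fully connected functions additional cycles. */
static int32_t mac4_q7(const int8_t *a, const int8_t *b, int32_t acc)
{
    uint32_t wa, wb;
    memcpy(&wa, a, 4);
    memcpy(&wb, b, 4);
    uint32_t a02 = __SXTB16(wa);            /* a[0], a[2] as 16-bit */
    uint32_t a13 = __SXTB16(__ROR(wa, 8));  /* a[1], a[3] as 16-bit */
    uint32_t b02 = __SXTB16(wb);
    uint32_t b13 = __SXTB16(__ROR(wb, 8));
    acc = (int32_t)__SMLAD(a02, b02, (uint32_t)acc);
    return (int32_t)__SMLAD(a13, b13, (uint32_t)acc);
}
```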

Hope that clarifies.

reporider commented 5 years ago

@felix-johnny I am using functions from the arm_fully_connected_q7/15() / arm_fully_connected_q7/15_opt() subset, as you said.

Also, in the CNN model (4 convolution layers followed by 2 fully connected layers), the 8-bit CNN model has a run time of 297 ms and the 16-bit CNN model has a run time of 655 ms. Can you clarify this? Here the 8-bit model has a lower run time than the 16-bit model, unlike the previous model with only fully connected layers. I use the following functions in the CNN model:

- arm_convolve_HWC_q7/q15_fast_nonsquare()
- arm_fully_connected_q7/15()

felix-johnny commented 5 years ago

@reporider I did a quick check on the cycles for arm_convolve_HWC_q7/q15_fast_nonsquare() using an input tensor of size 16x16x32, a kernel of size 4x4x32x32, and an output of size 13x13x32, compiled with ARM Compiler 6 at the -O2 optimization level on a Cortex-M7 target (ST Nucleo board). The q7 version was relatively slower (~30%) than the q15 version. They use differing variants of the im2col optimization technique, so it is fair to say they won't match in terms of cycles consumed.
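For intuition, im2col conceptually does something like this (a sketch, not the CMSIS-NN implementation; padding handling is omitted):

```c
#include <stdint.h>

/* Conceptual im2col for HWC layout: gather the kernel-sized patch that
 * feeds output pixel (ox, oy) into one contiguous column, so the whole
 * convolution becomes a matrix multiply.  The q7 and q15 kernels differ
 * in how they gather and widen these columns. */
static void im2col_patch(const int8_t *im, int dim_im_x, int ch_im,
                         int dim_k_x, int dim_k_y, int stride,
                         int ox, int oy, int16_t *col)
{
    for (int ky = 0; ky < dim_k_y; ky++)
        for (int kx = 0; kx < dim_k_x; kx++)
            for (int c = 0; c < ch_im; c++)
                *col++ = im[((oy * stride + ky) * dim_im_x +
                             (ox * stride + kx)) * ch_im + c];
}
```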

With the details provided, I am not sure why you see lower performance for the 16-bit version. You could profile the functions one at a time and check. But if you would like us to check, please provide all of the information below.

  1. API calls complete with all arguments, e.g. arm_convolve_HWC_q15_fast_nonsquare(NA, 32, 32, ...NA, NA) where the NAs are the buffers.
  2. Target in use
  3. Compiler version
  4. Compiler optimization level
reporider commented 5 years ago

@felix-johnny Please find the information below and help me understand the run times of the 8-bit and 16-bit CNN models.

  1. API calls and arguments. The CNN model architecture (API arguments are similar to the cifar10 example):
     - arm_convolve_HWC_q7/q15_fast_nonsquare
     - arm_relu_q7/q15
     - arm_convolve_HWC_q7/q15_fast_nonsquare
     - arm_relu_q7/q15
     - arm_convolve_HWC_q7/q15_fast_nonsquare
     - arm_relu_q7/q15
     - arm_convolve_HWC_q7/q15_fast_nonsquare
     - arm_relu_q7/q15
     - arm_fully_connected_q7/q15
     - arm_relu_q7/q15
     - arm_fully_connected_q7/q15
     - arm_softmax_q7/q15

parameters.h (8-bit/16-bit):

```c
#define CONV1_IP_DIM_X 22
#define CONV1_IP_DIM_Y 1
#define CONV1_IP_CH 4
#define CONV1_KER_DIM_X 3
#define CONV1_KER_DIM_Y 1
#define CONV1_PADDING_X 1
#define CONV1_PADDING_Y 0
#define CONV1_STRIDE_X 1
#define CONV1_STRIDE_Y 1
#define CONV1_OUT_CH 128
#define CONV1_OUT_DIM_X 22
#define CONV1_OUT_DIM_Y 1

#define CONV2_IP_DIM_X 22
#define CONV2_IP_DIM_Y 1
#define CONV2_IP_CH 128
#define CONV2_KER_DIM_X 3
#define CONV2_KER_DIM_Y 1
#define CONV2_PADDING_X 1
#define CONV2_PADDING_Y 0
#define CONV2_STRIDE_X 1
#define CONV2_STRIDE_Y 1
#define CONV2_OUT_CH 64
#define CONV2_OUT_DIM_X 22
#define CONV2_OUT_DIM_Y 1

#define CONV3_IP_DIM_X 22
#define CONV3_IP_DIM_Y 1
#define CONV3_IP_CH 64
#define CONV3_KER_DIM_X 3
#define CONV3_KER_DIM_Y 1
#define CONV3_PADDING_X 1
#define CONV3_PADDING_Y 0
#define CONV3_STRIDE_X 1
#define CONV3_STRIDE_Y 1
#define CONV3_OUT_CH 32
#define CONV3_OUT_DIM_X 22
#define CONV3_OUT_DIM_Y 1

#define CONV4_IP_DIM_X 22
#define CONV4_IP_DIM_Y 1
#define CONV4_IP_CH 32
#define CONV4_KER_DIM_X 3
#define CONV4_KER_DIM_Y 1
#define CONV4_PADDING_X 1
#define CONV4_PADDING_Y 0
#define CONV4_STRIDE_X 1
#define CONV4_STRIDE_Y 1
#define CONV4_OUT_CH 16
#define CONV4_OUT_DIM_X 22
#define CONV4_OUT_DIM_Y 1

#define IP1_DIM (22 * 1 * 16)
#define IP1_OUT 64
#define IP2_DIM 64
#define IP2_OUT 13
```

  2. Target - nRF52840 (Cortex-M4)
  3. Compiler version - V6.10.1
  4. Compiler optimization level - Level 0 (-O0)
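For reference, this is roughly how the first q7 layer pair is invoked with these parameters (a sketch only; the shift macros, weights, and buffers below are placeholders, not the actual project code):

```c
#include "arm_nnfunctions.h"

/* Placeholder quantization shifts -- these are NOT in parameters.h above */
#define CONV1_BIAS_LSHIFT 0
#define CONV1_OUT_RSHIFT  9

static q7_t  conv1_wt[CONV1_OUT_CH * CONV1_KER_DIM_X * CONV1_KER_DIM_Y * CONV1_IP_CH];
static q7_t  conv1_bias[CONV1_OUT_CH];
static q7_t  conv1_out[CONV1_OUT_DIM_X * CONV1_OUT_DIM_Y * CONV1_OUT_CH];
/* im2col scratch: 2 * ch_im_in * dim_kernel_x * dim_kernel_y entries */
static q15_t col_buf[2 * CONV1_IP_CH * CONV1_KER_DIM_X * CONV1_KER_DIM_Y];

static void run_conv1(const q7_t *input)
{
    arm_convolve_HWC_q7_fast_nonsquare(input,
        CONV1_IP_DIM_X, CONV1_IP_DIM_Y, CONV1_IP_CH,
        conv1_wt, CONV1_OUT_CH,
        CONV1_KER_DIM_X, CONV1_KER_DIM_Y,
        CONV1_PADDING_X, CONV1_PADDING_Y,
        CONV1_STRIDE_X, CONV1_STRIDE_Y,
        conv1_bias, CONV1_BIAS_LSHIFT, CONV1_OUT_RSHIFT,
        conv1_out, CONV1_OUT_DIM_X, CONV1_OUT_DIM_Y,
        col_buf, NULL);
    arm_relu_q7(conv1_out, CONV1_OUT_DIM_X * CONV1_OUT_DIM_Y * CONV1_OUT_CH);
}
```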
felix-johnny commented 5 years ago

@reporider Is it this example that you are trying to work out? https://github.com/ARM-software/ML-examples/blob/master/cmsisnn-cifar10/code/m4/nn.cpp

reporider commented 5 years ago

@felix-johnny Yes, but the application is not image classification. Please let me know if you need more information to check the run times of the 8-bit and 16-bit CNN models.