ARM-software / CMSIS_5

CMSIS Version 5 Development Repository
http://arm-software.github.io/CMSIS_5/index.html

Performance of arm_fully_connected_mat_q7_vec_q15 #557

Closed kenarsa closed 4 years ago

kenarsa commented 5 years ago

Is there a benchmark comparing runtime of the following?

1. arm_fully_connected_mat_q7_vec_q15
2. arm_fully_connected_mat_q7_vec_q15_opt
3. ANSI C implementation

I tried to look into https://arxiv.org/abs/1801.06601 but unfortunately couldn't find a benchmark for the FC layer. The reason I am asking is that I compared 1 to 3 and I am seeing a 20% slowdown. I want to make sure that I am not missing something obvious. Thanks!

majianjia commented 5 years ago

I don't know exactly how much faster it can be. The idea is to perform 4 MACs in parallel in each loop iteration, instead of 1, using SIMD instructions. To enable it, you need to make sure your project is optimised for DSP and that FPU support is turned on. Otherwise it still uses a plain for loop to do the job.

I am using MDK, so I have to add these to my project's preprocessor defines to enable them: ARM_MATH_CM4, __FPU_PRESENT=1. If you haven't, please adjust them for your environment.
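As a rough illustration of that 4-MAC idea (just a sketch using the standard CMSIS intrinsics, not the actual CMSIS-NN source; the helper `dot_q15` is made up for the example): when ARM_MATH_DSP is defined, the __SMLAD intrinsic performs two signed 16x16 multiply-accumulates in one instruction, so two of them per loop iteration cover four products; otherwise the plain C loop runs one MAC at a time. The q7 kernels additionally widen the 8-bit weights to 16-bit pairs before the MACs, but the loop structure follows the same idea.

```c
/* Sketch of the idea only, not the actual CMSIS-NN source.
 * With the DSP extension, __SMLAD performs two signed 16x16 MACs per
 * instruction, so two __SMLADs per iteration consume 4 elements. */
#include <string.h>
#include "arm_math.h"

static q31_t dot_q15(const q15_t *a, const q15_t *b, uint32_t len)
{
    q31_t acc = 0;
    uint32_t i = 0;
#if defined (ARM_MATH_DSP)
    for (; i + 4U <= len; i += 4U)
    {
        q31_t a01, a23, b01, b23;
        memcpy(&a01, &a[i],      sizeof(a01));  /* two q15 values per 32-bit word */
        memcpy(&a23, &a[i + 2U], sizeof(a23));
        memcpy(&b01, &b[i],      sizeof(b01));
        memcpy(&b23, &b[i + 2U], sizeof(b23));
        acc = (q31_t)__SMLAD((uint32_t)a01, (uint32_t)b01, (uint32_t)acc); /* 2 MACs */
        acc = (q31_t)__SMLAD((uint32_t)a23, (uint32_t)b23, (uint32_t)acc); /* 2 MACs */
    }
#endif
    for (; i < len; i++)                         /* plain C path / leftover elements */
    {
        acc += (q31_t)a[i] * b[i];
    }
    return acc;
}
```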

kenarsa commented 5 years ago

I am profiling this on a Cortex-M7 (i.MX RT1050) and have the following enabled when compiling: ARM_MATH_CM7, __FPU_PRESENT=1

I am using MCUXpresso. Also, when compiling I get the warning below on every line that uses __SIMD32(); I don't know if this helps ...

warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]

For the reason you mentioned I was expecting a speedup of around a factor of 4, but unfortunately I can't get any speedup.

majianjia commented 5 years ago

@kenarsa I haven't used MCUXpresso before, so I cannot help you with the environment. The __SIMD32 macro is the key to the speedup.

I am using the 8-bit variants:

  1. arm_fully_connected_mat_q7
  2. arm_fully_connected_mat_q7_opt
  3. Pure C implementation without opt
  4. Pure C implementation with opt

I did a quick benchmark on my platform, an STM32L476 overclocked to ~~176 MHz~~ 156 MHz. The results for a 768 x 96 layer are:

  1. 1859 us
  2. 1199 us
  3. 4834 us
  4. 2926 us

The total MAC op count is 73728 (768 x 96), which is simply input x output. The MAC ops/us and MAC ops/MHz are listed below; higher is better.

  1. 39.66 MAC/us / 0.25 MAC/MHz
  2. 61.49 MAC/us / 0.39 MAC/MHz
  3. 15.25 MAC/us / 0.10 MAC/MHz
  4. 25.19 MAC/us / 0.16 MAC/MHz

The MAC ops/MHz figure should help you compare your implementation, since the M7 and M4 are optimised in the same way. From my understanding, the 16-bit and 8-bit versions should have the same timing performance. Please point it out if I am wrong.
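To spell out how those figures are derived (a quick worked check for variant 1, assuming the 156 MHz clock mentioned above):

```c
#include <stdio.h>

/* Worked check for variant 1: 73728 MACs measured at 1859 us on a 156 MHz core. */
int main(void)
{
    const double mac_ops = 768.0 * 96.0;   /* 73728 MAC operations (input x output) */
    const double time_us = 1859.0;          /* measured runtime in microseconds      */
    const double clk_mhz = 156.0;           /* core clock in MHz                     */

    const double ops_per_us  = mac_ops / time_us;    /* ~39.66 MAC/us                 */
    const double ops_per_mhz = ops_per_us / clk_mhz; /* ~0.25 MAC/MHz, i.e. per cycle */

    printf("%.2f MAC/us, %.2f MAC/MHz\n", ops_per_us, ops_per_mhz);
    return 0;
}
```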

Edited: corrected the numbers.

JonatanAntoni commented 5 years ago

Hey all,

The compiler warnings on the __SIMD32 macro are caused by breaking the strict-aliasing rule. In fact, the macro does evil things with the data pointers, and depending on how aggressively your compiler optimizes, the result might or might not work correctly. Hence we have deprecated this macro in the meantime. We need to look for a better solution while keeping performance in mind.

Cheers, Jonatan

kenarsa commented 5 years ago

@majianjia

Thank you very much. This is extremely helpful. One quick question: what is the difference between the C implementation with opt and without opt? Just wanted to make sure I understand it correctly.

One thing to keep in mind is that I am using GCC. I am not sure if the performance difference is because of that. Thoughts?

This is how __SIMD32 is defined in my toolchain:

#define __SIMD32(addr) (*(__SIMD32_TYPE **) & (addr))
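That cast is exactly what triggers -Wstrict-aliasing: the q7_t/q15_t buffer ends up being read through an int32_t lvalue. A minimal aliasing-safe alternative (just an illustration, not the CMSIS fix) loads the word through memcpy, which compilers typically lower to a single 32-bit load on Cortex-M:

```c
#include <stdint.h>
#include <string.h>

/* Aliasing-safe 32-bit load of two adjacent q15 values (illustration only). */
static inline int32_t read_two_q15(const int16_t *p)
{
    int32_t val;
    memcpy(&val, p, sizeof(val));   /* no pointer type-punning, so no undefined behaviour */
    return val;
}
```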

kenarsa commented 5 years ago

Thank you, @JonatanAntoni. Now it makes sense. I was confused for a bit since it is a deprecated macro but is also used heavily. I agree that its implementation is a bit scary.

majianjia commented 5 years ago

@kenarsa Before getting to the C implementations, I think it is easier to explain how the _opt variant works first.

arm_fully_connected_mat_q7
arm_fully_connected_mat_q7_opt

Both use SIMD-type instructions (Single Instruction, Multiple Data, if you are not familiar with the term) for optimisation on Arm chips. They take different weight formats as input: the opt version takes reordered weights. arm_fully_connected_mat_q7 has to reorder the weights before the SIMD operation, while the other one does not. That is where the performance difference comes from.

The same thing applies to the C implementations. The C implementations with/without opt (optimisation) that I am using are modified from nn_test. They are pure C, without SIMD-type instructions and without SSAT/USAT (for overflow saturation). The performance differs because of the different for-loops: the one using reordered weights gives the compiler more options for optimisation, since it still tries to do 4 MAC operations together, while the loop in the other one does one MAC at a time.
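A plain-C sketch of that difference (illustrative only; the real _opt kernels interleave the reordered weights across several output neurons, but the effect on the inner loop is similar): the straightforward version does one MAC per iteration, while the unrolled version accumulates four products per iteration, which gives the compiler more room to schedule loads and arithmetic.

```c
#include <stdint.h>

/* One MAC per iteration over the original weight layout. */
static int32_t fc_dot_plain(const int8_t *w, const int8_t *x, uint32_t dim)
{
    int32_t acc = 0;
    for (uint32_t i = 0; i < dim; i++)
    {
        acc += (int32_t)w[i] * x[i];
    }
    return acc;
}

/* "Opt"-style plain C: 4-way unrolled loop (assuming dim is a multiple of 4
 * for brevity). */
static int32_t fc_dot_unrolled(const int8_t *w, const int8_t *x, uint32_t dim)
{
    int32_t acc = 0;
    for (uint32_t i = 0; i < dim; i += 4U)
    {
        acc += (int32_t)w[i]      * x[i];
        acc += (int32_t)w[i + 1U] * x[i + 1U];
        acc += (int32_t)w[i + 2U] * x[i + 2U];
        acc += (int32_t)w[i + 3U] * x[i + 3U];
    }
    return acc;
}
```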

I am not familiar with GCC, but if you can see in the source code that the part guarded by #if defined (ARM_MATH_DSP) is compiled and linked into your project, it should mean that you have successfully turned on the optimisation.

kenarsa commented 5 years ago

Thanks a lot @majianjia. I am sure the code using SIMD is being executed. I think there are 2 possibilities at this point:

1. GCC generates much slower assembly than MDK.
2. I am storing the DNN coefficients in flash. In your tests, did you store the DNN weights in flash or RAM? I am thinking read wait states might be killing all the benefit ...

Again, thanks a bunch for getting back to me so fast.

majianjia commented 5 years ago

@kenarsa

  1. I don't think the compiler itself makes too much difference. However, the optimisation level affects things a lot. I use the default full optimisation on MDK, which is the highest option.

  2. If you are using an external QSPI flash on the RT1052, that might be the case. I am using the embedded flash for the weights, which benefits from ST's ART accelerator.

I would suggest you calculate your performance and compare it to the results I tested above.

Here is a test result from an MNIST model with a few convolutional and dense layers, using all available opt operations.

INFO: Start compile...
Layer        Activation    output shape      ops          memory            mem life-time
----------------------------------------------------------------------------------------------
 Input      -          - (  28,  28,   1)        0   (  784,  784,    0)    1 - - -  - - - - 
 Conv2D     - ReLU     - (  28,  28,  12)    84672   (  784, 9408,  432)    1 1 - -  - - - - 
 Max_Pool   -          - (  14,  14,  12)        0   ( 9408, 2352,    0)    1 - 1 -  - - - - 
 Conv2D     - ReLU     - (  14,  14,  24)   508032   ( 2352, 4704,  864)    1 1 - -  - - - - 
 Max_Pool   -          - (   7,   7,  24)        0   ( 4704, 1176,    0)    1 - 1 -  - - - - 
 Conv2D     - ReLU     - (   7,   7,  48)   508032   ( 1176, 2352, 1728)    1 1 - -  - - - - 
 Max_Pool   -          - (   4,   4,  48)        0   ( 2352,  768,    0)    1 - 1 -  - - - - 
 Dense      - ReLU     - (  96,   1,   1)    73728   (  768,   96,  768)    1 1 - -  - - - - 
 Dense      -          - (  10,   1,   1)      960   (   96,   10,   96)    1 - 1 -  - - - - 
 Softmax    -          - (  10,   1,   1)        0   (   10,   10,    0)    - 1 - -  - - - - 
 Output     -          - (  10,   1,   1)        0   (   10,   10,    0)    1 - - -  - - - - 
----------------------------------------------------------------------------------------------
INFO: memory analysis result
 Block0: 1728  Block1: 2352  Block2: 9408  Block3: 0  Block4: 0  Block5: 0  Block6: 0  Block7: 0  
 Total memory cost by network buffers: 13488 bytes

Test frames: 10001
Test running time: 485 sec
Model running time: 255864 ms
Average prediction time: 25583 us
Average effeciency: 45.96 ops/us
Average frame rate: 40.0 Hz
Top 1 Accuracy: 99.25% 
Top 2 Accuracy: 99.75% 
Confusion matrix:
predic     0     1     2     3     4     5     6     7     8     9
actual
   0 |   977     0     1     0     0     0     2     1     0     0   |  99%
   1 |     0  1133     0     2     0     0     0     0     0     0   |  99%
   2 |     1     0  1021     1     1     0     0     7     1     0   |  98%
   3 |     0     0     0  1006     0     2     0     1     1     0   |  99%
   4 |     0     0     1     0   977     0     3     0     0     1   |  99%
   5 |     1     0     0     5     0   885     1     0     0     0   |  99%
   6 |     4     2     0     0     1     1   948     0     2     0   |  98%
   7 |     0     2     1     1     0     0     0  1023     0     1   |  99%
   8 |     1     0     1     4     0     1     0     1   963     3   |  98%
   9 |     1     0     0     1     4     3     0     5     2   993   |  98%

Print running stat..
Layer(#)        -   Time(us)      ops(MACs)     ops/us 
--------------------------------------------------------
#1        Input -        10              0      
#2       Conv2D -      5162          84672      16.40
#3     Max_Pool -       536              0      
#4       Conv2D -      7872         508032      64.53
#5     Max_Pool -       227              0      
#6       Conv2D -      7404         508032      68.61
#7     Max_Pool -       105              0      
#8        Dense -      1197          73728      61.59
#9        Dense -        21            960      45.71
#10     Softmax -         3              0      
#11      Output -         1              0      
Summary.
Total ops (MAC): 1175424
Prediction time :22538us
Efficiency 52.15 ops/us
Total Memory cost (Network and NNoM): 15060

If you are interested in implementations like this, you can check out the NNoM framework. It really makes life easier.

kenarsa commented 5 years ago

Many thanks, @majianjia. This is helpful.

We have a fixed-point inference engine in-house. We tried to design it so that it can support multiple DSPs as well as MCUs. I am now trying to optimize it for Arm Cortex-M4/M7. Here is a link to the application we are working on, if you are interested: https://www.youtube.com/watch?v=WadKhfLyqTQ

Let me work on this a bit and I will get back to you, hopefully early next week :)

majianjia commented 5 years ago

@kenarsa Thank you for the great demo, very impressive work, well done! Looking forward to your update.

kenarsa commented 5 years ago

Sorry for the delay. With the toolchain I use (GCC), a pure C implementation that unrolls the inner loop is faster than the CMSIS version. Reads from the external flash could play a part as well, so I will verify by running from RAM or on-chip flash next. I will also give MDK a try and see if it differs from GCC.

felix-johnny commented 4 years ago

@kenarsa This is probably so late that it isn't relevant anymore, but do you have the -fno-builtin option set? That, combined with memcpy calls, could be why you don't see the expected performance uplift. https://github.com/tensorflow/tensorflow/issues/37361 is a similar issue.

kenarsa commented 4 years ago

It didn't make a difference, actually, but thanks for the follow-up. I'm closing the issue for now; this may be a side effect of the external flash on the i.MX RT.