Samsung / ONE

On-device Neural Engine

[onert] Quantization kernel performance and memory usage status #4066

Open · hseok-oh opened this issue 4 years ago

hseok-oh commented 4 years ago

Test model

Test setting

Performance result (Ubuntu 18.04)

Execution time

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 12.791 | 8.962 | 40.188 | 12.042 |
| comparison | 42.98 | 42.749 | 127.911 | 126.398 |
| tensor000 | 37.934 | 15.899 | 9.361 | 4.794 |
| tensor001 | 103.697 | 69.477 | 51.951 | 59.625 |
| unary | 129.576 | 137.853 | 279.145 | 122.516 |
| inception_v3 | 1773.688 | 1541.889 | 520.773 | 357.919 |
| inception_v4 | 3443.901 | 3075.055 | 1838.491 | 778.056 |
| mobilenet_v1_1.0_224 | 307.432 | 497.351 | 66.157 | 47.396 |

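Comparison with tflite(float) (speedup = tflite(float) time / backend time; higher is faster)

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 1 | 1.43 | 0.32 | 1.06 |
| comparison | 1 | 1.01 | 0.34 | 0.34 |
| tensor000 | 1 | 2.39 | 4.05 | 7.91 |
| tensor001 | 1 | 1.49 | 2.00 | 1.74 |
| unary | 1 | 0.94 | 0.46 | 1.06 |
| inception_v3 | 1 | 1.15 | 3.41 | 4.96 |
| inception_v4 | 1 | 1.12 | 1.87 | 4.43 |
| mobilenet_v1_1.0_224 | 1 | 0.62 | 4.65 | 6.49 |
| Geomean | 1 | 1.18 | 1.36 | 2.29 |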
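The comparison table can be reproduced from the raw execution times. Each cell is the tflite(float) time divided by that backend's time, and the Geomean row is the geometric mean of those ratios. Below is a minimal Python sketch. It assumes the geometric mean is taken over the unrounded ratios; the issue does not say which, but rounded and unrounded inputs give the same two-decimal values here.

```python
import math

# Execution times copied from the Ubuntu 18.04 table above (same units as the table).
# Column order: tflite(float), cpu(float), tflite(quint8), cpu(quint8)
TIMES = {
    "arithmetic": (12.791, 8.962, 40.188, 12.042),
    "comparison": (42.98, 42.749, 127.911, 126.398),
    "tensor000": (37.934, 15.899, 9.361, 4.794),
    "tensor001": (103.697, 69.477, 51.951, 59.625),
    "unary": (129.576, 137.853, 279.145, 122.516),
    "inception_v3": (1773.688, 1541.889, 520.773, 357.919),
    "inception_v4": (3443.901, 3075.055, 1838.491, 778.056),
    "mobilenet_v1_1.0_224": (307.432, 497.351, 66.157, 47.396),
}


def speedups(times, baseline=0):
    """Speedup of every column relative to the baseline column: baseline time / time."""
    return {name: [row[baseline] / t for t in row] for name, row in times.items()}


def geomean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = speedups(TIMES)
for col in range(4):
    print(round(geomean(r[col] for r in ratios.values()), 2))
```

A geometric mean is the usual choice for averaging speedup ratios, because a 2x speedup and a 2x slowdown cancel out.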

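Comparison with tflite(quint8) (speedup = tflite(quint8) time / cpu(quint8) time; higher is faster)

| Model | tflite(quint8) | cpu(quint8) |
|---|---|---|
| arithmetic | 1 | 3.34 |
| comparison | 1 | 1.01 |
| tensor000 | 1 | 1.95 |
| tensor001 | 1 | 0.87 |
| unary | 1 | 2.28 |
| inception_v3 | 1 | 1.46 |
| inception_v4 | 1 | 2.36 |
| mobilenet_v1_1.0_224 | 1 | 1.40 |
| Geomean | 1 | 1.68 |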

Memory usage

Usage (KB)

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 24756 | 20784 | 9044 | 9328 |
| comparison | 26328 | 22132 | 12724 | 13028 |
| tensor000 | 21820 | 20864 | 8304 | 9528 |
| tensor001 | 41416 | 37308 | 13020 | 13704 |
| unary | 21884 | 20824 | 8220 | 9572 |
| inception_v3 | 208084 | 115824 | 28120 | 39224 |
| inception_v4 | 351908 | 188444 | 52264 | 58640 |
| mobilenet_v1_1.0_224 | 44448 | 31396 | 7252 | 12252 |

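Comparison with tflite(float) (memory usage as a percentage of tflite(float); lower is better)

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 100% | 84% | 37% | 38% |
| comparison | 100% | 84% | 48% | 49% |
| tensor000 | 100% | 96% | 38% | 44% |
| tensor001 | 100% | 90% | 31% | 33% |
| unary | 100% | 95% | 38% | 44% |
| inception_v3 | 100% | 56% | 14% | 19% |
| inception_v4 | 100% | 54% | 15% | 17% |
| mobilenet_v1_1.0_224 | 100% | 71% | 16% | 28% |
| Geomean | 100% | 77% | 27% | 32% |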

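Comparison with tflite(quint8) (cpu(quint8) memory usage as a percentage of tflite(quint8); lower is better)

| Model | tflite(quint8) | cpu(quint8) |
|---|---|---|
| arithmetic | 100% | 103% |
| comparison | 100% | 102% |
| tensor000 | 100% | 115% |
| tensor001 | 100% | 105% |
| unary | 100% | 116% |
| inception_v3 | 100% | 139% |
| inception_v4 | 100% | 112% |
| mobilenet_v1_1.0_224 | 100% | 169% |
| Geomean | 100% | 119% |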
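The memory percentages are plain ratios of the KB figures, and the Geomean row is again a geometric mean. Below is a sketch for the cpu(quint8) vs tflite(quint8) comparison, using the Ubuntu 18.04 numbers copied from the usage table.

```python
import math

# Memory usage in KB from the Ubuntu 18.04 table above: (tflite(quint8), cpu(quint8))
USAGE_KB = {
    "arithmetic": (9044, 9328),
    "comparison": (12724, 13028),
    "tensor000": (8304, 9528),
    "tensor001": (13020, 13704),
    "unary": (8220, 9572),
    "inception_v3": (28120, 39224),
    "inception_v4": (52264, 58640),
    "mobilenet_v1_1.0_224": (7252, 12252),
}

# cpu(quint8) usage as a percentage of tflite(quint8) usage (100% = same footprint)
relative = {name: 100.0 * cpu / base for name, (base, cpu) in USAGE_KB.items()}

# Geometric mean of the ratios, converted back to a percentage
log_sum = sum(math.log(pct / 100.0) for pct in relative.values())
geomean_pct = 100.0 * math.exp(log_sum / len(relative))

for name, pct in relative.items():
    print(f"{name:22s} {pct:4.0f}%")
print(f"{'Geomean':22s} {geomean_pct:4.0f}%")
```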

Result

hseok-oh commented 4 years ago

Test on Odroid-XU4 (Tizen)

Performance result

Execution time

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 52.919 | 6.172 | 49.005 | 10.37 |
| comparison | 45.864 | 27.61 | 188.36 | 286.572 |
| tensor000 | 133.531 | 19.704 | 62.275 | 21.853 |
| tensor001 | 327.683 | 219.268 | 53.995 | 175.288 |
| unary | 159.103 | 167.817 | 318.581 | 306.664 |
| inception_v3 | 2755.685 | 1491.922 | 1188.801 | 1062.103 |
| inception_v4 | 5648.375 | 2854.986 | 2861.491 | 2202.942 |
| mobilenet_v1_1.0_224 | 383.965 | 391.332 | 252.476 | 195.974 |

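Comparison with tflite(float) (speedup = tflite(float) time / backend time; higher is faster)

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 1 | 8.57 | 1.08 | 5.10 |
| comparison | 1 | 1.66 | 0.24 | 0.16 |
| tensor000 | 1 | 6.78 | 2.14 | 6.11 |
| tensor001 | 1 | 1.49 | 6.07 | 1.87 |
| unary | 1 | 0.95 | 0.50 | 0.52 |
| inception_v3 | 1 | 1.85 | 2.32 | 2.59 |
| inception_v4 | 1 | 1.98 | 1.97 | 2.56 |
| mobilenet_v1_1.0_224 | 1 | 0.98 | 1.52 | 1.96 |
| Geomean | 1 | 2.17 | 1.36 | 1.68 |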

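Comparison with tflite(quint8) (speedup = tflite(quint8) time / cpu(quint8) time; higher is faster)

| Model | tflite(quint8) | cpu(quint8) |
|---|---|---|
| arithmetic | 1 | 4.73 |
| comparison | 1 | 0.66 |
| tensor000 | 1 | 2.85 |
| tensor001 | 1 | 0.31 |
| unary | 1 | 1.04 |
| inception_v3 | 1 | 1.12 |
| inception_v4 | 1 | 1.30 |
| mobilenet_v1_1.0_224 | 1 | 1.29 |
| Geomean | 1 | 1.23 |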

Memory usage

Usage (KB)

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 25052 | 21280 | 9336 | 9932 |
| comparison | 26632 | 22944 | 13040 | 13844 |
| tensor000 | 22116 | 21452 | 8596 | 10092 |
| tensor001 | 41380 | 38020 | 13416 | 14208 |
| unary | 22148 | 21292 | 8700 | 10108 |
| inception_v3 | 208616 | 116324 | 28356 | 39776 |
| inception_v4 | 352216 | 187732 | 49168 | 59220 |
| mobilenet_v1_1.0_224 | 45324 | 32148 | 7612 | 12716 |

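Comparison with tflite(float) (memory usage as a percentage of tflite(float); lower is better)

| Model | tflite(float) | cpu(float) | tflite(quint8) | cpu(quint8) |
|---|---|---|---|---|
| arithmetic | 100% | 85% | 37% | 40% |
| comparison | 100% | 86% | 49% | 52% |
| tensor000 | 100% | 97% | 39% | 46% |
| tensor001 | 100% | 92% | 32% | 34% |
| unary | 100% | 96% | 39% | 46% |
| inception_v3 | 100% | 56% | 14% | 19% |
| inception_v4 | 100% | 53% | 14% | 17% |
| mobilenet_v1_1.0_224 | 100% | 71% | 17% | 28% |
| Geomean | 100% | 78% | 27% | 33% |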

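Comparison with tflite(quint8) (cpu(quint8) memory usage as a percentage of tflite(quint8); lower is better)

| Model | tflite(quint8) | cpu(quint8) |
|---|---|---|
| arithmetic | 100% | 106% |
| comparison | 100% | 106% |
| tensor000 | 100% | 117% |
| tensor001 | 100% | 106% |
| unary | 100% | 116% |
| inception_v3 | 100% | 140% |
| inception_v4 | 100% | 120% |
| mobilenet_v1_1.0_224 | 100% | 167% |
| Geomean | 100% | 121% |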

Result

hseok-oh commented 4 years ago

Model operations

model file: http://npu.mooo.com/archive/nnpkg_test_model/nnpkg_quant.tar.gz

Total: 36 operation types

Unzipped model file layout:

nnpkg
├── float
│   ├── inception_v3
│   ├── inception_v4
│   ├── mobilenet
│   ├── Model_Arithmetic
│   ├── Model_Comparison
│   ├── Model_Tensor_000
│   ├── Model_Tensor_001
│   └── Model_Unary
└── quant
    ├── inception_v3_quant
    ├── inception_v4_quant
    ├── mobilenet_quant
    ├── Model_Arithmetic_U8
    ├── Model_Comparison_U8
    ├── Model_Tensor_U8_000
    ├── Model_Tensor_U8_001
    └── Model_Unary_U8

nnpkg/float: FLOAT I/O models
nnpkg/quant: UINT8 (quantized) I/O models
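The models under nnpkg/quant use UINT8 tensors. In TFLite's uint8 (asymmetric) scheme, each tensor carries a scale and a zero point, and a stored code q represents the real value scale * (q - zero_point). Below is a minimal sketch of that mapping. The scale and zero point are hypothetical values, not read from these models.

```python
def quantize(x, scale, zero_point):
    """Map a real value to a uint8 code so that x ~= scale * (q - zero_point)."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))  # saturate to the uint8 range


def dequantize(q, scale, zero_point):
    """Recover the real value represented by a uint8 code."""
    return scale * (q - zero_point)


# Hypothetical parameters covering roughly [-1.0, 1.0]
SCALE = 2.0 / 255
ZERO_POINT = 128

for x in (-0.75, 0.0, 0.5):
    q = quantize(x, SCALE, ZERO_POINT)
    print(f"{x:+.2f} -> q={q} -> {dequantize(q, SCALE, ZERO_POINT):+.4f}")
```

Real 0.0 maps exactly onto the zero point. Inside the representable range, the round-trip error is at most half a quantization step.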

arithmetic (3 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Arithmetic_U8/Model_Arithmetic_U8.tflite

#0 b'main' (MAIN) input tensors: [0 1]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm1')
        Tensor    1 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm2')
#0 b'main' (MAIN) output tensors: [2 3 4]
        Tensor    2 : buffer    3 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_add')
        Tensor    3 : buffer    4 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_sub')
        Tensor    4 : buffer    5 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_mul')

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 3
        ADD                                   :    1
        MUL                                   :    1
        SUB                                   :    1
Number of all operators                       :    3
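ADD, SUB and MUL on uint8 tensors must account for each operand's own scale and zero point before the results can be combined, and the result is then requantized to the output's parameters. Below is a float reference sketch of that data flow. The quantization parameters are hypothetical, not taken from Model_Arithmetic_U8. Production uint8 kernels typically use integer-only arithmetic with fixed-point multipliers, so this sketch is not the onert or TFLite code path.

```python
import operator

# Element-wise ops exercised by the arithmetic model
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def quantized_binary(op, q1, p1, q2, p2, p_out):
    """Float reference for a uint8 binary op.

    Each p is a (scale, zero_point) pair. Inputs are dequantized, combined in
    real arithmetic, then requantized to the output parameters and saturated.
    """
    (s1, z1), (s2, z2), (so, zo) = p1, p2, p_out
    real = OPS[op](s1 * (q1 - z1), s2 * (q2 - z2))
    q = round(real / so) + zo
    return max(0, min(255, q))


# Hypothetical quantization parameters
A = (0.5, 10)     # q=14 encodes 2.0
B = (0.25, 20)    # q=32 encodes 3.0
OUT = (1.0, 128)  # output code 128 encodes 0.0
print([quantized_binary(op, 14, A, 32, B, OUT) for op in ("ADD", "SUB", "MUL")])
```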

comparison (8 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Comparison_U8/Model_Comparison_U8.tflite

#0 b'main' (MAIN) input tensors: [0 1]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm1')
        Tensor    1 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm2')
#0 b'main' (MAIN) output tensors: [2 3 4 5 6 7 8 9]
        Tensor    2 : buffer    3 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_eq')
        Tensor    3 : buffer    4 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_gt')
        Tensor    4 : buffer    5 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_ge')
        Tensor    5 : buffer    6 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_lt')
        Tensor    6 : buffer    7 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_le')
        Tensor    7 : buffer    8 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_ne')
        Tensor    8 : buffer    9 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_max')
        Tensor    9 : buffer   10 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_min')

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 8
        EQUAL                                 :    1
        GREATER                               :    1
        GREATER_EQUAL                         :    1
        LESS                                  :    1
        LESS_EQUAL                            :    1
        MAXIMUM                               :    1
        MINIMUM                               :    1
        NOT_EQUAL                             :    1
Number of all operators                       :    8
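The Memory column in these listings matches element count times element size, printed with a B/K/M suffix in 1024-based units. For example, 1 × 320 × 240 × 10 UINT8 elements = 768000 bytes = 750.0K. The BOOL comparison outputs show the same 750.0K, which implies one byte per element. The sketch below reproduces the listed figures. `format_size` is a hypothetical reimplementation of the display format, not code taken from tools/tflitefile_tool.

```python
import math

# Bytes per element for the dtypes that appear in these listings
DTYPE_BYTES = {"UINT8": 1, "BOOL": 1, "INT32": 4, "FLOAT32": 4}


def tensor_bytes(shape, dtype):
    """Size of a dense tensor: product of the dimensions times the element size."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


def format_size(nbytes):
    """Render a byte count with one decimal and a B/K/M suffix (1K = 1024 bytes)."""
    for suffix in ("B", "K"):
        if nbytes < 1024:
            return f"{nbytes:.1f}{suffix}"
        nbytes /= 1024
    return f"{nbytes:.1f}M"


print(format_size(tensor_bytes([1, 320, 240, 10], "BOOL")))
print(format_size(tensor_bytes([1, 640, 480, 10], "UINT8")))
```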

tensor000 (5 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Tensor_U8_000/Model_Tensor_U8_000.tflite -v 0

#0 b'main' (MAIN) input tensors: [0]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'input')
#0 b'main' (MAIN) output tensors: [ 2  4  5  7  8  9 10 12]
        Tensor    2 : buffer    2 |  Empty | UINT8   | Memory 767.3K | Shape [1, 322, 244, 10] (b'output_pad')
        Tensor    4 : buffer    3 |  Empty | UINT8   | Memory 767.3K | Shape [1, 322, 244, 10] (b'output_pad2')
        Tensor    5 : buffer    4 |  Empty | INT32   | Memory 16.0B  | Shape [4] (b'output_shape')
        Tensor    7 : buffer    5 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split1')
        Tensor    8 : buffer    6 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split2')
        Tensor    9 : buffer    7 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split3')
        Tensor   10 : buffer    8 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split4')
        Tensor   12 : buffer    9 |  Empty | UINT8   | Memory 750.0K | Shape [1, 240, 320, 10] (b'output_transpose')

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 5
        PAD                                   :    1
        PADV2                                 :    1
        SHAPE                                 :    1
        SPLIT                                 :    1
        TRANSPOSE                             :    1
Number of all operators                       :    5
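The output shapes in this listing follow from the five operations. Below is a NumPy sketch of the same data movement. The issue does not show the model's parameters, so several choices are assumptions: the per-side padding, the PADV2 fill value, the split axis, and the transpose permutation are all inferred from the shapes or chosen for illustration.

```python
import numpy as np

# Stand-in for the 'input' tensor of Model_Tensor_U8_000 (values are irrelevant here)
x = np.zeros((1, 320, 240, 10), dtype=np.uint8)

# PAD: [1, 320, 240, 10] -> [1, 322, 244, 10]. The per-side split
# (1 row top/bottom, 2 columns left/right) is an assumption; only the totals are known.
paddings = ((0, 0), (1, 1), (2, 2), (0, 0))
out_pad = np.pad(x, paddings)  # zero fill

# PADV2 takes an explicit fill value. For uint8 data, the zero point encodes
# real 0.0; 128 is a hypothetical zero point.
out_pad2 = np.pad(x, paddings, constant_values=128)

# SHAPE: the input shape as an INT32 vector of length 4 (16 bytes)
out_shape = np.array(x.shape, dtype=np.int32)

# SPLIT: 4 equal pieces along the width axis, 240 / 4 = 60
out_split = np.split(x, 4, axis=2)

# TRANSPOSE: swap height and width
out_transpose = np.transpose(x, (0, 2, 1, 3))

print(out_pad.shape, out_shape.tolist(), [p.shape for p in out_split], out_transpose.shape)
```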

tensor001 (6 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Tensor_U8_001/Model_Tensor_U8_001.tflite

#0 b'main' (MAIN) input tensors: [0 4]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [4, 160, 120, 10] (b'input')
        Tensor    4 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'input2')
#0 b'main' (MAIN) output tensors: [ 3  6  8 11 13 14]
        Tensor    3 : buffer    3 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_batch_to_space_nd')
        Tensor    6 : buffer    4 |  Empty | UINT8   | Memory 25.0K  | Shape [1, 320, 8, 10] (b'output_gather')
        Tensor    8 : buffer    5 |  Empty | UINT8   | Memory 2.9M   | Shape [1, 640, 480, 10] (b'output_resize_bilinear')
        Tensor   11 : buffer    6 |  Empty | UINT8   | Memory 93.8K  | Shape [1, 80, 120, 10] (b'output_slice')
        Tensor   13 : buffer    7 |  Empty | UINT8   | Memory 750.0K | Shape [4, 160, 120, 10] (b'output_space_to_batch_nd')
        Tensor   14 : buffer    8 |  Empty | UINT8   | Memory 750.0K | Shape [1, 160, 120, 40] (b'output_space_to_depth')

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 6
        BATCH_TO_SPACE_ND                     :    1
        GATHER                                :    1
        RESIZE_BILINEAR                       :    1
        SLICE                                 :    1
        SPACE_TO_BATCH_ND                     :    1
        SPACE_TO_DEPTH                        :    1
Number of all operators                       :    6
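SPACE_TO_DEPTH turns [1, 320, 240, 10] into [1, 160, 120, 40], which matches a block size of 2. That block size is inferred from the shapes; the issue does not state it. Below is a NumPy sketch of TensorFlow's NHWC space-to-depth ordering, where each block × block patch is flattened into the channel axis. Like the other operations in this model, it only rearranges elements, so it works unchanged on uint8 data.

```python
import numpy as np


def space_to_depth(x, block):
    """NHWC space-to-depth: out[n, h, w, (bh*block + bw)*C + c] = x[n, h*block + bh, w*block + bw, c]."""
    n, h, w, c = x.shape
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    y = x.reshape(n, h // block, block, w // block, block, c)
    y = y.transpose(0, 1, 3, 2, 4, 5)  # gather each block's rows/cols next to the channels
    return y.reshape(n, h // block, w // block, block * block * c)


x = np.zeros((1, 320, 240, 10), dtype=np.uint8)  # stand-in for 'input2'
print(space_to_depth(x, 2).shape)
```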

unary (6 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Unary_U8/Model_Unary_U8.tflite 

#0 b'main' (MAIN) input tensors: [0]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'input')
#0 b'main' (MAIN) output tensors: [1 2 3 4 5 6]
        Tensor    1 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_l2_norm')
        Tensor    2 : buffer    3 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_log_softmax')
        Tensor    3 : buffer    4 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_logistic')
        Tensor    4 : buffer    5 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_tanh')
        Tensor    5 : buffer    6 |  Empty | UINT8   | Memory 10.0B  | Shape [1, 10] (b'output_reduce_mean')
        Tensor    6 : buffer    7 |  Empty | UINT8   | Memory 10.0B  | Shape [1, 10] (b'output_reduce_sum')

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 6
        L2_NORMALIZATION                      :    1
        LOGISTIC                              :    1
        LOG_SOFTMAX                           :    1
        MEAN                                  :    1
        SUM                                   :    1
        TANH                                  :    1
Number of all operators                       :    6

Expected TOTAL  memory: 3.7M
Expected FILLED memory: 8.0B
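A uint8 input has only 256 possible codes, so an element-wise op such as LOGISTIC can be evaluated once per code and then applied as a table lookup. Below is a sketch of that technique. It assumes TFLite's fixed uint8 output parameters for LOGISTIC (scale 1/256, zero point 0) and a hypothetical input scale and zero point. The issue does not say whether onert's kernels use a lookup table.

```python
import math


def logistic_lut(in_scale, in_zero_point, out_scale=1.0 / 256, out_zero_point=0):
    """Precompute the uint8 LOGISTIC output for every possible uint8 input code."""
    table = []
    for q in range(256):
        x = in_scale * (q - in_zero_point)     # dequantize the input code
        y = 1.0 / (1.0 + math.exp(-x))         # sigmoid in real arithmetic
        q_out = round(y / out_scale) + out_zero_point
        table.append(max(0, min(255, q_out)))  # requantize and saturate
    return table


# Hypothetical input quantization parameters
IN_SCALE, IN_ZERO_POINT = 0.1, 128
LUT = logistic_lut(IN_SCALE, IN_ZERO_POINT)

data = [0, 100, 128, 156, 255]  # example uint8 input codes
print([LUT[q] for q in data])   # one lookup per element
```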

inception v3 (5 operation types)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/inception_v3_quant/inception_v3_quant.tflite

#0 None (MAIN) input tensors: [315]
        Tensor  315 : buffer  257 |  Empty | UINT8   | Memory 261.9K | Shape [1, 299, 299, 3] (b'input')
#0 None (MAIN) output tensors: [316]
        Tensor  316 : buffer  247 |  Empty | UINT8   | Memory 1001.0B | Shape [1, 1001] (b'output')

Number of all operator types: 5
        AVERAGE_POOL_2D                       :   10
        CONCATENATION                         :   15
        CONV_2D                               :   95
        MAX_POOL_2D                           :    4
        RESHAPE                               :    1
Number of all operators                       :  125
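CONV_2D dominates both inception models, with 95 and 149 instances. In a uint8 convolution, products are accumulated in int32, and the accumulator is then rescaled by in_scale * weight_scale / out_scale, a real number usually below 1. Integer-only kernels commonly store that multiplier as a Q31 fixed-point value plus a power-of-two shift, similar to TFLite's QuantizeMultiplier helper. Below is a simplified sketch of the idea. The scales are hypothetical, and the single rounding step in `apply_multiplier` is not bit-exact with TFLite's gemmlowp-based routine.

```python
import math


def quantize_multiplier(real_multiplier):
    """Represent real_multiplier as q * 2**(shift - 31), with q a Q31 int32 value."""
    if real_multiplier <= 0:
        raise ValueError("multiplier must be positive")
    significand, shift = math.frexp(real_multiplier)  # 0.5 <= significand < 1
    q = round(significand * (1 << 31))
    if q == 1 << 31:  # rounding reached 1.0; renormalize
        q //= 2
        shift += 1
    return q, shift


def apply_multiplier(acc, q, shift):
    """Approximate round(acc * real_multiplier) with integer ops only (ties round up)."""
    right_shift = 31 - shift
    if right_shift <= 0:
        raise ValueError("multiplier too large for this sketch")
    return (acc * q + (1 << (right_shift - 1))) >> right_shift


# Hypothetical scales for one convolution
in_scale, weight_scale, out_scale = 0.0078, 0.02, 0.05
M = in_scale * weight_scale / out_scale
q, shift = quantize_multiplier(M)
acc = 12345  # an int32 accumulator value from the multiply-accumulate loop
print(q, shift, apply_multiplier(acc, q, shift), round(acc * M))
```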

inception v4 (2 operation types not used in the models above: FULLY_CONNECTED, SOFTMAX)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/inception_v4_quant/inception_v4_299_quant.tflite

#0 None (MAIN) input tensors: [495]
        Tensor  495 : buffer  374 |  Empty | UINT8   | Memory 261.9K | Shape [1, 299, 299, 3] (b'input')
#0 None (MAIN) output tensors: [494]
        Tensor  494 : buffer  256 |  Empty | UINT8   | Memory 1001.0B | Shape [1, 1001] (b'InceptionV4/Logits/Predictions')

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 6
        AVERAGE_POOL_2D                       :   15
        CONCATENATION                         :   25
        CONV_2D                               :  149
        FULLY_CONNECTED                       :    1
        MAX_POOL_2D                           :    4
        SOFTMAX                               :    1
Number of all operators                       :  195

mobilenet (1 operation type not used in the models above: DEPTHWISE_CONV_2D)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/mobilenet_quant/mobilenet_v1_1.0_224_quant.tflite

#0 None (MAIN) input tensors: [88]
        Tensor   88 : buffer   47 |  Empty | UINT8   | Memory 147.0K | Shape [1, 224, 224, 3] (b'input')
#0 None (MAIN) output tensors: [87]
        Tensor   87 : buffer   65 |  Empty | UINT8   | Memory 1001.0B | Shape [1, 1001] (b'MobilenetV1/Predictions/Reshape_1')

Number of all operator types: 5
        AVERAGE_POOL_2D                       :    1
        CONV_2D                               :   15
        DEPTHWISE_CONV_2D                     :   13
        RESHAPE                               :    1
        SOFTMAX                               :    1
Number of all operators                       :   31

Expected TOTAL  memory: 9.0M
Expected FILLED memory: 4.1M
lemmaa commented 4 years ago
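
| No | OP \ Model | arithmetic | comparison | tensor000 | tensor001 | unary | inception v3 | inception v4 | mobilenet |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ADD | O |  |  |  |  |  |  |  |
| 2 | AVERAGE_POOL_2D |  |  |  |  |  | O | O | O |
| 3 | BATCH_TO_SPACE_ND |  |  |  | O |  |  |  |  |
| 4 | CONCATENATION |  |  |  |  |  | O | O |  |
| 5 | CONV_2D |  |  |  |  |  | O | O | O |
| 6 | DEPTHWISE_CONV_2D |  |  |  |  |  |  |  | O |
| 7 | EQUAL |  | O |  |  |  |  |  |  |
| 8 | FULLY_CONNECTED |  |  |  |  |  |  | O |  |
| 9 | GATHER |  |  |  | O |  |  |  |  |
| 10 | GREATER |  | O |  |  |  |  |  |  |
| 11 | GREATER_EQUAL |  | O |  |  |  |  |  |  |
| 12 | L2_NORMALIZATION |  |  |  |  | O |  |  |  |
| 13 | LESS |  | O |  |  |  |  |  |  |
| 14 | LESS_EQUAL |  | O |  |  |  |  |  |  |
| 15 | LOG_SOFTMAX |  |  |  |  | O |  |  |  |
| 16 | LOGISTIC |  |  |  |  | O |  |  |  |
| 17 | MAX_POOL_2D |  |  |  |  |  | O | O |  |
| 18 | MAXIMUM |  | O |  |  |  |  |  |  |
| 19 | MEAN |  |  |  |  | O |  |  |  |
| 20 | MINIMUM |  | O |  |  |  |  |  |  |
| 21 | MUL | O |  |  |  |  |  |  |  |
| 22 | NOT_EQUAL |  | O |  |  |  |  |  |  |
| 23 | PAD |  |  | O |  |  |  |  |  |
| 24 | PADV2 |  |  | O |  |  |  |  |  |
| 25 | RESHAPE |  |  |  |  |  | O |  | O |
| 26 | RESIZE_BILINEAR |  |  |  | O |  |  |  |  |
| 27 | SHAPE |  |  | O |  |  |  |  |  |
| 28 | SLICE |  |  |  | O |  |  |  |  |
| 29 | SOFTMAX |  |  |  |  |  |  | O | O |
| 30 | SPACE_TO_BATCH_ND |  |  |  | O |  |  |  |  |
| 31 | SPACE_TO_DEPTH |  |  |  | O |  |  |  |  |
| 32 | SPLIT |  |  | O |  |  |  |  |  |
| 33 | SUB | O |  |  |  |  |  |  |  |
| 34 | SUM |  |  |  |  | O |  |  |  |
| 35 | TANH |  |  |  |  | O |  |  |  |
| 36 | TRANSPOSE |  |  | O |  |  |  |  |  |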
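The operator/model matrix can also be regenerated mechanically from the per-model "Model Stats" output, which guards against hand-copying mistakes. Below is a minimal sketch in plain Python. The operator sets are transcribed from the parser output in this issue rather than read from the .tflite files.

```python
# Operator types per model, transcribed from the model_parser.py "Model Stats" output above
MODEL_OPS = {
    "arithmetic": {"ADD", "MUL", "SUB"},
    "comparison": {"EQUAL", "GREATER", "GREATER_EQUAL", "LESS", "LESS_EQUAL",
                   "MAXIMUM", "MINIMUM", "NOT_EQUAL"},
    "tensor000": {"PAD", "PADV2", "SHAPE", "SPLIT", "TRANSPOSE"},
    "tensor001": {"BATCH_TO_SPACE_ND", "GATHER", "RESIZE_BILINEAR", "SLICE",
                  "SPACE_TO_BATCH_ND", "SPACE_TO_DEPTH"},
    "unary": {"L2_NORMALIZATION", "LOGISTIC", "LOG_SOFTMAX", "MEAN", "SUM", "TANH"},
    "inception_v3": {"AVERAGE_POOL_2D", "CONCATENATION", "CONV_2D", "MAX_POOL_2D", "RESHAPE"},
    "inception_v4": {"AVERAGE_POOL_2D", "CONCATENATION", "CONV_2D", "FULLY_CONNECTED",
                     "MAX_POOL_2D", "SOFTMAX"},
    "mobilenet": {"AVERAGE_POOL_2D", "CONV_2D", "DEPTHWISE_CONV_2D", "RESHAPE", "SOFTMAX"},
}

# Union of all operator types. sorted() uses code-point order, so names containing
# '_' (e.g. LOG_SOFTMAX vs LOGISTIC) may be ordered differently than in the table.
ALL_OPS = sorted(set().union(*MODEL_OPS.values()))

for no, op in enumerate(ALL_OPS, start=1):
    marks = " ".join("O" if op in ops else "." for ops in MODEL_OPS.values())
    print(f"{no:2d} {op:18s} {marks}")
```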