[onert] Quantization kernel performance and memory usage status

hseok-oh commented 4 years ago

Test model

Inception v3
Inception v4
mobilenet v1_1.0_224
Test models by #3886

Test setting

Base code
- 57b87a07753ca2b3ced20c437a39c449acc92285 + #4050 (build without hdf5) + #4059 (conv2d memory optimize)
Odroid-XU4 ubuntu 18.04
onert test
- Use nnpackage_run without HDF5 linking
tflite test
- Use tflite_run (THREAD=4)
- Tensorflow Lite 1.13.1
Performance test: mean execution time over 10 runs
Memory usage test: peak RSS of a single run
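The measurement procedure above (mean of 10 runs, peak RSS of one run) can be sketched with a small generic harness. This is only an illustration under stated assumptions: it times the whole process from outside, so process start-up and model loading are included, which may differ from the times the benchmark tools report themselves. The command is a placeholder, and the helper names `mean_wall_time_ms` and `peak_rss_kb` are hypothetical.

```python
import resource
import statistics
import subprocess
import sys
import time


def mean_wall_time_ms(cmd, runs=10):
    """Run cmd `runs` times and return the mean wall-clock time in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


def peak_rss_kb(cmd):
    """Run cmd once and return the peak RSS of finished child processes.

    On Linux ru_maxrss is reported in KB. The value is the maximum over all
    children this process has waited for, so use a fresh Python process per
    measured command when comparing different commands.
    """
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real benchmark invocation.
    cmd = [sys.executable, "-c", "pass"]
    print(f"mean time: {mean_wall_time_ms(cmd):.3f} ms")
    print(f"peak RSS : {peak_rss_kb(cmd)} KB")
```

Measuring peak RSS in a separate process per command keeps one model's footprint from masking another's.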

Performance result (ubuntu 18.04)

Execution time

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	12.791	8.962	40.188	12.042
comparison	42.98	42.749	127.911	126.398
tensor000	37.934	15.899	9.361	4.794
tensor001	103.697	69.477	51.951	59.625
unary	129.576	137.853	279.145	122.516
inception_v3	1773.688	1541.889	520.773	357.919
inception_v4	3443.901	3075.055	1838.491	778.056
mobilenet_v1_1.0_224	307.432	497.351	66.157	47.396

Comparison with tflite(float)

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	1	1.43	0.32	1.06
comparison	1	1.01	0.34	0.34
tensor000	1	2.39	4.05	7.91
tensor001	1	1.49	2.00	1.74
unary	1	0.94	0.46	1.06
inception_v3	1	1.15	3.41	4.96
inception_v4	1	1.12	1.87	4.43
mobilenet_v1_1.0_224	1	0.62	4.65	6.49
Geomean	1	1.18	1.36	2.29
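The ratios in this table are the tflite(float) time divided by each column's time, so values above 1 mean faster than the baseline, and the last row is the geometric mean over the eight models. A minimal sketch that recomputes them from the execution times above (the helper names `speedups` and `geomean` are illustrative, not part of any tool):

```python
import math

COLUMNS = ("tflite(float)", "cpu(float)", "tflite (quint8)", "cpu(quint8)")

# Mean execution times from the "Execution time" table (Ubuntu 18.04).
TIMES = {
    "arithmetic": (12.791, 8.962, 40.188, 12.042),
    "comparison": (42.98, 42.749, 127.911, 126.398),
    "tensor000": (37.934, 15.899, 9.361, 4.794),
    "tensor001": (103.697, 69.477, 51.951, 59.625),
    "unary": (129.576, 137.853, 279.145, 122.516),
    "inception_v3": (1773.688, 1541.889, 520.773, 357.919),
    "inception_v4": (3443.901, 3075.055, 1838.491, 778.056),
    "mobilenet_v1_1.0_224": (307.432, 497.351, 66.157, 47.396),
}


def speedups(times, baseline=0):
    """Divide the baseline column's time by every column's time, per model."""
    return {m: tuple(row[baseline] / t for t in row) for m, row in times.items()}


def geomean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = speedups(TIMES)
for i, name in enumerate(COLUMNS):
    print(f"{name:16s} geomean x{geomean(r[i] for r in ratios.values()):.2f}")
```

Calling `speedups(TIMES, baseline=2)` uses tflite (quint8) as the baseline instead, which corresponds to the quantized comparison table.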

Comparison with tflite(quant)

	tflite (quint8)	cpu(quint8)
arithmetic	1	3.34
comparison	1	1.01
tensor000	1	1.95
tensor001	1	0.87
unary	1	2.28
inception_v3	1	1.46
inception_v4	1	2.36
mobilenet_v1_1.0_224	1	1.40
Geomean	1	1.68

Memory usage

Usage (KB)

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	24756	20784	9044	9328
comparison	26328	22132	12724	13028
tensor000	21820	20864	8304	9528
tensor001	41416	37308	13020	13704
unary	21884	20824	8220	9572
inception_v3	208084	115824	28120	39224
inception_v4	351908	188444	52264	58640
mobilenet_v1_1.0_224	44448	31396	7252	12252

Comparison with tflite(float)

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	100%	84%	37%	38%
comparison	100%	84%	48%	49%
tensor000	100%	96%	38%	44%
tensor001	100%	90%	31%	33%
unary	100%	95%	38%	44%
inception_v3	100%	56%	14%	19%
inception_v4	100%	54%	15%	17%
mobilenet_v1_1.0_224	100%	71%	16%	28%
Geomean	100%	77%	27%	32%

Comparison with tflite(quant)

	tflite (quint8)	cpu(quint8)
arithmetic	100%	103%
comparison	100%	102%
tensor000	100%	115%
tensor001	100%	105%
unary	100%	116%
inception_v3	100%	139%
inception_v4	100%	112%
mobilenet_v1_1.0_224	100%	169%
Geomean	100%	119%

Result

QASYMM uint8 performance
- Faster than TensorFlow Lite 1.13.1 on geomean: x2.29 vs. float, x1.68 vs. qasymm-uint8
QASYMM uint8 memory usage
- Lower peak memory than TensorFlow Lite 1.13.1 float (-68% on geomean)
- Higher peak memory than TensorFlow Lite 1.13.1 qasymm-uint8 (+19% on geomean)
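The two memory figures in this summary come from the geomean rows of the memory tables: a geomean ratio g is reported as a relative change of g - 1. A short sketch that recomputes both from the peak RSS values listed under "Usage (KB)" (variable and function names are illustrative):

```python
import math

# Peak RSS in KB (Ubuntu 18.04): (tflite(float), tflite (quint8), cpu(quint8)).
RSS_KB = {
    "arithmetic": (24756, 9044, 9328),
    "comparison": (26328, 12724, 13028),
    "tensor000": (21820, 8304, 9528),
    "tensor001": (41416, 13020, 13704),
    "unary": (21884, 8220, 9572),
    "inception_v3": (208084, 28120, 39224),
    "inception_v4": (351908, 52264, 58640),
    "mobilenet_v1_1.0_224": (44448, 7252, 12252),
}


def geomean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# cpu(quint8) usage relative to each TensorFlow Lite baseline.
vs_float = geomean(cpu / tfl_f for tfl_f, _, cpu in RSS_KB.values())
vs_quant = geomean(cpu / tfl_q for _, tfl_q, cpu in RSS_KB.values())

print(f"vs tflite(float)   : {vs_float:.0%} of baseline, change {vs_float - 1:+.0%}")
print(f"vs tflite (quint8) : {vs_quant:.0%} of baseline, change {vs_quant - 1:+.0%}")
```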

hseok-oh commented 4 years ago

Test on Odroid-XU4 Tizen

Performance result

Execution time

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	52.919	6.172	49.005	10.37
comparison	45.864	27.61	188.36	286.572
tensor000	133.531	19.704	62.275	21.853
tensor001	327.683	219.268	53.995	175.288
unary	159.103	167.817	318.581	306.664
inception_v3	2755.685	1491.922	1188.801	1062.103
inception_v4	5648.375	2854.986	2861.491	2202.942
mobilenet_v1_1.0_224	383.965	391.332	252.476	195.974

Comparison with tflite(float)

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	1	8.57	1.08	5.10
comparison	1	1.66	0.24	0.16
tensor000	1	6.78	2.14	6.11
tensor001	1	1.49	6.07	1.87
unary	1	0.95	0.50	0.52
inception_v3	1	1.85	2.32	2.59
inception_v4	1	1.98	1.97	2.56
mobilenet_v1_1.0_224	1	0.98	1.52	1.96
Geomean	1	2.17	1.36	1.68

Comparison with tflite(quant)

	tflite (quint8)	cpu(quint8)
arithmetic	1	4.73
comparison	1	0.66
tensor000	1	2.85
tensor001	1	0.31
unary	1	1.04
inception_v3	1	1.12
inception_v4	1	1.30
mobilenet_v1_1.0_224	1	1.29
Geomean	1.00	1.23
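The geomean is above 1 on Tizen, but two rows of this table are below 1, meaning cpu(quint8) is slower than tflite (quint8) for those models. A small sketch that picks out such rows from the Tizen execution times (the function name `slower_models` is illustrative):

```python
# Mean execution times on Tizen: (tflite (quint8), cpu(quint8)) per model.
TIZEN_QUANT_TIMES = {
    "arithmetic": (49.005, 10.37),
    "comparison": (188.36, 286.572),
    "tensor000": (62.275, 21.853),
    "tensor001": (53.995, 175.288),
    "unary": (318.581, 306.664),
    "inception_v3": (1188.801, 1062.103),
    "inception_v4": (2861.491, 2202.942),
    "mobilenet_v1_1.0_224": (252.476, 195.974),
}


def slower_models(times):
    """Return {model: speedup} for models where cpu(quint8) is slower than tflite."""
    return {m: tfl / cpu for m, (tfl, cpu) in times.items() if tfl / cpu < 1.0}


for model, ratio in slower_models(TIZEN_QUANT_TIMES).items():
    print(f"{model}: x{ratio:.2f}")
```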

Memory usage

Usage (KB)

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	25052	21280	9336	9932
comparison	26632	22944	13040	13844
tensor000	22116	21452	8596	10092
tensor001	41380	38020	13416	14208
unary	22148	21292	8700	10108
inception_v3	208616	116324	28356	39776
inception_v4	352216	187732	49168	59220
mobilenet_v1_1.0_224	45324	32148	7612	12716

Comparison with tflite(float)

	tflite(float)	cpu(float)	tflite (quint8)	cpu(quint8)
arithmetic	100%	85%	37%	40%
comparison	100%	86%	49%	52%
tensor000	100%	97%	39%	46%
tensor001	100%	92%	32%	34%
unary	100%	96%	39%	46%
inception_v3	100%	56%	14%	19%
inception_v4	100%	53%	14%	17%
mobilenet_v1_1.0_224	100%	71%	17%	28%
Geomean	100%	78%	27%	33%

Comparison with tflite(quant)

	tflite (quint8)	cpu(quint8)
arithmetic	100%	106%
comparison	100%	106%
tensor000	100%	117%
tensor001	100%	106%
unary	100%	116%
inception_v3	100%	140%
inception_v4	100%	120%
mobilenet_v1_1.0_224	100%	167%
Geomean	100%	121%

Result

QASYMM uint8 performance
- Faster than TensorFlow Lite 1.13.1 on geomean: x1.68 vs. float, x1.23 vs. qasymm-uint8
QASYMM uint8 memory usage
- Lower peak memory than TensorFlow Lite 1.13.1 float (-67% on geomean)
- Higher peak memory than TensorFlow Lite 1.13.1 qasymm-uint8 (+21% on geomean)

hseok-oh commented 4 years ago

Model operations

model file: http://npu.mooo.com/archive/nnpkg_test_model/nnpkg_quant.tar.gz

arithmetic (3 operations)
comparison (8 operations)
tensor000 (5 operations)
tensor001 (6 operations)
unary (6 operations)
inception v3 (5 operations)
inception v4 (2 more operations: FULLY_CONNECTED, SOFTMAX)
mobilenet (1 more operation: DEPTHWISE_CONV_2D)

Total: 36 operations
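The total counts each operator type once across all models, which is why the last two models only add the types not already covered. As a cross-check, the sketch below takes the union of the operator types that model_parser.py reports for each model in this thread (the dictionary layout is illustrative):

```python
# Operator types per model, copied from the model_parser.py output in this thread.
MODEL_OPS = {
    "arithmetic": {"ADD", "MUL", "SUB"},
    "comparison": {"EQUAL", "GREATER", "GREATER_EQUAL", "LESS", "LESS_EQUAL",
                   "MAXIMUM", "MINIMUM", "NOT_EQUAL"},
    "tensor000": {"PAD", "PADV2", "SHAPE", "SPLIT", "TRANSPOSE"},
    "tensor001": {"BATCH_TO_SPACE_ND", "GATHER", "RESIZE_BILINEAR", "SLICE",
                  "SPACE_TO_BATCH_ND", "SPACE_TO_DEPTH"},
    "unary": {"L2_NORMALIZATION", "LOGISTIC", "LOG_SOFTMAX", "MEAN", "SUM", "TANH"},
    "inception_v3": {"AVERAGE_POOL_2D", "CONCATENATION", "CONV_2D", "MAX_POOL_2D",
                     "RESHAPE"},
    "inception_v4": {"AVERAGE_POOL_2D", "CONCATENATION", "CONV_2D",
                     "FULLY_CONNECTED", "MAX_POOL_2D", "SOFTMAX"},
    "mobilenet": {"AVERAGE_POOL_2D", "CONV_2D", "DEPTHWISE_CONV_2D", "RESHAPE",
                  "SOFTMAX"},
}

seen = set()
for model, ops in MODEL_OPS.items():
    new_ops = sorted(ops - seen)  # types not covered by any earlier model
    seen |= ops
    print(f"{model:12s} +{len(new_ops)} new: {', '.join(new_ops)}")

print(f"Total: {len(seen)} operations")
```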

Layout after extracting the model file:

nnpkg
├── float
│   ├── inception_v3
│   ├── inception_v4
│   ├── mobilenet
│   ├── Model_Arithmetic
│   ├── Model_Comparison
│   ├── Model_Tensor_000
│   ├── Model_Tensor_001
│   └── Model_Unary
└── quant
    ├── inception_v3_quant
    ├── inception_v4_quant
    ├── mobilenet_quant
    ├── Model_Arithmetic_U8
    ├── Model_Comparison_U8
    ├── Model_Tensor_U8_000
    ├── Model_Tensor_U8_001
    └── Model_Unary_U8

nnpkg/float: FLOAT I/O model
nnpkg/quant: UINT8 (quantized) I/O model

arithmetic (3 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Arithmetic_U8/Model_Arithmetic_U8.tflite

#0 b'main' (MAIN) input tensors: [0 1]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm1')
        Tensor    1 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm2')
#0 b'main' (MAIN) output tensors: [2 3 4]
        Tensor    2 : buffer    3 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_add')
        Tensor    3 : buffer    4 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_sub')
        Tensor    4 : buffer    5 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_mul')

(operations)

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 3
        ADD                                   :    1
        MUL                                   :    1
        SUB                                   :    1
Number of all operators                       :    3
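The "Memory" column in the tensor listing is consistent with element count times element size: a UINT8 tensor of shape [1, 320, 240, 10] has 768,000 one-byte elements, which is 750.0K. The sketch below reproduces that arithmetic. The unit formatting rule (1024-based, one decimal) is inferred from the values printed in this thread rather than taken from the tool's source, and the helper names are hypothetical.

```python
from math import prod

# Bytes per element for the dtypes that appear in the listings (assumed sizes).
DTYPE_BYTES = {"UINT8": 1, "BOOL": 1, "INT32": 4, "FLOAT32": 4}


def tensor_bytes(shape, dtype):
    """Size of a dense tensor in bytes: number of elements times element size."""
    return prod(shape) * DTYPE_BYTES[dtype]


def human(nbytes):
    """Format a byte count with 1024-based units and one decimal place."""
    for unit, div in (("M", 1024 ** 2), ("K", 1024)):
        if nbytes >= div:
            return f"{nbytes / div:.1f}{unit}"
    return f"{nbytes:.1f}B"


for shape, dtype in [
    ([1, 320, 240, 10], "UINT8"),
    ([1, 322, 244, 10], "UINT8"),
    ([1, 640, 480, 10], "UINT8"),
    ([4], "INT32"),
]:
    print(shape, dtype, human(tensor_bytes(shape, dtype)))
```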

comparison (8 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Comparison_U8/Model_Comparison_U8.tflite

#0 b'main' (MAIN) input tensors: [0 1]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm1')
        Tensor    1 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ifm2')
#0 b'main' (MAIN) output tensors: [2 3 4 5 6 7 8 9]
        Tensor    2 : buffer    3 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_eq')
        Tensor    3 : buffer    4 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_gt')
        Tensor    4 : buffer    5 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_ge')
        Tensor    5 : buffer    6 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_lt')
        Tensor    6 : buffer    7 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_le')
        Tensor    7 : buffer    8 |  Empty | BOOL    | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_ne')
        Tensor    8 : buffer    9 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_max')
        Tensor    9 : buffer   10 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'ofm_min')

(operations)

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 8
        EQUAL                                 :    1
        GREATER                               :    1
        GREATER_EQUAL                         :    1
        LESS                                  :    1
        LESS_EQUAL                            :    1
        MAXIMUM                               :    1
        MINIMUM                               :    1
        NOT_EQUAL                             :    1
Number of all operators                       :    8

tensor000 (5 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Tensor_U8_000/Model_Tensor_U8_000.tflite -v 0

#0 b'main' (MAIN) input tensors: [0]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'input')
#0 b'main' (MAIN) output tensors: [ 2  4  5  7  8  9 10 12]
        Tensor    2 : buffer    2 |  Empty | UINT8   | Memory 767.3K | Shape [1, 322, 244, 10] (b'output_pad')
        Tensor    4 : buffer    3 |  Empty | UINT8   | Memory 767.3K | Shape [1, 322, 244, 10] (b'output_pad2')
        Tensor    5 : buffer    4 |  Empty | INT32   | Memory 16.0B  | Shape [4] (b'output_shape')
        Tensor    7 : buffer    5 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split1')
        Tensor    8 : buffer    6 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split2')
        Tensor    9 : buffer    7 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split3')
        Tensor   10 : buffer    8 |  Empty | UINT8   | Memory 187.5K | Shape [1, 320, 60, 10] (b'output_split4')
        Tensor   12 : buffer    9 |  Empty | UINT8   | Memory 750.0K | Shape [1, 240, 320, 10] (b'output_transpose')

(operations)

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 5
        PAD                                   :    1
        PADV2                                 :    1
        SHAPE                                 :    1
        SPLIT                                 :    1
        TRANSPOSE                             :    1
Number of all operators                       :    5

tensor001 (6 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Tensor_U8_001/Model_Tensor_U8_001.tflite

#0 b'main' (MAIN) input tensors: [0 4]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [4, 160, 120, 10] (b'input')
        Tensor    4 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'input2')
#0 b'main' (MAIN) output tensors: [ 3  6  8 11 13 14]
        Tensor    3 : buffer    3 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_batch_to_space_nd')
        Tensor    6 : buffer    4 |  Empty | UINT8   | Memory 25.0K  | Shape [1, 320, 8, 10] (b'output_gather')
        Tensor    8 : buffer    5 |  Empty | UINT8   | Memory 2.9M   | Shape [1, 640, 480, 10] (b'output_resize_bilinear')
        Tensor   11 : buffer    6 |  Empty | UINT8   | Memory 93.8K  | Shape [1, 80, 120, 10] (b'output_slice')
        Tensor   13 : buffer    7 |  Empty | UINT8   | Memory 750.0K | Shape [4, 160, 120, 10] (b'output_space_to_batch_nd')
        Tensor   14 : buffer    8 |  Empty | UINT8   | Memory 750.0K | Shape [1, 160, 120, 40] (b'output_space_to_depth')

(operations)

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 6
        BATCH_TO_SPACE_ND                     :    1
        GATHER                                :    1
        RESIZE_BILINEAR                       :    1
        SLICE                                 :    1
        SPACE_TO_BATCH_ND                     :    1
        SPACE_TO_DEPTH                        :    1
Number of all operators                       :    6
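Several output shapes in this model follow directly from the standard shape rules of the layout operators. The sketch below assumes NHWC layout, a block size of 2, and no padding or cropping. These parameters are inferred from the input and output shapes listed above and are not stated in the thread, and the function names are illustrative.

```python
def space_to_depth(shape, block):
    """Move block x block spatial patches into the channel dimension."""
    n, h, w, c = shape
    return [n, h // block, w // block, c * block * block]


def space_to_batch_nd(shape, block):
    """Move block x block spatial patches into the batch dimension (no padding)."""
    n, h, w, c = shape
    return [n * block * block, h // block, w // block, c]


def batch_to_space_nd(shape, block):
    """Inverse of space_to_batch_nd (no cropping)."""
    n, h, w, c = shape
    return [n // (block * block), h * block, w * block, c]


BLOCK = 2  # assumption inferred from the shapes, not given in the thread

print(space_to_depth([1, 320, 240, 10], BLOCK))     # output_space_to_depth
print(space_to_batch_nd([1, 320, 240, 10], BLOCK))  # output_space_to_batch_nd
print(batch_to_space_nd([4, 160, 120, 10], BLOCK))  # output_batch_to_space_nd
```

All three rearrangements keep the element count unchanged, which matches the identical 750.0K sizes reported for these outputs.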

unary (6 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/Model_Unary_U8/Model_Unary_U8.tflite 

#0 b'main' (MAIN) input tensors: [0]
        Tensor    0 : buffer    1 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'input')
#0 b'main' (MAIN) output tensors: [1 2 3 4 5 6]
        Tensor    1 : buffer    2 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_l2_norm')
        Tensor    2 : buffer    3 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_log_softmax')
        Tensor    3 : buffer    4 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_logistic')
        Tensor    4 : buffer    5 |  Empty | UINT8   | Memory 750.0K | Shape [1, 320, 240, 10] (b'output_tanh')
        Tensor    5 : buffer    6 |  Empty | UINT8   | Memory 10.0B  | Shape [1, 10] (b'output_reduce_mean')
        Tensor    6 : buffer    7 |  Empty | UINT8   | Memory 10.0B  | Shape [1, 10] (b'output_reduce_sum')

(operations)

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 6
        L2_NORMALIZATION                      :    1
        LOGISTIC                              :    1
        LOG_SOFTMAX                           :    1
        MEAN                                  :    1
        SUM                                   :    1
        TANH                                  :    1
Number of all operators                       :    6

Expected TOTAL  memory: 3.7M
Expected FILLED memory: 8.0B

inception v3 (5 operations)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/inception_v3_quant/inception_v3_quant.tflite

#0 None (MAIN) input tensors: [315]
        Tensor  315 : buffer  257 |  Empty | UINT8   | Memory 261.9K | Shape [1, 299, 299, 3] (b'input')
#0 None (MAIN) output tensors: [316]
        Tensor  316 : buffer  247 |  Empty | UINT8   | Memory 1001.0B | Shape [1, 1001] (b'output')

(operations)

Number of all operator types: 5
        AVERAGE_POOL_2D                       :   10
        CONCATENATION                         :   15
        CONV_2D                               :   95
        MAX_POOL_2D                           :    4
        RESHAPE                               :    1
Number of all operators                       :  125

inception v4 (2 more operations: `FULLY_CONNECTED`, `SOFTMAX`)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/inception_v4_quant/inception_v4_299_quant.tflite

#0 None (MAIN) input tensors: [495]
        Tensor  495 : buffer  374 |  Empty | UINT8   | Memory 261.9K | Shape [1, 299, 299, 3] (b'input')
#0 None (MAIN) output tensors: [494]
        Tensor  494 : buffer  256 |  Empty | UINT8   | Memory 1001.0B | Shape [1, 1001] (b'InceptionV4/Logits/Predictions')

(operations)

==== Model Stats (1 Subgraphs) ====

Number of all operator types: 6
        AVERAGE_POOL_2D                       :   15
        CONCATENATION                         :   25
        CONV_2D                               :  149
        FULLY_CONNECTED                       :    1
        MAX_POOL_2D                           :    4
        SOFTMAX                               :    1
Number of all operators                       :  195

mobilenet (1 more operation: `DEPTHWISE_CONV_2D`)

$ python3 tools/tflitefile_tool/model_parser.py nnpkg/quant/mobilenet_quant/mobilenet_v1_1.0_224_quant.tflite

#0 None (MAIN) input tensors: [88]
        Tensor   88 : buffer   47 |  Empty | UINT8   | Memory 147.0K | Shape [1, 224, 224, 3] (b'input')
#0 None (MAIN) output tensors: [87]
        Tensor   87 : buffer   65 |  Empty | UINT8   | Memory 1001.0B | Shape [1, 1001] (b'MobilenetV1/Predictions/Reshape_1')

(operations)

Number of all operator types: 5
        AVERAGE_POOL_2D                       :    1
        CONV_2D                               :   15
        DEPTHWISE_CONV_2D                     :   13
        RESHAPE                               :    1
        SOFTMAX                               :    1
Number of all operators                       :   31

Expected TOTAL  memory: 9.0M
Expected FILLED memory: 4.1M

lemmaa commented 4 years ago

No	OP \ Model	arithmetic	comparison	tensor000	tensor001	unary	inception v3	inception v4	mobilenet
1	ADD	O
2	AVERAGE_POOL_2D						O	O	O
3	BATCH_TO_SPACE_ND				O
4	CONCATENATION						O	O
5	CONV_2D						O	O	O
6	DEPTHWISE_CONV_2D								O
7	EQUAL		O
8	FULLY_CONNECTED							O
9	GATHER				O
10	GREATER		O
11	GREATER_EQUAL		O
12	L2_NORMALIZATION					O
13	LESS		O
14	LESS_EQUAL		O
15	LOG_SOFTMAX					O
16	LOGISTIC					O
17	MAX_POOL_2D						O	O
18	MAXIMUM		O
19	MEAN					O
20	MINIMUM		O
21	MUL	O
22	NOT_EQUAL		O
23	PAD			O
24	PADV2			O
25	RESHAPE						O		O
26	RESIZE_BILINEAR				O
27	SHAPE			O
28	SLICE				O
29	SOFTMAX							O	O
30	SPACE_TO_BATCH_ND				O
31	SPACE_TO_DEPTH				O
32	SPLIT			O
33	SUB	O
34	SUM					O
35	TANH					O
36	TRANSPOSE			O

Samsung / ONE

[onert] Quantization kernel performance and memory usage status #4066
