XiaoMi / mobile-ai-bench

Benchmarking Neural Network Inference on Mobile Devices
Apache License 2.0
353 stars 57 forks

The speed of quantized tflite model is slower than its float model. #20

Closed liyancas closed 5 years ago

liyancas commented 5 years ago

I benchmarked the models on a OnePlus 3T. The performance of the quantized TFLite models is worse than that of the float models. Has anyone run into the same issue?

The command I used:

python tools/benchmark.py --output_dir=output --frameworks=all \
--runtimes=all --model_names=all \
--target_abis=arm64-v8a

| model_name | device_name | soc | abi | runtime | MACE | SNPE | NCNN | TFLITE |
|---|---|---|---|---|---|---|---|---|
| InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 886.312 | 664.404 | 1578.295 | 997 |
| InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | DSP | | 4.996 | | |
| InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 153.049 | 141.246 | | |
| InceptionV3Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | | | | 1014.75 |
| MobileNetV1 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 52.046 | 385.444 | 37.883 | 71.367 |
| MobileNetV1 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 25.267 | 24.441 | | |
| MobileNetV1Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 36.743 | | | 145.778 |
| MobileNetV2 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 40.625 | 413.553 | 29.208 | 76.021 |
| MobileNetV2 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 17.546 | 14.966 | | |
| MobileNetV2Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 28.679 | | | 294.099 |
| SqueezeNetV11 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 37.453 | 59.481 | 21.376 | |
| SqueezeNetV11 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 20.001 | 17.986 | | |
| VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 452.521 | 1002.442 | 821.195 | |
| VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | DSP | | 136.465 | | |
| VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 196.507 | | | |
liyancas commented 5 years ago

Another question: for InceptionV3, the DSP is ~132x faster than the CPU. Is this normal?

llhe commented 5 years ago

The DSP number looks problematic; SNPE may have some errors on Snapdragon 821. @lee-bin @lydoc Can you have a look?

lee-bin commented 5 years ago

> Another question: for InceptionV3, the DSP is ~132x faster than the CPU. Is this normal?

I tested it on msm8996 and got the results below. You can check the benchmark log; maybe it did not exit normally?

| model_name | device_name | soc | abi | runtime | SNPE |
|---|---|---|---|---|---|
| InceptionV3 | MI 5s | msm8996 | armeabi-v7a | DSP | 67.856 |
| VGG16 | MI 5s | msm8996 | armeabi-v7a | DSP | 141.415 |
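For reference, a quick ratio check using the SNPE numbers from the two tables (plain arithmetic, nothing framework-specific) shows why the 4.996 ms figure is implausible:

```python
snpe_cpu_ms = 664.404     # SNPE CPU, InceptionV3, from the first table
reported_dsp_ms = 4.996   # the suspicious SNPE DSP number
measured_dsp_ms = 67.856  # SNPE DSP on MI 5s (same SoC), table above

print(round(snpe_cpu_ms / reported_dsp_ms))  # the ~132x figure in question
print(round(snpe_cpu_ms / measured_dsp_ms))  # roughly 10x, a plausible DSP speedup
```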
lydoc commented 5 years ago

> I benchmarked the models on the OnePlus 3T platform. The performance of the quantized TFLite models is worse than that of the float models. Has anyone run into the same issue?


This issue was caused by the num_threads argument. For the TFLite benchmark we pin to the available big cores with the taskset command, and the default number of threads is 4. The msm8996 has 2 big cores and 2 little cores, so you can use:

python tools/benchmark.py --output_dir=output --frameworks=all \
--runtimes=all --model_names=all \
--target_abis=arm64-v8a \
--num_threads=2
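To illustrate the mismatch (a sketch, not the benchmark's actual code): on a big.LITTLE SoC the big cores are the ones with the highest maximum frequency, and the TFLite thread count should match how many of them the run is pinned to. The frequencies below are assumed values in the msm8996's ballpark, with 4 default threads contending for only 2 pinned cores:

```python
# Assumed cpuinfo_max_freq values (kHz) for CPUs 0-3 on an msm8996-like SoC
freqs = [1593600, 1593600, 2150400, 2150400]

# Big cores are those running at the highest max frequency
big_cores = [i for i, f in enumerate(freqs) if f == max(freqs)]
num_threads = len(big_cores)

print(big_cores, num_threads)  # 2 big cores -> use --num_threads=2
```

With the default of 4 threads, two threads per pinned core oversubscribe the big cluster, which is why the quantized TFLite runs came out slower than float.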

Besides, on some SoCs such as msm8996, it may be faster to use all CPU cores instead of binding to the big cores only. You can change the relevant code: https://github.com/XiaoMi/mobile-ai-bench/blob/ff6667dafe6189b04724e583913d968322ca7c0e/tools/sh_commands.py#L394
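As a sketch of what that change amounts to: taskset takes a hexadecimal CPU affinity mask, so binding to the big cores versus all cores is just a different mask (the core numbering below is an assumption, not taken from the repo):

```python
def cpu_mask(cores):
    """Build the hexadecimal affinity mask that taskset expects."""
    mask = 0
    for c in cores:
        mask |= 1 << c
    return format(mask, "x")

# Assuming the msm8996's big cores are CPUs 2 and 3:
print(cpu_mask([2, 3]))        # mask for the big cores only
print(cpu_mask([0, 1, 2, 3]))  # mask for all four cores
```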

liyancas commented 5 years ago

@lee-bin Does the XiaoMi 5S support both OpenCL GPU and DSP? I have tried a Pixel phone with msm8996 and got a runtime exception. The same issue can be found at https://developer.qualcomm.com/forum/qdn-forums/software/snapdragon-neural-processing-engine-sdk/34526

liyancas commented 5 years ago

@lydoc I will try again later. Many thanks.

lee-bin commented 5 years ago

@liyancas Yes, the MI 5S supports both OpenCL GPU and DSP, and SNPE works fine on the CPU/GPU/DSP of the MI 5S. So it seems to be a problem with the Pixel phone.

llhe commented 5 years ago

Google does not support OpenCL on their devices (an unofficial source says this is because the OpenCL trademark is held by Apple).