PINTO0309 / Tensorflow-bin

Prebuilt binary with Tensorflow Lite enabled. For RaspberryPi / Jetson Nano. Support for custom operations in MediaPipe. XNNPACK, XNNPACK Multi-Threads, FlexDelegate.
https://qiita.com/PINTO
Apache License 2.0

Is it really multi-threaded? #20

Closed zanazakaryaie closed 4 years ago

zanazakaryaie commented 4 years ago

[Required] Your device (RaspberryPi3, LaptopPC, or other device name):
Raspberry Pi 3B+

[Required] Your device's CPU architecture (armv7l, x86_64, or other architecture name):
armv7l

[Required] Your OS (Raspbian, Ubuntu1604, or other os name):
Raspbian Stretch

[Required] Details of the work you did before the problem occurred:
I just followed the instructions you mentioned step-by-step

[Required] Error message:
There is no error message.

[Required] Overview of problems and questions:
I'm using htop to see how many cores are used. Setting --num_threads=1 works as expected: a single core is used. But setting --num_threads=4 doesn't show 4 cores in use! Again, only a single core is used.
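For reference, the thread count can also be requested directly from the Python interpreter rather than via a command-line flag; below is a minimal sketch, assuming a tflite_runtime (or tensorflow) wheel recent enough to expose the num_threads constructor argument, with the model path as a placeholder:

```python
# Minimal sketch: ask the TFLite interpreter for 4 CPU threads.
# Assumes a wheel that exposes the num_threads constructor argument;
# older builds may only offer interpreter.set_num_threads(4) instead.
import numpy as np
from tflite_runtime.interpreter import Interpreter  # or: tf.lite.Interpreter

interpreter = Interpreter(model_path="some_model.tflite",  # placeholder path
                          num_threads=4)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])

# Loop a few times and watch htop to see whether all cores are busy.
for _ in range(100):
    interpreter.set_tensor(input_details[0]["index"], dummy)
    interpreter.invoke()
```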

PINTO0309 commented 4 years ago

See the issue below. https://github.com/tensorflow/tensorflow/issues/35784

PINTO0309 commented 4 years ago

@zanazakaryaie See below for a sample that maximizes performance with Tensorflow Lite and RaspberryPi. When using MobileNet, it is quite fast to infer with only CPU. Tensorflow Lite is optimized for arm64 (aarch64) OS. https://github.com/PINTO0309/PINTO_model_zoo#pinto_model_zoo

zanazakaryaie commented 4 years ago

Thanks @PINTO0309. I had seen those links before, and I just tested python3 mobilenetv2ssd.py, which uses the ssdlite_mobilenet_v2_coco_300_integer_quant_with_postprocess.tflite model. This is the output on a Raspberry Pi 3B+:

resize and normalize time: 0.017828119000114384
inference + postprocess time: 0.23932114800027193
coordinates: (140, 117)-(570, 428). class: "1". probability: 0.96
coordinates: (461, 81)-(690, 172). class: "2". probability: 0.90
coordinates: (131, 220)-(315, 538). class: "17". probability: 0.90
TOTAL time: 0.2642660569999862

But htop still doesn't show the utilization of 4 cores.

PINTO0309 commented 4 years ago

@zanazakaryaie I just attached a video that was benchmarked using htop. Working with other Japanese engineers, we have verified that Raspbian (32-bit) does not deliver full performance. I also know that, on a 32-bit OS, the Weight Quantization model performs better than the Integer Quantization model.
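For context, a Weight Quantization (dynamic-range) model is produced by enabling the converter's default optimization without a calibration dataset; a rough, generic sketch of that step is shown below (the model zoo uses its own conversion scripts, and the paths here are placeholders):

```python
# Generic sketch of weight (dynamic-range) quantization: weights are stored
# as int8, while activations remain float at runtime. Paths are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("export/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # no representative dataset
tflite_model = converter.convert()

with open("model_weight_quant.tflite", "wb") as f:
    f.write(tflite_model)
```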


zanazakaryaie commented 4 years ago

@PINTO0309 Thanks for the reply. So, as far as I understood, a 32-bit OS doesn't unleash the full power of the Raspberry Pi CPU, and I have to use a 64-bit OS to make sure all threads are used. You have also noted that Tensorflow Lite has been optimized for 64-bit OSes. So I will switch to a 64-bit OS.

One more question:

If I train a custom MobileNet-SSD detector with Tensorflow and convert it to a tflite model with the Tensorflow tools, can I then easily generate those post-processed models? It seems that there are some tutorials here, but I just wanted to make sure about the procedure.

PINTO0309 commented 4 years ago

As an example, perform the Integer Quantization conversion in the order 00→01 and finally 03. If the input shape of the Placeholder is not fixed, like [1, 256, 256, 3], you can use 06_replace_placeholder to replace the Placeholder. Note that all work must be done with Tensorflow v1.15.0. Also, before you start, you need to clone the Tensorflow/models repository and add models/research and models/research/slim to PYTHONPATH.

https://github.com/PINTO0309/PINTO_model_zoo/tree/master/06_mobilenetv2-ssdlite/02_voc/01_float32
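As a rough illustration of what the Integer Quantization step does (this is not the repository's 00/01/03 scripts; the frozen-graph path, tensor names, and calibration images below are placeholders), a post-training full integer quantization with Tensorflow v1.15.0 looks roughly like this:

```python
# Rough sketch of post-training Integer Quantization with Tensorflow v1.15.0.
# The frozen graph, tensor names, and calibration images are placeholders;
# the actual scripts in PINTO_model_zoo handle these details per model.
import glob
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Feed a few hundred preprocessed sample images for calibration.
    for path in sorted(glob.glob("calibration_images/*.jpg"))[:100]:
        img = tf.keras.preprocessing.image.load_img(path, target_size=(300, 300))
        img = np.array(img, dtype=np.float32)[np.newaxis, ...] / 255.0
        yield [img]

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    "export/tflite_graph.pb",                        # placeholder frozen graph
    input_arrays=["normalized_input_image_tensor"],  # placeholder input name
    output_arrays=["TFLite_Detection_PostProcess",   # SSD postprocess outputs
                   "TFLite_Detection_PostProcess:1",
                   "TFLite_Detection_PostProcess:2",
                   "TFLite_Detection_PostProcess:3"],
    input_shapes={"normalized_input_image_tensor": [1, 300, 300, 3]},
)
converter.allow_custom_ops = True  # the postprocess op is a TFLite custom op
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open("model_integer_quant_with_postprocess.tflite", "wb") as f:
    f.write(tflite_model)
```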


zanazakaryaie commented 4 years ago

Thank you. So I can convert my custom MobileNet-SSD TensorFlow model to those integer-quantized models. Just one more question out of curiosity:

Why hasn't TensorFlow Lite itself implemented such post-processing? Could you briefly explain what goes on under the hood?

PINTO0309 commented 4 years ago

What does "TensorFlow lite itself has not considered such post-processings" refer to?

  1. About quantization
  2. Calculation of bounding box after receiving inference result
zanazakaryaie commented 4 years ago

As far as I know, TensorFlow Lite performs quantization, i.e. float weights are converted to integer weights, with some pruning (I guess). If this is true, what other post-processing do you perform on the models? Does your post-processing speed up the inference time? If it does, then why hasn't Google implemented it yet? Is your post-processing customized for the Raspberry Pi architecture?

PINTO0309 commented 4 years ago

I am not familiar with academic matters, but I understand as follows.

  1. "Pruning" and "Quantization" of model are different processing
  2. "Pruning" sets all weights below a certain threshold to zero
  3. "Quantization" reduces numerical precision of weights (e.g. Float32->Float16->Uint8->Binary)
  4. "Quantization" dramatically reduces the amount of computation instead of reducing computational accuracy

By the way, I haven't yet "Pruned" the model.
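To make point 3 concrete, 8-bit quantization maps each float weight to an integer using a per-tensor scale and zero point; a tiny numeric sketch:

```python
# Tiny illustration of affine (uint8) quantization of a float32 weight tensor:
#   q = round(x / scale) + zero_point,   x ~= (q - zero_point) * scale
import numpy as np

w = np.array([-0.8, -0.1, 0.0, 0.35, 1.2], dtype=np.float32)

qmin, qmax = 0, 255                              # uint8 range
scale = float(w.max() - w.min()) / (qmax - qmin)
zero_point = int(round(qmin - w.min() / scale))

q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
w_dequant = (q.astype(np.float32) - zero_point) * scale

print(q)          # e.g. [  0  89 102 147 255]
print(w_dequant)  # close to the original weights, with small rounding error
```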

Is your post-processing customized for the Raspberry Pi architecture?

It is not optimized for the RaspberryPi architecture, but for the ARM64 NEON instructions. The rationale is that many of the sample programs around the world, including the official Google samples, target 64-bit ARM Android (e.g. armv8). I think "Raspbian (32-bit OS)" has been abandoned by Google. The RaspberryPi has a 64-bit ARM CPU, but only a 32-bit OS is officially released, so its full performance has not been demonstrated. In the armv8 architecture, various operations are optimized and accelerated at the hardware level.

Reference article (Japanese) - armv8 architecture https://news.mynavi.jp/article/20111031-arm_v8/2

zanazakaryaie commented 4 years ago

Thanks for the explanations. They were crystal clear. I appreciate you sharing such beneficial work. Good luck to you!