Closed — zanazakaryaie closed this issue 4 years ago
See the issue below. https://github.com/tensorflow/tensorflow/issues/35784
@zanazakaryaie See below for a sample that maximizes performance with TensorFlow Lite on the Raspberry Pi. With MobileNet, inference is quite fast using only the CPU. TensorFlow Lite is optimized for arm64 (aarch64) OSes. https://github.com/PINTO0309/PINTO_model_zoo#pinto_model_zoo
Thanks @PINTO0309
I had seen those links before, and I just tested `python3 mobilenetv2ssd.py`, which uses the `ssdlite_mobilenet_v2_coco_300_integer_quant_with_postprocess.tflite` model. This is the output on Raspberry Pi 3B+:
resize and normalize time: 0.017828119000114384
inference + postprocess time: 0.23932114800027193
coordinates: (140, 117)-(570, 428). class: "1". probability: 0.96
coordinates: (461, 81)-(690, 172). class: "2". probability: 0.90
coordinates: (131, 220)-(315, 538). class: "17". probability: 0.90
TOTAL time: 0.2642660569999862
But `htop` still doesn't show 4 cores being utilized.
@zanazakaryaie I just attached a video of the benchmark running under htop. Working in cooperation with Japanese engineers, we have verified that Raspbian (32-bit) cannot deliver full performance. I know that on a 32-bit OS, the Weight Quantization model performs better than the Integer Quantization model.
@PINTO0309 Thanks for the reply. So as far as I understood, a 32-bit OS doesn't unleash the full power of the Raspberry Pi CPU; I have to use a 64-bit OS to be sure that all threads are used. You also noted that TensorFlow Lite has been optimized for 64-bit OSes, so I will switch to a 64-bit OS.
One more question:
If I train a custom MobileNet-SSD detector with TensorFlow and convert it to a tflite model with the TensorFlow tools, can I then easily generate those post-processed models? It seems there are some tutorials here, but I just wanted to be sure about the procedure.
As an example, perform the Integer Quantization conversion in the order 00→01 and finally 03. If the input shape of the Placeholder is not fixed, like [1, 256, 256, 3], you can use 06_replace_placeholder to replace the Placeholder. Note that all work must be done with TensorFlow v1.15.0. Also, before you start, you need to clone the Tensorflow/models repository and add `models/research` and `models/research/slim` to `PYTHONPATH`.
https://github.com/PINTO0309/PINTO_model_zoo/tree/master/06_mobilenetv2-ssdlite/02_voc/01_float32
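For orientation, the final Integer Quantization step under TensorFlow v1.15.0 looks roughly like the sketch below. The file name, tensor names, input shape, and calibration data are placeholder assumptions for illustration; the actual PINTO_model_zoo scripts define their own values.

```python
# Hedged sketch of post-training integer quantization with TF v1.15.0.
# Paths, tensor names, and calibration data are placeholders, not the
# actual values used by the zoo's conversion scripts.
import numpy as np
import tensorflow as tf  # this workflow requires v1.15.0

def representative_dataset():
    # Calibration samples shaped like the model input (placeholder data;
    # real calibration should use representative images).
    for _ in range(100):
        yield [np.random.uniform(0.0, 1.0, (1, 300, 300, 3)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    "frozen_inference_graph.pb",  # placeholder path to the frozen graph
    input_arrays=["normalized_input_image_tensor"],   # placeholder name
    output_arrays=["TFLite_Detection_PostProcess"],   # placeholder name
    input_shapes={"normalized_input_image_tensor": [1, 300, 300, 3]},
)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open("model_integer_quant.tflite", "wb") as f:
    f.write(tflite_model)
```

The representative dataset is what makes this "integer" rather than plain weight quantization: the converter runs the calibration samples through the graph to pick activation ranges.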
Thank you. So I can convert my custom MobileNet-SSD TensorFlow model to those integer-quantized models. Just one more question I'm curious about: why hasn't TensorFlow Lite itself implemented such post-processing? Could you briefly explain what goes on under the hood?
What does "TensorFlow lite itself has not considered such post-processings" refer to?
As far as I know, TensorFlow Lite performs quantization, i.e. float weights are converted to integer weights, with some pruning (I guess). If this is true, what other post-processing do you perform on the models? Does your post-processing speed up inference? If it does, why hasn't Google implemented it yet? Is your post-processing customized for the Raspberry Pi architecture?
I am not familiar with academic matters, but I understand as follows.
By the way, I haven't yet "Pruned" the model.
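To illustrate what weight quantization does conceptually, here is a simplified sketch of affine 8-bit quantization: a float range is mapped onto 0..255 with a scale and a zero point. This is illustrative only, not TensorFlow Lite's exact implementation.

```python
# Simplified sketch of affine 8-bit quantization, the idea behind TFLite's
# weight quantization. Illustrative only, not TFLite's actual code.

def quantize(values, num_bits=8):
    """Map floats onto integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, 0.0, 1.0]
q, scale, zp = quantize(weights)
print(q)                          # integers in 0..255
print(dequantize(q, scale, zp))   # close to the original floats
```

Storing 8-bit integers instead of 32-bit floats shrinks the model about 4x and lets NEON-style SIMD units process more weights per instruction, which is where the speedup comes from.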
Is your post-processing customized for the Raspberry Pi architecture?
It is not optimized for the Raspberry Pi architecture, but for the ARM64 NEON instructions. The rationale is that many of the sample programs around the world, including the official Google samples, target 64-bit ARM Android (e.g. armv8). I think "Raspbian (32-bit OS)" has been abandoned by Google. The Raspberry Pi has a 64-bit ARM CPU, but only a 32-bit OS is officially released, so its full performance has not been demonstrated. In the armv8 architecture, various operations are optimized and accelerated at the hardware level.
Reference article (Japanese) - armv8 architecture https://news.mynavi.jp/article/20111031-arm_v8/2
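A quick way to see which SIMD level your kernel exposes is the Features line of `/proc/cpuinfo`: a 32-bit ARM kernel advertises `neon`, while an aarch64 kernel advertises `asimd`. The small helper below is my own sketch for illustration, not part of any official tool.

```python
# Illustrative helper (not an official tool): report the NEON/SIMD
# capability advertised in /proc/cpuinfo text.
def simd_support(cpuinfo_text):
    for line in cpuinfo_text.splitlines():
        key = line.split(":", 1)[0].strip().lower()
        if key in ("features", "flags"):
            flags = line.split(":", 1)[1].split()
            if "asimd" in flags:   # aarch64 kernels report "asimd"
                return "aarch64 NEON (asimd)"
            if "neon" in flags:    # 32-bit ARM kernels report "neon"
                return "armv7 NEON"
    return "no NEON flag found"

# Example with a sample aarch64 Features line:
sample = "Features\t: fp asimd evtstrm aes crc32\n"
print(simd_support(sample))  # → aarch64 NEON (asimd)
```

On a real Pi you would pass `open("/proc/cpuinfo").read()` to the function; seeing `neon` instead of `asimd` is a sign you are on a 32-bit kernel.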
Thanks for the explanations, they were crystal clear. I appreciate you sharing such beneficial work. Good luck to you.
[Required] Your device (RaspberryPi3, LaptopPC, or other device name):
Raspberry Pi 3B+
[Required] Your device's CPU architecture (armv7l, x86_64, or other architecture name):
armv7l
[Required] Your OS (Raspbian, Ubuntu1604, or other os name):
Raspbian Stretch
[Required] Details of the work you did before the problem occurred:
I just followed the instructions you mentioned, step by step.
[Required] Error message:
There is no error message.
[Required] Overview of problems and questions:
I'm using `htop` to see how many cores are used. Setting `--num_threads=1` works: it uses one single core. But setting `--num_threads=4` doesn't show 4 cores being used! Again, only a single core is used.
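For reference, here is a minimal sketch of how a `--num_threads` flag is typically wired into the TFLite Python interpreter. The model path and flag wiring are placeholder assumptions, not the actual zoo script, and the exact thread API depends on the tflite_runtime version.

```python
# Hedged sketch: passing a thread count to the TFLite Python interpreter.
# "detect.tflite" and the flag wiring are placeholders.
import argparse

from tflite_runtime.interpreter import Interpreter  # or: tf.lite.Interpreter

parser = argparse.ArgumentParser()
parser.add_argument("--num_threads", type=int, default=1)
args = parser.parse_args()

interpreter = Interpreter(model_path="detect.tflite")
# Older tflite_runtime builds expose an experimental set_num_threads();
# newer builds accept num_threads= in the Interpreter constructor instead.
interpreter.set_num_threads(args.num_threads)
interpreter.allocate_tensors()
```

Even when the thread count reaches the interpreter correctly, the discussion above suggests a 32-bit Raspbian build may still not spread work across all four cores, which matches the htop observation.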