EdjeElectronics / TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi

A tutorial showing how to train, convert, and run TensorFlow Lite object detection models on Android devices, the Raspberry Pi, and more!
Apache License 2.0

Why does the Coral USB Accelerator require more time than the CPU of the RasPi4 when analyzing single images with TFLite_detection_image.py ? #125

Open christianbaun opened 2 years ago

christianbaun commented 2 years ago

When using the Coral USB Accelerator on a RasPi 4 (4 GB) with Raspbian (Debian 10.11), the performance of TFLite_detection_webcam.py for analyzing webcam video is much better (12-20 FPS) than without the Coral USB Accelerator (3-4 FPS). This is the expected result.

But when I use TFLite_detection_image.py for analyzing single images, it is faster without the Coral USB Accelerator.

Is this a normal observation? What causes the performance loss?

I modified TFLite_detection_image.py a bit, just to write the output into a file and not create a window.

https://github.com/christianbaun/pestdetector/blob/main/TFLite_detection_image_modified.py

When I measure the time required to analyze an image with the command-line tool time, the real time with the Coral USB Accelerator is longer than without it.

Without the Coral USB Accelerator:

$ time python3 TFLite_detection_image_modified.py \
--modeldir=/home/pi/model_2021_07_08 \
--graph=detect.tflite \
--labels=/home/pi/model_2021_07_08/labelmap.txt \
--image=testimage.jpg

real    0m1,174s
user    0m1,236s
sys 0m0,754s

With the Coral USB Accelerator:

$ time python3 TFLite_detection_image_modified.py \
--modeldir=/home/pi/model_2021_07_08 \
--graph=detect_edgetpu.tflite \
--labels=/home/pi/model_2021_07_08/labelmap.txt \
--edgetpu \
--image=testimage.jpg 

real    0m3,831s
user    0m1,118s
sys 0m0,729s

I also tried a loop over 170 images. The real time was less than 2 minutes without the Coral USB Accelerator, compared with more than 10 minutes with it.

What causes this? Why does the Coral USB Accelerator affect the performance negatively when analyzing single images?

Is there any chance to improve the situation and have some benefit when using the Coral USB Accelerator for this (non-video) purpose?

$ uname -a
Linux raspberrypi 5.10.63-v7l+ #1496 SMP Wed Dec 1 15:58:56 GMT 2021 armv7l GNU/Linux

christianbaun commented 2 years ago

The performance of the USB port is not the root cause of the issue.

$ lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
    |__ Port 2: Dev 4, If 0, Class=Application Specific Interface, Driver=, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M

The Coral USB Accelerator is connected to one of the USB 3.0 ports of the RasPi4.

christianbaun commented 2 years ago

Maybe an explanation for the bad performance can be found in the Google Coral examples:

https://github.com/google-coral/pycoral/blob/master/examples/classify_image.py

"Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory."

This makes sense. When the Coral USB Accelerator is used on several image files one at a time, the model is loaded into the TPU memory again for every image file. When a video file or stream is processed, this overhead occurs only once.

If this is the root cause of the bad performance, I see only two possible solutions:

  1. Copy the model into the TPU memory in advance and avoid loading it for every image (is this possible?) or
  2. Provide a stream instead of single images.

If all these assumptions are correct, is using the Coral USB Accelerator useful at all for analyzing single images?
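One way to check this assumption is to time the first inference separately from the ones that follow. The following is only a minimal sketch: `split_first_and_rest` and `FakeModel` are hypothetical names that do not appear in the original script, and `FakeModel` merely simulates a one-time load cost with `time.sleep`. In a real test, `infer` would wrap a single inference on an interpreter that has already been created.

```python
import time
from statistics import mean


def split_first_and_rest(infer, inputs):
    """Time each call to infer(); return (first_latency, mean_of_remaining).

    `infer` is a hypothetical stand-in for "run one image through the model".
    """
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        infer(item)
        latencies.append(time.perf_counter() - start)
    if not latencies:
        raise ValueError("need at least one input")
    first, rest = latencies[0], latencies[1:]
    return first, (mean(rest) if rest else None)


class FakeModel:
    """Simulated model (assumption): slow first call, fast afterwards."""

    def __init__(self, load_cost=0.05, run_cost=0.005):
        self.loaded = False
        self.load_cost = load_cost
        self.run_cost = run_cost

    def __call__(self, item):
        if not self.loaded:
            time.sleep(self.load_cost)  # one-time cost, e.g. model upload
            self.loaded = True
        time.sleep(self.run_cost)       # per-image cost
        return item


if __name__ == "__main__":
    first, steady = split_first_and_rest(FakeModel(), range(10))
    print(f"first: {first:.3f}s, steady state: {steady:.3f}s")
```

If the first latency is much larger than the steady-state mean, the one-time load dominates whenever each image is processed in a fresh process.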

EdjeElectronics commented 2 years ago

Thanks for investigating this! I'm not surprised that it takes more time to run a single image, but it should be faster when you run multiple images. How did you test the 170 images? You can use the imagedir argument to point the script at a folder full of images (e.g. --imagedir=testimages), and it will loop through all of them without needing to modify the code.

Can you try putting the 170 images in a folder, running the TFLite_detection_image.py script, and showing me the resulting execution time with and without the USB Accelerator?

christianbaun commented 2 years ago

I tested the 170 images with the Coral USB Accelerator this way:

#!/bin/bash
NUMBER_OF_RUNS=0
# The loop variable must match the ${DATEI} reference below (shell variable names are case-sensitive).
for DATEI in $(find "/home/pi/images/" -type f | egrep -i "\.jpg|\.jpeg")
do
  NUMBER_OF_RUNS=$(echo "${NUMBER_OF_RUNS} + 1" | bc)
  echo "run ${NUMBER_OF_RUNS}"
  python3 ../TFLite_detection_image_modified.py \
  --modeldir=/home/pi/model_2021_07_08 \
  --graph=detect_edgetpu.tflite \
  --labels=/home/pi/model_2021_07_08/labelmap.txt \
  --edgetpu \
  --image=${DATEI}
done

$ time ./performance_test_coral_tpu
...
real    10m7,781s
user    2m39,826s
sys 1m53,368s

And I tested the 170 images without the Coral USB Accelerator this way:

#!/bin/bash
NUMBER_OF_RUNS=0
# The loop variable must match the ${DATEI} reference below (shell variable names are case-sensitive).
for DATEI in $(find "/home/pi/images/" -type f | egrep -i "\.jpg|\.jpeg")
do
  NUMBER_OF_RUNS=$(echo "${NUMBER_OF_RUNS} + 1" | bc)
  echo "run ${NUMBER_OF_RUNS}"
  python3 ../TFLite_detection_image_modified.py \
  --modeldir=/home/pi/model_2021_07_08 \
  --graph=detect.tflite \
  --labels=/home/pi/model_2021_07_08/labelmap.txt \
  --image=${DATEI}
done

$ time ./performance_test
...
real    1m58,900s
user    2m32,782s
sys 1m41,334s

Sadly, my modification of your code, which writes the image to a file instead of opening a window, returns a nasty error message when I use the --imagedir argument instead of --image. And sadly, I am not smart enough to fix this.

$ python3 TFLite_detection_image_modified.py --modeldir=/home/pi/model_2021_07_08 --graph=detect_edgetpu.tflite --labels=/home/pi/model_2021_07_08/labelmap.txt --edgetpu --imagedir=/home/pi/images/
/home/pi/model_2021_07_08/detect_edgetpu.tflite

Traceback (most recent call last):
  File "TFLite_detection_image_modified.py", line 223, in <module>
    cv2.imwrite(filename, image)
cv2.error: OpenCV(4.5.4) /tmp/pip-wheel-2c57qphc/opencv-python_86774b87799240fbaa4c11c089d08cc3/opencv/modules/imgcodecs/src/loadsave.cpp:728: error: (-2:Unspecified error) could not find a writer for the specified extension in function 'imwrite_'

Your code works perfectly, but I cannot measure the time because a window is created.
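The OpenCV message means that cv2.imwrite could not map the extension of the given file name to an image encoder, which typically happens when the name has no extension or an unrecognized one. The thread does not show how the modified script builds `filename`, so the following is only a guess at a fix. It is a minimal sketch of a hypothetical helper, `result_path`, that derives an output name with a usable extension from each input image path. The set of extensions is an assumption about a typical OpenCV build.

```python
import os

# Extensions assumed here to be writable by a typical OpenCV build.
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def result_path(image_path, out_dir, suffix="_result"):
    """Build an output file name that always ends in a known image extension."""
    stem, ext = os.path.splitext(os.path.basename(image_path))
    if ext.lower() not in KNOWN_EXTENSIONS:
        ext = ".jpg"  # fall back to JPEG when the extension is missing or unknown
    return os.path.join(out_dir, stem + suffix + ext)


if __name__ == "__main__":
    print(result_path("/home/pi/images/testimage.jpg", "/home/pi/results"))
```

The returned string could then be passed as the first argument of cv2.imwrite inside the loop over the images.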

christianbaun commented 2 years ago

I succeeded in fixing my modified version of your script so that the --imagedir argument works again. I did test runs with the same folder of images I tested 3-4 weeks ago, and the acceleration effect of the Coral USB Accelerator is visible.

Without the Coral USB Accelerator:

$ time python3 TFLite_detection_image_modified.py --modeldir=/home/pi/model_2021_07_08/ --graph=detect.tflite --labels=/home/pi/model_2021_07_08/labelmap.txt --imagedir=/home/pi/images/
...
real    0m58,304s
user    0m54,717s
sys 0m5,715s

With the Coral USB Accelerator:

$ time python3 TFLite_detection_image_modified.py --modeldir=/home/pi/model_2021_07_08/ --graph=detect_edgetpu.tflite --labels=/home/pi/model_2021_07_08/labelmap.txt  --edgetpu --imagedir=/home/pi/images/
...
real    0m20,073s
user    0m10,619s
sys 0m5,698s

When I compare this with the measurements I did 3-4 weeks ago, it is obvious that working with folders of images performs much better than handling single images.

Using the Coral USB Accelerator in directory mode is approx. 6 times faster than using just the CPU in single-image mode, and approx. 30 times faster than using the Coral USB Accelerator in single-image mode.
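These ratios can be reproduced from the "real" values reported in this thread. The sketch below assumes a hypothetical helper, `real_seconds`, that converts the output format of time (with a decimal comma, as printed above) into seconds. The four inputs are copied from the measurements; everything else is illustration.

```python
import re


def real_seconds(value):
    """Convert a `time` value such as '10m7,781s' to seconds.

    Accepts a decimal comma (as in the outputs above) or a decimal point.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:[.,]\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds.replace(",", "."))


# "real" values copied from the measurements in this thread (170 images).
cpu_single = real_seconds("1m58,900s")   # CPU, one process per image
tpu_single = real_seconds("10m7,781s")   # Coral, one process per image
cpu_dir = real_seconds("0m58,304s")      # CPU, --imagedir
tpu_dir = real_seconds("0m20,073s")      # Coral, --imagedir

print(f"{cpu_single / tpu_dir:.1f}")  # 5.9
print(f"{tpu_single / tpu_dir:.1f}")  # 30.3
print(f"{cpu_single / cpu_dir:.1f}")  # 2.0
print(f"{cpu_dir / tpu_dir:.1f}")     # 2.9
```

The last line compares both devices in directory mode, where the Coral USB Accelerator is roughly 2.9 times faster than the CPU.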

I also wonder why using just the CPU in directory mode is approx. twice as fast as using just the CPU in single-image mode. I had not expected such a strong effect. It is probably caused by the overhead of starting and stopping the Python interpreter for every image and by the additional context switches (process switching).

It is sad that I probably cannot accelerate the workflow for single images with the Coral USB Accelerator, because in my opinion it is the most flexible use case.

I am still looking for a solution, but so far I have not found one. If anyone here has an idea, I would appreciate a reply here or a message.
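From the CPU measurements above, the per-process overhead works out to roughly (118.9 s - 58.3 s) / 170 images, or about 0.36 s per image. To see how much of that is plain interpreter startup on a given machine, the cost of starting Python can be measured directly. This is a minimal sketch; `startup_seconds` is a hypothetical helper, and `import json` only stands in for the heavier imports of the real script (the traceback shows that it uses cv2).

```python
import subprocess
import sys
import time


def startup_seconds(code="pass", runs=5):
    """Average wall-clock time to start Python, run `code`, and exit."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", code], check=True)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    print(f"bare interpreter: {startup_seconds():.3f}s per start")
    # Heavy imports add to this; "import json" is only a stand-in that is
    # always available, not what the detection script imports.
    print(f"with an import:   {startup_seconds('import json'):.3f}s per start")
```

On the Raspberry Pi, replacing the stand-in with the imports the detection script actually performs would give a closer estimate of the fixed cost paid for every image in single-image mode.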
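One direction that follows from the measurements in this thread is to keep a single process alive, load the model once, and hand it image paths as they arrive. The sketch below shows only that pattern and is not a tested solution for the Coral USB Accelerator. `serve` and `load_placeholder` are hypothetical names, and the placeholder returns a fixed string where the real interpreter setup and detection would go.

```python
import io
import sys


def serve(load_model, lines, out=None):
    """Load the model once, then handle one image path per input line.

    `load_model` is a hypothetical factory returning a callable that takes an
    image path and returns a result; it stands in for interpreter setup.
    """
    out = sys.stdout if out is None else out
    detect = load_model()  # paid once, not once per image
    handled = 0
    for line in lines:
        path = line.strip()
        if not path:
            continue  # ignore blank lines
        print(f"{path}\t{detect(path)}", file=out, flush=True)
        handled += 1
    return handled


def load_placeholder():
    """Placeholder (assumption): replace with real interpreter setup."""
    return lambda path: "no-detections"


if __name__ == "__main__":
    # Demo with an in-memory stream; pass sys.stdin instead to feed paths
    # from another process.
    demo_input = io.StringIO("a.jpg\n\nb.jpg\n")
    serve(load_placeholder, demo_input)
```

With `sys.stdin` as `lines`, another program could write one path per line whenever a new image is ready, so each image stays an independent request while the load cost is paid only at startup.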