Open CyprienAmigon opened 1 month ago
Hello,
I am using the model ssd_mobilenet_v2_fpnlite_035_416_int8.tflite from object_detection/pretrained_models/ssd_mobilenet_v2_fpnlite/ST_pretrainedmodel_public_dataset/coco_2017_person/ssd_mobilenet_v2_fpnlite_035_416, and the inference times are not as good as expected. I'm running this object detection model with Python on an STM32MP157F-DK2 with an image resolution of 416x416x3. According to the table below (taken from here), the expected inference time should be around 894.00 ms. However, I'm experiencing inference times closer to 2000 ms.
What could be causing such a significant difference? Could it be due to the use of Python and the ST Linux distribution running in parallel?
Reference MPU inference time based on COCO Person dataset (see Accuracy for details on dataset)
Hello,
Thank you for your message.
There are several possible explanations for the performance difference you observed.
First, it is important to mention that the performance figures provided in the model zoo README are obtained by benchmarking, which is different from a real application.
The benchmarks are done using X-LINUX-AI v5.1.0, and during benchmarking on the STM32MP157F-DK2 both CPU cores are used to run inference.
As a first test, you should be able to reproduce the benchmark and obtain the same results as in the README by running the TensorFlow Lite benchmark tool on your board:
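For illustration, something along these lines, where the benchmark_model install path is an assumption that depends on your X-LINUX-AI version:

```bash
# Path is an assumption: locate the benchmark_model binary shipped with your X-LINUX-AI version
/usr/local/bin/tensorflow-lite-*/tools/benchmark_model \
    --graph=ssd_mobilenet_v2_fpnlite_035_416_int8.tflite \
    --num_threads=2
```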
Secondly, in a real application the CPUs of your board may also be used for other purposes, such as pre-processing the data before inference or post-processing the NN model outputs. In that case, the CPU bandwidth available for inference can be lower than during the benchmark, which can add overhead depending on the CPU load.
Normally, however, the inference overhead should not be that high. Another explanation could be that you are not using both CPU cores of the board for inference in your TensorFlow Lite interpreter.
Could you please provide us with additional information on the code you use to test the model and measure the inference time?
Thank you
Hi,
Thanks for your help.
I've tried to benchmark the ssd_mobilenet_v2_fpnlite_035_416_int8.tflite model using ST Edge AI Developer Cloud, and I'm getting the expected inference time: ~890 ms.
Here is my code:

```python
# Perform object detection on a video stream using an SSD MobileNet model
import time

import numpy as np
import cv2

from ssd_mobilenet_postprocess import postprocess_predictions

# Set the USE_STM32MPU flag to True if you are running this script on an STM32MPU board
USE_STM32MPU = False

# Fill in the variables according to the YAML file configuration detailed at
# https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/object_detection/src/prediction
model_path = "../../models/ssd_mobilenet_v2_fpnlite_035_416_int8.tflite"
model_type = "ssd_mobilenet_v2_fpnlite"
class_names = ['person']
num_classes = len(class_names)
interpolation_type = cv2.INTER_NEAREST
rescaling_scale = 1 / 127.5
rescaling_offset = -1

# Load the TFLite model
if USE_STM32MPU:
    import tflite_runtime.interpreter as tflite
    interpreter_quant = tflite.Interpreter(model_path)
else:
    import tensorflow as tf
    interpreter_quant = tf.lite.Interpreter(model_path)

# Initialize the video stream
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

interpreter_quant.allocate_tensors()

# Get input and output details of the model
input_details = interpreter_quant.get_input_details()[0]
outputs_details = interpreter_quant.get_output_details()
input_shape = input_details['shape']
input_index_quant = input_details['index']

# Process images from the video stream
while True:
    ret, image = cap.read()
    if not ret:
        break

    # Pre-process the image
    if len(image.shape) != 3:
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    height, width, _ = image.shape

    # Resize to match the model's input shape, then rescale to the training range
    resized_image = cv2.resize(image, (int(input_shape[1]), int(input_shape[2])),
                               interpolation=interpolation_type)
    image_data = resized_image * rescaling_scale + rescaling_offset

    # Quantize the input to the model's integer input type
    scale, zero_point = input_details['quantization']
    image_processed = (image_data / scale) + zero_point
    image_processed = np.clip(np.round(image_processed),
                              np.iinfo(input_details['dtype']).min,
                              np.iinfo(input_details['dtype']).max)
    image_processed = np.expand_dims(image_processed.astype(input_details['dtype']), 0)

    # Set the input tensor
    interpreter_quant.set_tensor(input_index_quant, image_processed)

    # Run inference
    start_time = time.time()
    interpreter_quant.invoke()
    end_time = time.time()

    # Get the output tensors
    predictions = [interpreter_quant.get_tensor(out["index"]) for out in outputs_details]

    # Post-process the predictions
    preds_decoded = postprocess_predictions(predictions=predictions, image_size=[width, height],
                                            nms_thresh=0.5, confidence_thresh=0.6)

    # Draw the detections (bbox_thick is computed once per frame so the
    # inference-time overlay below works even when there is no detection)
    bbox_thick = int(0.6 * (height + width) / 600)
    for c in preds_decoded:
        for bb in preds_decoded[c]:
            x1, y1, x2, y2 = int(bb[1]), int(bb[2]), int(bb[3]), int(bb[4])
            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(image, '{}-{:.2f}'.format(class_names[c - 1], bb[0]), (x1, y1 - 2),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), bbox_thick // 2, lineType=cv2.LINE_AA)
    cv2.putText(image, 'Inference time: {:.2f} ms'.format((end_time - start_time) * 1000), (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), bbox_thick // 2, lineType=cv2.LINE_AA)

    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    cv2.imshow('image', image)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```
Another file containing the post-processing function for the SSD MobileNet model is also used. I took it from stm32ai-modelzoo/object_detection/src/postprocessing/postprocess.py.
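As a side note, to compare more fairly with the benchmark figures, it may be worth averaging the invoke() time over several runs after a warm-up. A minimal sketch, with a hypothetical helper reusing the interpreter_quant from the script above:

```python
import time

def measure_inference_ms(interpreter, runs=10):
    # Hypothetical helper: assumes the input tensor has already been set
    interpreter.invoke()  # warm-up run, excluded from the measurement
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) * 1000 / runs

# e.g. print('Avg inference: {:.2f} ms'.format(measure_inference_ms(interpreter_quant)))
```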
Hi,
What you can try is to force the use of both CPU cores in your TensorFlow Lite interpreter (note that the Python Interpreter API parameter is num_threads):

```python
interpreter_quant = tflite.Interpreter(model_path, num_threads=2)
```
If you weren't already using both threads, you should see a significant improvement in performance.
Please let me know if it is better with this option.
Thank you
Unfortunately, specifying the number of threads to the interpreter did not change the behaviour. Thanks for your help.
Which version of X-LINUX-AI are you using on your STM32MP157F?
I'm using X-LINUX-AI version v5.0.0.
Hi,
I think the problem comes from the use of OpenCV for the camera pipeline. I ran your code without the NN part (with the inference and post-processing commented out), and each of the board's 2 CPU cores was already at almost 100% load (monitored on target with "top"). The CPU bandwidth left to run inference is therefore low.
For applications that need a camera pipeline, we use GStreamer, which is much more efficient in terms of CPU consumption than OpenCV on target.
You can find examples of GStreamer use in the X-LINUX-AI out-of-the-box applications, such as image classification or object detection; a minimal sketch follows below.
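For reference, one way to keep your Python script while routing the capture through GStreamer is OpenCV's GStreamer backend. This is only an illustration, not the pipeline used by the X-LINUX-AI applications; the device path and caps are assumptions, and your OpenCV build must have been compiled with GStreamer support:

```python
import cv2

# Assumed GStreamer pipeline: /dev/video0, 640x480 capture, converted to BGR for OpenCV.
# drop=true / max-buffers=1 keeps only the latest frame so slow inference does not back up the queue.
pipeline = (
    "v4l2src device=/dev/video0 ! "
    "video/x-raw,width=640,height=480 ! "
    "videoconvert ! video/x-raw,format=BGR ! "
    "appsink drop=true max-buffers=1"
)

# Requires an OpenCV build with GStreamer support; falls back to an unopened capture otherwise
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
```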
I hope this will help you,
Best regards