STMicroelectronics / stm32ai-modelzoo

AI Model Zoo for STM32 devices

Reduce inference time #43

Open CyprienAmigon opened 1 month ago

CyprienAmigon commented 1 month ago

Hello,

I am using the model ssd_mobilenet_v2_fpnlite_035_416_int8.tflite from object_detection/pretrained_models/ssd_mobilenet_v2_fpnlite/ST_pretrainedmodel_public_dataset/coco_2017_person/ssd_mobilenet_v2_fpnlite_035_416, and the inference times are not as good as expected.

I'm running this object detection model with Python on an STM32MP157F-DK2 with an image resolution of 416x416x3. According to the table below (taken from here), the expected inference time should be around 894.00 ms. However, I'm seeing inference times closer to 2000 ms.

What could be causing such a significant difference? Could it be due to the use of Python and to the ST Linux distribution running in parallel?

Reference MPU inference time based on COCO Person dataset (see Accuracy for details on dataset)

| Model | Format | Resolution | Quantization | Board | Execution Engine | Frequency | Inference time (ms) | %NPU | %GPU | %CPU | X-LINUX-AI version | Framework |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 192x192x3 | per-channel** | STM32MP257F-DK2 | NPU/GPU | 800 MHz | 35.08 | 6.20 | 93.80 | 0 | v5.1.0 | OpenVX |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 224x224x3 | per-channel** | STM32MP257F-DK2 | NPU/GPU | 800 MHz | 48.92 | 6.19 | 93.81 | 0 | v5.1.0 | OpenVX |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 256x256x3 | per-channel** | STM32MP257F-DK2 | NPU/GPU | 800 MHz | 40.66 | 7.07 | 92.93 | 0 | v5.1.0 | OpenVX |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 416x416x3 | per-channel** | STM32MP257F-DK2 | NPU/GPU | 800 MHz | 110.4 | 4.47 | 95.53 | 0 | v5.1.0 | OpenVX |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 192x192x3 | per-channel | STM32MP157F-DK2 | 2 CPU | 800 MHz | 193.70 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 224x224x3 | per-channel | STM32MP157F-DK2 | 2 CPU | 800 MHz | 263.60 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 256x256x3 | per-channel | STM32MP157F-DK2 | 2 CPU | 800 MHz | 339.40 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 416x416x3 | per-channel | STM32MP157F-DK2 | 2 CPU | 800 MHz | 894.00 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 192x192x3 | per-channel | STM32MP135F-DK2 | 1 CPU | 1000 MHz | 287.40 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 224x224x3 | per-channel | STM32MP135F-DK2 | 1 CPU | 1000 MHz | 383.40 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 256x256x3 | per-channel | STM32MP135F-DK2 | 1 CPU | 1000 MHz | 498.90 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
| SSD Mobilenet v2 0.35 FPN-lite | Int8 | 416x416x3 | per-channel | STM32MP135F-DK2 | 1 CPU | 1000 MHz | 1348.00 | NA | NA | 100 | v5.1.0 | TensorFlowLite 2.11.0 |
mguSTM commented 1 month ago

Hello,

Thank you for your message.

There are several possible explanations for the differences in performance observed.

First, I think it is important to mention that the performance figures provided in the model zoo README are obtained from benchmarking, which is different from a real application.

The benchmarks are done using X-LINUX-AI v5.1.0; during benchmarking on the STM32MP157F-DK2, both CPU cores are used to run inference.

As a first test, you should be able to reproduce the benchmark with the benchmark command on your board and find the same results as in the README.
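If the benchmark command is not available on your image, a minimal Python timing loop like the sketch below can approximate it by isolating the interpreter's invoke() time (the model filename, thread count and number of runs are placeholders to adapt to your setup):

import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Sketch only: measure the bare invoke() time of the quantized model, averaged over several runs, with both CPU cores enabled
interpreter = tflite.Interpreter("ssd_mobilenet_v2_fpnlite_035_416_int8.tflite", num_threads=2)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]

# Random input of the right shape and dtype so that only the inference itself is timed
dummy = np.random.randint(np.iinfo(input_details['dtype']).min,
                          np.iinfo(input_details['dtype']).max,
                          size=input_details['shape'],
                          dtype=input_details['dtype'])
interpreter.set_tensor(input_details['index'], dummy)

interpreter.invoke()  # warm-up run
timings = []
for _ in range(10):
    start = time.perf_counter()
    interpreter.invoke()
    timings.append((time.perf_counter() - start) * 1000.0)
print("Average inference time: {:.1f} ms".format(sum(timings) / len(timings)))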

Secondly, in a real application the CPUs of your board are also used for other purposes, such as pre-processing the data before inference or post-processing the NN model outputs. In that case, the CPU time available for inference can be lower than during the benchmark, which adds an overhead that depends on the CPU load.

However, the inference overhead should normally not be that high. Another explanation could be that you are not using both CPU cores of the board for inference in your TensorFlow Lite interpreter.

Could you please provide us with additional information on the code you use to test the model and measure the inference time?

Thank you

CyprienAmigon commented 1 month ago

Hi,

Thanks for your help.

I've benchmarked the ssd_mobilenet_v2_fpnlite_035_416_int8.tflite model using the ST Edge AI Developer Cloud and I'm getting the expected inference time: ~890 ms.

Here is my code:

# Perform object detection on a video stream using a SSD Mobilenet model 

import numpy as np
import cv2 
from ssd_mobilenet_postprocess import postprocess_predictions
import time 

USE_STM32MPU = False # Set the USE_STM32MPU flag to True if you are running this script on an STM32MPU board

# Fill in the variables according to the YAML file configuration detailed at https://github.com/STMicroelectronics/stm32ai-modelzoo/tree/main/object_detection/src/prediction
model_path =  "../../models/ssd_mobilenet_v2_fpnlite_035_416_int8.tflite" 
model_type = "ssd_mobilenet_v2_fpnlite"
class_names = ['person']
num_classes = len(class_names)
interpolation_type = cv2.INTER_NEAREST
rescaling_scale = 1/127.5
rescaling_offset = -1

# Load the TFLite model
if USE_STM32MPU:
    import tflite_runtime.interpreter as tflite
    interpreter_quant = tflite.Interpreter(model_path) # Load the TFLite model
else:
    import tensorflow as tf
    interpreter_quant = tf.lite.Interpreter(model_path) 

# Initialize the video stream
cap = cv2.VideoCapture(0) 
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

interpreter_quant.allocate_tensors()

# Get input and output details of the model
input_details = interpreter_quant.get_input_details()[0]
outputs_details = interpreter_quant.get_output_details()
input_shape = input_details['shape']
input_index_quant = interpreter_quant.get_input_details()[0]["index"]

# Process images from video stream
while True:
    ret, image = cap.read()

    # Pre-process the image 
    if len(image.shape) != 3:
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    height, width, _ = image.shape
    resized_image = cv2.resize(image, (int(input_shape[2]), int(input_shape[1])), interpolation=interpolation_type) # Resize to the model's input size; cv2.resize expects (width, height) and input_shape is [1, H, W, C]
    image_data = resized_image * rescaling_scale + rescaling_offset 
    input_image_shape = [height, width] 

    # Quantize the float image to the model's integer input range using the (scale, zero_point) pair from the input details
    image_processed = (image_data / input_details['quantization'][0]) + input_details['quantization'][1]
    image_processed = np.clip(np.round(image_processed), np.iinfo(input_details['dtype']).min, np.iinfo(input_details['dtype']).max)
    image_processed = image_processed.astype(input_details['dtype'])

    image_data = image_processed
    image_processed = np.expand_dims(image_data, 0)

    # Set the input tensor 
    interpreter_quant.set_tensor(input_index_quant, image_processed)

    # Run inference
    start_time = time.time() 
    interpreter_quant.invoke()
    end_time = time.time()

    # Get the output tensors
    predictions = [interpreter_quant.get_tensor(outputs_details[j]["index"]) for j in range(len(outputs_details))]

    # Post-process the predictions
    preds_decoded = postprocess_predictions(predictions=predictions, image_size = [width,height], nms_thresh = 0.5, confidence_thresh = 0.6)

    for c in preds_decoded:
        for bb in preds_decoded[c]:
            bbox_thick = int(0.6 * (height + width) / 600)
            x1 = int(bb[1])
            y1 = int(bb[2])
            x2 = int(bb[3])
            y2 = int(bb[4])
            cv2.rectangle(image,(x1,y1), (x2, y2),(0,255,0),2)
            cv2.putText(image, '{}-{:.2f}'.format(class_names[c-1],bb[0]), (x1,y1-2), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), bbox_thick//2, lineType=cv2.LINE_AA)
            cv2.putText(image, 'Inference time: {:.2f} ms'.format((end_time - start_time)*1000), (10,30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), bbox_thick//2, lineType=cv2.LINE_AA)

    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    cv2.imshow('image',image)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

Another file containing the post-processing functions for the ssd_mobilenet model is used. I took them from stm32ai-modelzoo/object_detection/src/postprocessing/postprocess.py.

mguSTM commented 1 month ago

Hi,

What you can try is to force the use of both CPU cores in your TensorFlow Lite interpreter:

interpreter_quant = tflite.Interpreter(model_path, num_threads=2)

If you weren't already using both threads, you should see a significant improvement in performance.

Please let me know if it is better with this option.

Thank you

CyprienAmigon commented 1 month ago

Unfortunately, specifying the number of threads to the interpreter did not change the behaviour. Thanks for your help.

mguSTM commented 1 month ago

Which version of X-LINUX-AI are you using on your STM32MP157F?

CyprienAmigon commented 1 month ago

I'm using X-LINUX-AI version v5.0.0.

mguSTM commented 1 month ago

Hi,

I think that the problem comes from the use of OpenCV for the camera pipeline. I tried to run your code without the NN part (I commented out the inference and the post-processing) and both CPU cores of the board are almost 100% loaded (I monitored this on target with "top"). The CPU time left to run inference is therefore low.

For applications that need a camera pipeline, we use GStreamer, which is much more efficient in terms of CPU consumption than OpenCV on target.

You can find examples of GStreamer use in the X-LINUX-AI out-of-the-box applications such as image classification or object detection.
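If you prefer to stay in Python, one option, assuming your OpenCV build on the target includes the GStreamer backend, is to let GStreamer handle the capture and format conversion and only hand ready-to-use frames to OpenCV. A minimal sketch (the device node, resolution and appsink options are placeholders to adapt):

import cv2

# Sketch only: build the capture pipeline with GStreamer so that conversion and buffering happen outside the Python loop
pipeline = (
    "v4l2src device=/dev/video0 ! "
    "video/x-raw,width=640,height=480 ! "
    "videoconvert ! video/x-raw,format=BGR ! "
    "appsink drop=true max-buffers=1"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
if not cap.isOpened():
    raise RuntimeError("Could not open the GStreamer pipeline; check the OpenCV build and the device node")

ret, frame = cap.read()  # frames arrive already converted to BGR at 640x480

This keeps the per-frame work done in Python to a minimum and leaves more CPU time for the TensorFlow Lite interpreter.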

I hope this will help you,

Best regards