Open cypamigon opened 1 week ago
I've tried to look into the C code at stm32ai_application_code/object_detection/Application/STM32H747I-DISCO/Src/CM7/app_postprocess.c to see how they are post-processing the output of the model using the objdetect_ssd_st_pp_process() function. However, this function comes from a precompiled library, so I can't see its contents.
In the README of this project, it is specified that some post-processing is required to extract the proper bounding box coordinates: "In the context of Object detections model there are several filtering algorithms to apply at the output of the model in order to get the proper bounding boxes."
Could you provide the required post-processing steps to correctly extract the bounding boxes?
Hi,
The output of SSD MobileNet v2 is expressed as anchors and relative bounding boxes. What you have found so far is the array of bounding boxes, but these values are relative to the anchors and still need to be decoded against them.
From what I recall, each final bounding box n can be computed from bbox[n], anchor[n] and prediction[n].
Here anchor[n] holds the window coordinates and bbox[n] is a position relative to that window. Combining the two gives the final bbox coordinates (which probably need to be multiplied by the image size). prediction[n] is the model's prediction for that box (whether the box contains a cup or not). All bounding boxes can be calculated and are available, but you are only interested in those with a high prediction probability.
The anchors and bounding boxes are both expressed as N-by-4 arrays. Each of the N rows holds 4 values that can be interpreted as bounding box coordinates.
The formula can be found in this TensorFlow code, in the comment at line 57: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/detection_postprocess.c
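The decoding described above can be sketched in Python as follows. This is a minimal sketch, assuming both the box and the anchor are laid out as (y, x, h, w) in center-size form; the function name `decode_center_size` is hypothetical, and the default scale factors are the ones used in the script later in this thread, not values read from the model.

```python
import math


def decode_center_size(box, anchor, scales=(10.0, 10.0, 5.0, 5.0)):
    """Decode one (y, x, h, w) offset against one (y, x, h, w) anchor.

    Returns (ymin, xmin, ymax, xmax) in the same normalized units as the anchor.
    The default scales are an assumption, not values read from the model.
    """
    y, x, h, w = box
    anchor_y, anchor_x, anchor_h, anchor_w = anchor
    y_scale, x_scale, h_scale, w_scale = scales

    # The center is the anchor center shifted by a fraction of the anchor size.
    ycenter = y / y_scale * anchor_h + anchor_y
    xcenter = x / x_scale * anchor_w + anchor_x

    # The size is the anchor size multiplied by an exponential factor.
    half_h = 0.5 * math.exp(h / h_scale) * anchor_h
    half_w = 0.5 * math.exp(w / w_scale) * anchor_w

    return (ycenter - half_h, xcenter - half_w, ycenter + half_h, xcenter + half_w)


# A zero offset reproduces the anchor itself, expressed as corners.
print(decode_center_size((0.0, 0.0, 0.0, 0.0), (0.5, 0.5, 0.2, 0.4)))
```

Because the offsets are signed, a negative x offset simply moves the box center to the left of the anchor center, and a negative h or w offset makes the box smaller than the anchor.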
SSD MobileNet v2 is based around anchors. Each anchor can be considered a "window" in the image where the model zooms in and looks for a particular object. These windows vary in size depending on the depth in the model (relative to the first input layer). In the first layers, where we still have a lot of spatial information, the anchors are small because we can expect to find the features of a small cup, for instance. As we go deeper into the model, the original image becomes more and more convolved: we lose spatial information but gain meaning, since we have gone through multiple layers that extracted the "meaning" within the image. Here we search in bigger anchors.
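To make the idea of windows at several scales concrete, here is a small illustrative sketch. The grid sizes and window sizes are invented for the example and are not the values used by this model; real SSD anchor generators also use several aspect ratios per cell. The helper `make_anchor_grid` is hypothetical.

```python
def make_anchor_grid(grid_size, anchor_size):
    """Return (ycenter, xcenter, h, w) anchors, one per cell of a square grid."""
    step = 1.0 / grid_size
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Each anchor is centered in its grid cell, in normalized coordinates.
            anchors.append(((row + 0.5) * step, (col + 0.5) * step, anchor_size, anchor_size))
    return anchors


# Hypothetical pyramid: a fine grid with small windows (early layers)
# down to a coarse grid with large windows (deep layers).
levels = [(8, 0.1), (4, 0.3), (2, 0.6)]
all_anchors = [a for grid, size in levels for a in make_anchor_grid(grid, size)]
print(len(all_anchors))  # 64 + 16 + 4 = 84 anchors
```

The model emits one box offset and one score vector per anchor, which is why the boxes and anchors arrays have the same number of rows.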
The model zoo has changed quite a lot, and I could not find the Python training script, which also had some pointers to help you decode the output.
Hi @alexandre200 ,
Thanks for your help. I've tried to calculate the bounding box coordinates using the anchor values, as explained in the link you provided:
// Object Detection model produces axis-aligned boxes in two formats:
// BoxCorner represents the upper left corner (xmin, ymin) and
// the lower right corner (xmax, ymax).
// CenterSize represents the center (xcenter, ycenter), height and width.
// BoxCornerEncoding and CenterSizeEncoding are related as follows:
// ycenter = y / y_scale * anchor.h + anchor.y;
// xcenter = x / x_scale * anchor.w + anchor.x;
// half_h = 0.5*exp(h/ h_scale)) * anchor.h;
// half_w = 0.5*exp(w / w_scale)) * anchor.w;
// ymin = ycenter - half_h
// ymax = ycenter + half_h
// xmin = xcenter - half_w
// xmax = xcenter + half_w
Unfortunately, the bounding boxes I'm getting from these calculations are still not relevant. For example, here is the result when I'm using the model ssd_mobilenet_v2_fpnlite_035_416_int8.tflite provided in this repository, which is trained for person detection:
And here is the Python script:
import numpy as np
import tensorflow as tf
import cv2

# Scale factors used to decode the box offsets
y_scale = 10.0
x_scale = 10.0
h_scale = 5.0
w_scale = 5.0

# Get images stream from the webcam
image_height, image_width = 480, 640
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, image_width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, image_height)

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path="ssd_mobilenet_v2_fpnlite_035_416_int8.tflite")
interpreter.allocate_tensors()

# Get input and output details of the model
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
model_image_height = input_details[0]['shape'][1]
model_image_width = input_details[0]['shape'][2]

# Process images from video stream
while True:
    ret, frame = cap.read()
    if not ret:
        # Stop when no frame could be read from the webcam
        break

    # Preprocess the input image
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_rgb, (model_image_width, model_image_height))  # Resize image to match model's expected sizing
    input_data = np.expand_dims(frame_resized, axis=0).astype(np.uint8)  # Add batch dimension and convert to uint8

    # Set the input tensor
    interpreter.set_tensor(input_details[0]['index'], input_data)

    # Run inference
    interpreter.invoke()

    # Get the output tensors (this index order is assumed; check it against output_details)
    scores = interpreter.get_tensor(output_details[0]['index'])[0]
    boxes = interpreter.get_tensor(output_details[1]['index'])[0]
    anchors = interpreter.get_tensor(output_details[2]['index'])[0]

    boxes_filtered = []
    scores_filtered = []

    # Loop over all detections and keep those whose confidence is above the minimum threshold
    for i in range(len(boxes)):
        if scores[i][1] > 0.7:
            # Decode the output to get bounding box coordinates (ymin, xmin, ymax, xmax) based on the anchors.
            # This assumes both boxes and anchors are laid out as (y, x, h, w).
            y_box, x_box, h_box, w_box = boxes[i]
            y_anchor, x_anchor, h_anchor, w_anchor = anchors[i]
            y_center = y_box / y_scale * h_anchor + y_anchor
            x_center = x_box / x_scale * w_anchor + x_anchor
            half_h = 0.5 * (np.exp(h_box / h_scale) * h_anchor)
            half_w = 0.5 * (np.exp(w_box / w_scale) * w_anchor)
            ymin = y_center - half_h
            xmin = x_center - half_w
            ymax = y_center + half_h
            xmax = x_center + half_w

            # Rescale coordinates to the original image
            ymin = int(ymin * image_height)
            xmin = int(xmin * image_width)
            ymax = int(ymax * image_height)
            xmax = int(xmax * image_width)

            # cv2.dnn.NMSBoxes expects boxes as [x, y, width, height]
            boxes_filtered.append([xmin, ymin, xmax - xmin, ymax - ymin])
            scores_filtered.append(float(scores[i][1]))

    # Remove overlapping boxes
    indices = cv2.dnn.NMSBoxes(bboxes=boxes_filtered, scores=scores_filtered, score_threshold=0.7, nms_threshold=0.4)

    # Draw remaining bounding boxes
    if len(indices) > 0:
        for i in np.array(indices).flatten():
            x, y, w, h = boxes_filtered[i]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 4)

    # Display the result
    cv2.imshow('Image', frame)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
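The overlap-removal step in the script relies on `cv2.dnn.NMSBoxes`, which takes its boxes as (x, y, width, height). For checking results independently of OpenCV, the same idea can be sketched in plain Python working directly on (ymin, xmin, ymax, xmax) corners. This is a simple greedy non-maximum suppression written for illustration, not the routine used by the precompiled ST library; the names `iou` and `nms` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (ymin, xmin, ymax, xmax) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.4):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```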
What I don't understand is why I'm getting negative values for bounding boxes and anchors in the output tensors. Coordinates should be positive, right? For instance, here are the raw values (no post-processing applied) after one inference:
Boxes : 0.051099066 -0.17884673 0.051099066 -0.25549534
Anchors : -0.018108593 0.40744334 0.733398 1.1680043
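According to the TensorFlow formula quoted earlier in this thread, the raw box values are signed offsets relative to an anchor rather than coordinates, so negative values there are not by themselves a sign of an error. Decoded boxes for windows near the image border can also end up slightly outside the [0, 1] range. One common way to handle that, shown here as a sketch and not as what the ST post-processing library does, is to clip the decoded box before scaling it to pixels. The helper `to_pixel_box` is hypothetical.

```python
def to_pixel_box(box, image_height, image_width):
    """Clip a normalized (ymin, xmin, ymax, xmax) box to [0, 1] and scale to pixels."""
    ymin, xmin, ymax, xmax = (min(max(v, 0.0), 1.0) for v in box)
    return (int(ymin * image_height), int(xmin * image_width),
            int(ymax * image_height), int(xmax * image_width))


# A decoded box that sticks out past the top and right edges of a 640x480 frame.
print(to_pixel_box((-0.05, 0.25, 0.5, 1.2), 480, 640))  # (0, 160, 240, 640)
```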
Hello,
I'm trying to use a custom model I've built with the training scripts in this repository to detect coffee cups in a video, but I can't figure out how to interpret the output of the model (especially the bounding box coordinates). I'm using Python to perform inference.
Here is my script:
When running this script, several bounding boxes are drawn at the top left of the window, even if there is no cup in the image:
![Inference_Result](https://github.com/STMicroelectronics/stm32ai-modelzoo/assets/116843639/3b2e2416-7cc9-4800-8755-4e15dd339c72)
The bounding box coordinates I'm receiving seem weird because they include negative values. For instance:
ymin: -0.06561637 xmin: 0.016404092 ymax: 0.14763683 xmax: -0.23785934
I've tried to switch from my custom model to the ssd_mobilenet_v2_fpnlite_035_416_int8.tflite model provided in the pretrained_models section of this repository. It behaves a bit differently: when nobody is on the screen (it has been trained to detect persons), no bounding boxes are drawn. However, when a person is present, the bounding boxes still appear in incorrect positions. I believe I'm either not interpreting the output of my model correctly or not preprocessing the input images correctly before inference.
Could you please explain how to correctly use an object detection model?
As it might help, here are the properties of my model:
![Capture](https://github.com/STMicroelectronics/stm32ai-modelzoo/assets/116843639/a2a7f1ef-63e5-43c6-812e-1ac6ab75bc08)