STMicroelectronics / stm32ai-modelzoo

AI Model Zoo for STM32 devices

Object detection - How to use a model? #41

Open cypamigon opened 1 week ago

cypamigon commented 1 week ago

Hello,

I'm trying to use a custom model I've built with the training scripts in this repository to detect coffee cups in a video, but I can't figure out how to interpret the output of the model (especially the bounding box coordinates). I'm using Python to perform inference.

Here is my script:

import numpy as np
import tensorflow as tf
import cv2

# Get image stream from the webcam
image_height, image_width = 480, 640
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, image_width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, image_height)

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path="cup_quantized.tflite")
interpreter.allocate_tensors()

# Get input and output details of the model
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
model_image_height = input_details[0]['shape'][1]
model_image_width = input_details[0]['shape'][2]

# Process images from video stream
while True:
    ret, frame = cap.read()

    # Preprocess the input image
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_rgb, (model_image_width, model_image_height))  # Resize image to match model's expected sizing
    #frame_resized = (frame_resized.astype(np.float32) - 127.5) / 127.5  # Normalize the input image to [-1;1] -> Is it needed?
    input_data = np.expand_dims(frame_resized, axis=0).astype(np.uint8) # Add batch dimension and convert to uint8

    # Set the input tensor
    interpreter.set_tensor(input_details[0]['index'], input_data)

    # Run inference
    interpreter.invoke()

    # Get the output tensor
    scores = interpreter.get_tensor(output_details[0]['index'])[0] 
    boxes = interpreter.get_tensor(output_details[1]['index'])[0] 

    # Loop over all detections and draw detection box if confidence is above minimum threshold
    for i in range(len(boxes)):
        if scores[i][1] > 0.5:
            # Get bounding box coordinates
            ymin, xmin, ymax, xmax = boxes[i]

            # Interpreter can return coordinates that are outside of image dimensions, need to force them to be within image using max() and min()
            ymin = int(max(1, (ymin * image_height)))
            xmin = int(max(1, (xmin * image_width)))
            ymax = int(min(image_height, (ymax * image_height)))
            xmax = int(min(image_width, (xmax * image_width)))

            # Draw bounding box
            cv2.rectangle(frame, (xmin,ymin), (xmax,ymax), (0, 255, 0), 4)

    # Display the result
    cv2.imshow('Image', frame)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

When running this script, several bounding boxes are drawn at the top left of the window, even if there is no cup in the image: Inference_Result

The bounding box coordinates I'm receiving seem weird because they include negative values. For instance: ymin: -0.06561637 xmin: 0.016404092 ymax: 0.14763683 xmax: -0.23785934

I've tried to switch from my custom model to the ssd_mobilenet_v2_fpnlite_035_416_int8.tflite model provided in the pretrained_models section of this repository. It behaves a bit differently: when nobody is on the screen (it has been trained to detect persons), no bounding boxes are drawn. However, when a person is present, the bounding boxes still appear in incorrect positions.

I believe I'm either not interpreting the output of my model correctly or not preprocessing the input images correctly before inference.

Could you please explain how to correctly use an object detection model?

As it might help, here are the model properties of my model: Capture
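
For completeness, this is how I print the same information directly from the interpreter (just the standard get_output_details() call, nothing model-specific), to check the shape, dtype and quantization of each output tensor:

# Print name, shape, dtype and (scale, zero_point) of every output tensor
for i, detail in enumerate(interpreter.get_output_details()):
    print(f"Output {i}: name={detail['name']}, shape={detail['shape']}, "
          f"dtype={detail['dtype']}, quantization={detail['quantization']}")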

cypamigon commented 4 days ago

I've tried to look into the C code at stm32ai_application_code/object_detection/Application/STM32H747I-DISCO/Src/CM7/app_postprocess.c to see how they post-process the output of the model using the objdetect_ssd_st_pp_process() function. However, this function comes from a precompiled library, so I can't see its contents.

In the README of this project, it is specified that some post-processing is required to extract the proper bounding box coordinates: "In the context of Object detections model there are several filtering algorithms to apply at the output of the model in order to get the proper bounding boxes."

Could you provide the required post-processing steps to correctly extract the bounding boxes?

alexandre200 commented 3 days ago

Hi,

the output of SSD MobileNet V2 is expressed as anchors plus relative bounding boxes. What you have found so far is the array of bounding boxes, but these boxes need to be decoded relative to the anchors.

From what I recall, each final bounding box n can be computed from bbox[n], anchor[n] and prediction[n].

Here anchor[n] gives the window coordinates, bbox[n] is a position relative to that window (this is where you get the final box coordinates, which probably still need to be multiplied by the image size), and prediction[n] is your model's prediction for that box (there is a cup in the box or there is not). All bounding boxes can be computed and are available, but you are only interested in those with a high prediction probability.

Note that the anchors and the bounding boxes are both N-by-4 tensors: for each of the N detections you get 4 values that can be interpreted as box coordinates.

The formula can be found in this TensorFlow code, in the comment around line 57: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/detection_postprocess.c
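
Very roughly, the decoding looks like this in NumPy. This is just a sketch: it assumes the anchors are stored as (ycenter, xcenter, h, w), that boxes and anchors are already dequantized to float, and that the scale values are the TFLite defaults of 10, 10, 5 and 5, so double-check against your export:

import numpy as np

def decode_boxes(raw_boxes, anchors, y_scale=10.0, x_scale=10.0, h_scale=5.0, w_scale=5.0):
    # raw_boxes and anchors: float arrays of shape (N, 4) in (y, x, h, w) order
    ty, tx, th, tw = raw_boxes[:, 0], raw_boxes[:, 1], raw_boxes[:, 2], raw_boxes[:, 3]
    ya, xa, ha, wa = anchors[:, 0], anchors[:, 1], anchors[:, 2], anchors[:, 3]

    ycenter = ty / y_scale * ha + ya
    xcenter = tx / x_scale * wa + xa
    half_h = 0.5 * np.exp(th / h_scale) * ha
    half_w = 0.5 * np.exp(tw / w_scale) * wa

    # Returns (ymin, xmin, ymax, xmax), still normalized to the model input size
    return np.stack([ycenter - half_h, xcenter - half_w,
                     ycenter + half_h, xcenter + half_w], axis=-1)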

SSD MobileNet V2 is built around anchors. Each anchor can be considered as a "window" in the image where the model zooms in and looks for a particular object. These windows vary in size depending on the depth in the model (relative to the first input layer). At the first layers, where we still have a lot of spatial information, the anchors are small, because that is where we could expect to find the features of a small cup, for instance. As we go deeper into the model, the original image gets more and more convolved, so we lose spatial information but gain meaning, since multiple layers have extracted the "meaning" within the image; there we search with bigger anchors.

The model zoo has changed quite a lot, and I could not find the Python training scripts, which also had some pointers to help you decode the output.

cypamigon commented 1 day ago

Hi @alexandre200 ,

Thanks for your help. I've tried to calculate the bounding box coordinates using the anchor values, as explained in the link you provided:

// Object Detection model produces axis-aligned boxes in two formats:
// BoxCorner represents the upper left corner (xmin, ymin) and
// the lower right corner (xmax, ymax).
// CenterSize represents the center (xcenter, ycenter), height and width.
// BoxCornerEncoding and CenterSizeEncoding are related as follows:
// ycenter = y / y_scale * anchor.h + anchor.y;
// xcenter = x / x_scale * anchor.w + anchor.x;
// half_h = 0.5*exp(h/ h_scale)) * anchor.h;
// half_w = 0.5*exp(w / w_scale)) * anchor.w;
// ymin = ycenter - half_h
// ymax = ycenter + half_h
// xmin = xcenter - half_w
// xmax = xcenter + half_w

Unfortunately, the bounding boxes I'm getting from these calculations are still not correct. For example, here is the result when I'm using the ssd_mobilenet_v2_fpnlite_035_416_int8.tflite model provided in this repository, which is trained for person detection:

Capture

And here is the Python script:

import numpy as np
import tensorflow as tf
import cv2

y_scale = 10.0
x_scale = 10.0
h_scale = 5.0
w_scale = 5.0

# Get image stream from the webcam
image_height, image_width = 480, 640
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, image_width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, image_height)

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path="ssd_mobilenet_v2_fpnlite_035_416_int8.tflite")
interpreter.allocate_tensors()

# Get input and output details of the model
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
model_image_height = input_details[0]['shape'][1]
model_image_width = input_details[0]['shape'][2]

# Process images from video stream
while True:
    ret, frame = cap.read()

    # Preprocess the input image
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_rgb, (model_image_width, model_image_height))  # Resize image to match model's expected sizing
    input_data = np.expand_dims(frame_resized, axis=0).astype(np.uint8) # Add batch dimension and convert to uint8

    # Set the input tensor
    interpreter.set_tensor(input_details[0]['index'], input_data)

    # Run inference
    interpreter.invoke()

    # Get the output tensor
    scores = interpreter.get_tensor(output_details[0]['index'])[0] 
    boxes = interpreter.get_tensor(output_details[1]['index'])[0] 
    anchors = interpreter.get_tensor(output_details[2]['index'])[0]

    boxes_filtered = []
    scores_filtered = []

    # Loop over all detections and draw detection box if confidence is above minimum threshold
    for i in range(len(boxes)):
        if scores[i][1] > 0.7:

            #  Decode the output to get bounding boxes coordinates (ymin, xmin, ymax, xmax) based on the anchors
            y_box, x_box, h_box, w_box = boxes[i]
            y_anchor, x_anchor, h_anchor, w_anchor = anchors[i]

            y_center = y_box/y_scale * h_anchor + y_anchor
            x_center = x_box/x_scale * w_anchor + x_anchor
            half_h = 0.5 * (np.exp(h_box/h_scale) * h_anchor)
            half_w = 0.5 * (np.exp(w_box/w_scale) * w_anchor)

            ymin = y_center - half_h
            xmin = x_center - half_w
            ymax = y_center + half_h
            xmax = x_center + half_w

            # Rescale coordinate to original image
            ymin = int(ymin * image_height)
            xmin = int(xmin * image_width)
            ymax = int(ymax * image_height)
            xmax = int(xmax * image_width)

            boxes_filtered.append([ymin, xmin, ymax, xmax])
            scores_filtered.append(float(scores[i][1]))

    # Remove overlapping boxes (cv2.dnn.NMSBoxes expects boxes as (x, y, width, height))
    nms_boxes = [[xmin, ymin, xmax - xmin, ymax - ymin] for ymin, xmin, ymax, xmax in boxes_filtered]
    indices = cv2.dnn.NMSBoxes(bboxes=nms_boxes, scores=scores_filtered, score_threshold=0.7, nms_threshold=0.4)

    # Draw remaining bounding boxes
    if len(indices) > 0:
        for i in indices.flatten():
            ymin, xmin, ymax, xmax = boxes_filtered[i]
            cv2.rectangle(frame, (xmin,ymin), (xmax,ymax), (0, 255, 0), 4)

    # Display the result
    cv2.imshow('Image', frame)

    # Break the loop if 'q' is pressed
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

What I don't understand is why I'm getting negative values for the bounding boxes and anchors in the output tensors. Coordinates should be positive, right? For instance, here are the raw values (no post-processing applied) after an inference:

Boxes: 0.051099066 -0.17884673 0.051099066 -0.25549534
Anchors: -0.018108593 0.40744334 0.733398 1.1680043