Open Yuzuriha-Inori-x opened 6 months ago
Also, if GLIP is used as the object detection model, can the number of objects detected in a given frame be fewer than 10? If so, how is that handled? If such frames are simply skipped, won't there be too few frames left that meet the requirement of more than 10 objects?
Typically, we adjust the confidence threshold of the bounding boxes (bbox) so that at least 10 objects are detected in each frame. We then select the 10 bounding boxes with the highest confidence. This way we do not run into frames with too few bounding boxes.
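For readers who want something concrete, here is a minimal sketch of the thresholding and top-10 selection described above. It is not the authors' code: the function name `select_top_k`, the `(bbox, confidence)` tuple layout, and the `start_threshold` / `step` values are all assumptions made for illustration, and it assumes the detector's raw candidates with their scores are already available as a plain list.

```python
from typing import List, Tuple

# Hypothetical layout: ((x1, y1, x2, y2), confidence)
Box = Tuple[float, float, float, float]
Detection = Tuple[Box, float]


def select_top_k(
    detections: List[Detection],
    k: int = 10,
    start_threshold: float = 0.5,  # assumed starting value, not from the thread
    step: float = 0.1,             # assumed relaxation step, not from the thread
    min_threshold: float = 0.0,
) -> List[Detection]:
    """Lower the confidence threshold until at least k boxes pass,
    then keep the k boxes with the highest confidence."""
    threshold = start_threshold
    kept = [d for d in detections if d[1] >= threshold]

    # Relax the threshold step by step while too few boxes survive.
    while len(kept) < k and threshold > min_threshold:
        threshold = max(min_threshold, threshold - step)
        kept = [d for d in detections if d[1] >= threshold]

    # Sort by confidence (highest first) and keep the top k.
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:k]


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.05, 0.6, 0.3, 0.2, 0.1, 0.7, 0.4, 0.55, 0.15, 0.01]
    candidates = [((0.0, 0.0, 1.0, 1.0), s) for s in scores]
    top = select_top_k(candidates, k=10)
    print(len(top), [s for _, s in top])
```

Note that if the detector proposes fewer than `k` candidates in total, this sketch returns whatever it has, so the caller would still need to decide what to do with such a frame.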
Hi! Can you explain the specific steps of GLIP in detail?
rFeature = item_dict['bbox_features'][:, :, 0, :, :]
'bbox_features' does not look like features extracted through the image-encoder branch of CLIP. It looks more like features extracted directly by GLIP.