ecoxial2007 / LGVA_VideoQA

Language-Guided Visual Aggregation for Video Question Answering

Processing of GLIP #10

Open Yuzuriha-Inori-x opened 6 months ago

Yuzuriha-Inori-x commented 6 months ago

Hi! Could you explain the GLIP processing steps in detail? Looking at `rFeature = item_dict['bbox_features'][:, :, 0, :, :]`, `'bbox_features'` does not appear to be extracted through the image-encoder branch of CLIP; it looks like features extracted directly by GLIP.

Yuzuriha-Inori-x commented 6 months ago

Also, if GLIP is used as the object detection model, can the number of detected objects in a given frame be fewer than 10? If so, how is that handled? If such frames are simply skipped, could there end up being too few frames that satisfy the requirement of more than 10 objects?

ecoxial2007 commented 6 months ago

Typically, we adjust the confidence threshold of the bounding boxes (bbox) to ensure that the number of detected objects is greater than 10. Subsequently, we select the top 10 bounding boxes with the highest confidence. This approach ensures that we do not encounter situations where the number of bounding boxes is insufficient.
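The selection described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name, the starting threshold of 0.5, and the 0.05 decrement step are all assumptions for demonstration.

```python
import numpy as np

def select_top_bboxes(boxes, scores, k=10, threshold=0.5, step=0.05):
    """Illustrative sketch: lower the confidence threshold until at
    least k detections survive, then keep the k highest-scoring boxes.

    boxes:  (N, 4) array of bounding-box coordinates
    scores: (N,)   array of detection confidences
    """
    # Relax the threshold until enough detections pass it.
    while threshold > 0 and (scores >= threshold).sum() < k:
        threshold -= step
    keep = scores >= threshold
    kept_boxes, kept_scores = boxes[keep], scores[keep]
    # Sort the surviving detections by confidence, descending, take top k.
    order = np.argsort(kept_scores)[::-1][:k]
    return kept_boxes[order], kept_scores[order]
```

With this scheme every frame yields a fixed-size set of k boxes, so downstream tensors like `bbox_features` keep a uniform shape across frames.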