Open SWXxing opened 9 months ago
Thank you for your interest in my work. I have uploaded the code for processing Next-QA with GLIP, which can be found in src/tools/README.md and src/tools/extract_glip_bboxes.py.

The item_dict['video_features'] refers to CLIP's [cls] token, and item_dict['bbox_features'] represents the features extracted by CLIP from the regions produced by GLIP's bbox extraction.

It's a bit cumbersome, but necessary, since GLIP's ROI features cannot be directly extracted. Furthermore, even if they could be, they wouldn't align with CLIP's BERT representation.
Thank you for your reply. I see that extract_embedding.py loads a pre-trained CLIP model to obtain image features for a whole image, but the repository does not show how to extract region features for the bboxes generated by GLIP. Can you provide details on how to use CLIP to extract feature representations of the regions for given bboxes?
Using CLIP to extract features from a bbox region is straightforward: crop the original image using OpenCV or Pillow, then use CLIP to extract features from the cropped region.
Hi, can you provide the code for extracting region features with CLIP after GLIP's bbox extraction?
Sorry for the late response, I've been quite busy lately. Here is a simple example code:
import torch
from PIL import Image

x1, y1, x2, y2 = map(int, bbox)                  # GLIP outputs float coordinates
cropped_image = image_np[y1:y2, x1:x2]           # crop the bbox region (H, W, C)
cropped_image_pil = Image.fromarray(cropped_image)
tensor = val_transform(cropped_image_pil).unsqueeze(0)  # add a batch dimension
with torch.no_grad():                            # inference only, no gradients
    output = model.encode_image(tensor)
I've been quite busy recently; if I find the time, I will organize and release the complete code.
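Putting the pieces together, a minimal sketch of the crop step might look like the following. The bbox format (x1, y1, x2, y2), the clamping to image bounds, and the synthetic frame are assumptions; in the real pipeline the frame would come from the decoded video and the boxes from GLIP's detections, and each crop would then go through CLIP's preprocessing and model.encode_image(...) to produce item_dict['bbox_features'].

```python
import numpy as np
from PIL import Image

def crop_bbox_regions(image_np, bboxes):
    """Crop each (x1, y1, x2, y2) bbox out of an HxWxC uint8 frame.

    Coordinates are cast to int and clamped to the image bounds,
    since GLIP outputs float coordinates that may slightly exceed them.
    """
    h, w = image_np.shape[:2]
    crops = []
    for x1, y1, x2, y2 in bboxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 <= x1 or y2 <= y1:  # skip degenerate or fully out-of-bounds boxes
            continue
        crops.append(Image.fromarray(image_np[y1:y2, x1:x2]))
    return crops

# Synthetic example frame; bboxes here are hypothetical GLIP-style outputs.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
boxes = [(10.4, 20.9, 110.2, 120.7), (300.0, 200.0, 400.0, 300.0)]
crops = crop_bbox_regions(frame, boxes)
print(len(crops), crops[0].size)  # → 2 (100, 100); the second box is clipped
```

Each returned PIL image can be fed directly to the val_transform / encode_image snippet above.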
Thanks for doing such a great job! Two questions: 1) Is item_dict['video_features'] supposed to be obtained via a pre-trained CLIP? 2) Is item_dict['bbox_features'] obtained via a pre-trained GLIP?
When will the extraction feature embedding code for 'bbox_features' using pre-trained GLIP be available?