Open SWXxing opened 9 months ago
Thank you for your interest in my work. I have uploaded the code for processing Next-QA with GLIP, which can be found in src/tools/README.md and src/tools/extract_glip_bboxes.py.

The item_dict['video_features'] refers to CLIP's [cls] token, and item_dict['bbox_features'] represents the features extracted by CLIP from the regions produced by GLIP's bbox extraction.

It's a bit cumbersome, but necessary, since GLIP's ROI features cannot be directly extracted. Furthermore, even if they could be, they wouldn't align with CLIP's BERT representation.
Thank you for your reply. I see that extract_embedding.py loads a pre-trained CLIP model to obtain image features for a whole image, but the repository does not show how to extract region features for the bboxes generated by GLIP. Can you provide details on how to use CLIP to extract feature representations of the regions for given bboxes?
Using CLIP to extract features from a bbox region is straightforward: crop the original image using OpenCV or Pillow, then use CLIP to extract features from the cropped region.
Hi, can you provide the code for extracting region features with CLIP after GLIP's bbox extraction?
Sorry for the late response, I've been quite busy lately. Here is a simple example code:
import torch
from PIL import Image

x1, y1, x2, y2 = map(int, bbox)                  # GLIP outputs float coordinates
cropped_image = image_np[y1:y2, x1:x2]           # crop the bbox region (H, W, C)
cropped_image_pil = Image.fromarray(cropped_image)
tensor = val_transform(cropped_image_pil).unsqueeze(0)  # add a batch dimension
with torch.no_grad():                            # inference only, no gradients
    output = model.encode_image(tensor)
I've been quite busy recently; if I find the time, I will organize and release the complete code.
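Putting the pieces together, a minimal sketch of the crop step might look like the following. The bbox format (x1, y1, x2, y2), the clamping to image bounds, and the synthetic frame are assumptions; in the real pipeline the frame would come from the decoded video and the boxes from GLIP's detections, and each crop would then go through CLIP's preprocessing and model.encode_image(...) to produce item_dict['bbox_features'].

```python
import numpy as np
from PIL import Image

def crop_bbox_regions(image_np, bboxes):
    """Crop each (x1, y1, x2, y2) bbox out of an HxWxC uint8 frame.

    Coordinates are cast to int and clamped to the image bounds,
    since GLIP outputs float coordinates that may slightly exceed them.
    """
    h, w = image_np.shape[:2]
    crops = []
    for x1, y1, x2, y2 in bboxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 <= x1 or y2 <= y1:  # skip degenerate or fully out-of-bounds boxes
            continue
        crops.append(Image.fromarray(image_np[y1:y2, x1:x2]))
    return crops

# Synthetic example frame; bboxes here are hypothetical GLIP-style outputs.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
boxes = [(10.4, 20.9, 110.2, 120.7), (300.0, 200.0, 400.0, 300.0)]
crops = crop_bbox_regions(frame, boxes)
print(len(crops), crops[0].size)  # → 2 (100, 100); the second box is clipped
```

Each returned PIL image can be fed directly to the val_transform / encode_image snippet above.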
Thanks for doing such a great job! Two questions: 1) Is item_dict['video_features'] supposed to be obtained via a pre-trained CLIP? 2) Is item_dict['bbox_features'] obtained via a pre-trained GLIP?
When will the extraction feature embedding code for 'bbox_features' using pre-trained GLIP be available?