roapple10 opened this issue 1 year ago
Bump
Hi, I made this code snippet for visual feature extraction. Unfortunately, the results obtained on the ScienceQA dataset differ (slightly) from those present in this repository. Despite this, the extracted features are consistent in size and allow running both classification and rationale generation. I hope it can be useful.
from transformers import AutoImageProcessor, DetrForObjectDetection
from PIL import Image
import torch

pretrained_model = "facebook/detr-resnet-101-dc5"
image_processor = AutoImageProcessor.from_pretrained(pretrained_model)
model = DetrForObjectDetection.from_pretrained(pretrained_model)
model.eval()

image_path = "img.jpg"
image = Image.open(image_path).convert("RGB")  # ensure 3 channels
inputs = image_processor(images=image, return_tensors="pt")

# run inference without building a gradient graph
with torch.no_grad():
    outputs = model(**inputs)

# the last hidden states are the final query embeddings of the Transformer decoder,
# shape (1, 100, 256): 100 object queries, 256-dim each
vision_features = outputs.last_hidden_state.numpy()
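For reference, here is a minimal sketch of how one could dump features for a whole folder of images into a single .npy file. This is not the authors' script; the directory name, the one-image-per-question assumption, and the resulting (num_images, 100, 256) layout are my own guesses. It reuses image_processor and model from the snippet above.

import os
import numpy as np

# Assumption: every question has exactly one image stored under image_dir;
# adapt the file naming/ordering to your dataset split.
image_dir = "images"
image_files = sorted(os.listdir(image_dir))

all_features = []
with torch.no_grad():
    for fname in image_files:
        img = Image.open(os.path.join(image_dir, fname)).convert("RGB")
        inputs = image_processor(images=img, return_tensors="pt")
        outputs = model(**inputs)
        # (1, 100, 256) -> (100, 256)
        all_features.append(outputs.last_hidden_state.squeeze(0).numpy())

# stack into (num_images, 100, 256) and save
np.save("vision_features.npy", np.stack(all_features))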
Thanks to the author for this awesome work!
Some questions in the dataset contain both an image for the question and images for the choices. I was wondering how the author gets the visual features in this case. Is some pooling function applied?
How do you deal with this case, Francesco-Ranieri?
As far as I understood from their implementation, a single image-feature vector is always used for each question. Since the code for generating the vision features is not available, we need an answer from the authors to know whether any pooling function was applied. However, I honestly think that only one image was taken into consideration.
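If it turns out that multiple images per question were combined, one simple option would be to average the per-image DETR features so the result keeps the (100, 256) shape the repository expects. To be clear, this is only a hypothetical pooling choice, not something confirmed by the authors:

import numpy as np

def pool_image_features(feature_list):
    # feature_list: list of (100, 256) arrays, one per image of a question.
    # Mean-pooling keeps the (100, 256) shape expected downstream;
    # this is a guess, not the authors' confirmed method.
    return np.mean(np.stack(feature_list), axis=0)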
I share your opinion. But I found that there are more feature entries in the .npy file than there are questions with image contexts, so I opened another issue about it: #46
I would like to study the vision features further. Would it be convenient to share the code used to generate the .npy file? Much appreciation for the hard work here.