amazon-science / mm-cot

Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned and more will be updated)
https://arxiv.org/abs/2302.00923
Apache License 2.0
3.8k stars 314 forks

Question: The code to generate Vision Features #35

Open roapple10 opened 1 year ago

roapple10 commented 1 year ago

I would like to study the vision features in more detail. Would it be possible to share the code that generates the .npy file? Much appreciated, thanks for the hard work here.

gianfrancodemarco commented 1 year ago

Bump

Francesco-Ranieri commented 1 year ago

Hi, I wrote this code snippet for visual feature extraction. Unfortunately, the features it produces on the ScienceQA dataset differ slightly from those shipped with this repository. Even so, they have the same shape and allow running both classification and rationale generation. Hope it is useful.

from transformers import AutoImageProcessor, DetrForObjectDetection
from PIL import Image
import torch

pretrained_model = "facebook/detr-resnet-101-dc5"
image_processor = AutoImageProcessor.from_pretrained(pretrained_model)
model = DetrForObjectDetection.from_pretrained(pretrained_model)
model.eval()  # inference mode

image_path = "img.jpg"
image = Image.open(image_path).convert("RGB")  # DETR expects 3-channel input
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():  # no gradients needed; also makes .numpy() safe below
    outputs = model(**inputs)

# the last hidden states are the final query embeddings of the Transformer decoder
vision_features = outputs.last_hidden_state.numpy()  # shape: (1, 100, 256)
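To collect these per-question features into a single .npy file like the one the repository loads, one option (a sketch under my own assumptions, not the authors' released pipeline) is to stack one `(100, 256)` array per question and write a zero placeholder for questions without an image context:

```python
import numpy as np

# Assumed per-question feature shape from the DETR snippet above:
# 100 decoder queries x 256 hidden dimensions.
FEAT_SHAPE = (100, 256)

def build_feature_matrix(per_question_feats):
    """Stack per-question features into one array.

    Questions without an image (None) get an all-zero placeholder of
    the same shape, so row i always corresponds to question i.
    """
    rows = [f if f is not None else np.zeros(FEAT_SHAPE, dtype=np.float32)
            for f in per_question_feats]
    return np.stack(rows).astype(np.float32)

# Toy example: one question with an image, one without.
feats = [np.ones(FEAT_SHAPE, dtype=np.float32), None]
matrix = build_feature_matrix(feats)
np.save("vision_features.npy", matrix)  # shape: (2, 100, 256)
```

Whether the official file uses zero rows (or some other placeholder) for image-free questions is exactly the kind of detail only the authors can confirm.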
aiPenguin commented 1 year ago

Thanks to the authors for this awesome work!

Some questions in the dataset contain both an image for the question and images for the choices. I was wondering how the authors obtained the visual features in this case. Was some pooling function applied?

How do you deal with this case, Francesco-Ranieri?

Francesco-Ranieri commented 1 year ago

As far as I understood from their implementation, a single image feature vector is used for each question. Since the code for vision feature generation is not available, we need an answer from the authors to know whether any pooling function was applied. However, I honestly think only one image was taken into consideration.
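If one did want to merge features from several images into the single `(100, 256)` vector the model expects, an element-wise mean over the image axis is one plausible strategy. This is purely my assumption, not the authors' confirmed method:

```python
import numpy as np

def pool_image_features(feature_list):
    """Merge DETR features from several images of one question by
    element-wise mean pooling, keeping the (100, 256) shape that a
    single-image question would produce.

    This pooling choice is an assumption for illustration; the
    authors have not said whether (or how) they pool."""
    stacked = np.stack(feature_list)  # (n_images, 100, 256)
    return stacked.mean(axis=0)       # (100, 256)

# Toy example: a question image plus one choice image.
question_img = np.full((100, 256), 2.0, dtype=np.float32)
choice_img = np.zeros((100, 256), dtype=np.float32)
pooled = pool_image_features([question_img, choice_img])
```

Max pooling or simply keeping the first image would be equally easy to swap in.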

aiPenguin commented 1 year ago

Same opinion as yours. But I found that the .npy file contains more feature vectors than there are questions with image contexts, so I opened another issue about it. #46