TRI-ML / vlm-evaluation

VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning

Question about the Dataset Type #8

Closed Hannibal046 closed 2 months ago

Hannibal046 commented 2 months ago

Hi team, thanks so much for this great work! I am currently trying to add more trained VL models to this repo, and I have a question about the dataset type used in evaluation. In textvqa.py, there are two dataset types: https://github.com/TRI-ML/vlm-evaluation/blob/2092905d392e8dbedf01ed4b853df530e3cf9f35/vlm_eval/tasks/harnesses/textvqa.py#L123 https://github.com/TRI-ML/vlm-evaluation/blob/2092905d392e8dbedf01ed4b853df530e3cf9f35/vlm_eval/tasks/harnesses/textvqa.py#L142

The one currently used is the latter. However, I have run into several problems when trying to adapt different VLMs to this framework, because each VLM has its own implementation of how the image is encoded and how the encoded visual features are integrated into the LLM, and some models (e.g. Qwen-VL) do not even have an explicit image_processor. So my question is: how should models like this be integrated, or would it make sense to switch to the more generic IndexDataset?
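
To make the distinction concrete, this is roughly how I read the two styles; the class and field names below are my own approximation, not the exact code in textvqa.py:

# Minimal sketch (assumed names) of the two dataset styles linked above.
from PIL import Image
from torch.utils.data import Dataset


class IndexDatasetSketch(Dataset):
    """Returns raw examples; each model does its own preprocessing."""

    def __init__(self, examples):
        # examples: list of {"image_path": ..., "question": ..., "question_id": ...}
        self.examples = examples

    def __getitem__(self, idx):
        ex = self.examples[idx]
        return ex["image_path"], ex["question"], ex["question_id"]

    def __len__(self):
        return len(self.examples)


class MapDatasetSketch(Dataset):
    """Folds prompt formatting and image preprocessing into __getitem__,
    so every model receives identically processed inputs."""

    def __init__(self, examples, prompt_fn, image_processor):
        self.examples = examples
        self.prompt_fn = prompt_fn                # question str -> formatted prompt
        self.image_processor = image_processor    # PIL.Image -> pixel tensor

    def __getitem__(self, idx):
        ex = self.examples[idx]
        image = Image.open(ex["image_path"]).convert("RGB")
        return self.image_processor(image), self.prompt_fn(ex["question"]), ex["question_id"]

    def __len__(self):
        return len(self.examples)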

siddk commented 2 months ago

Hey @Hannibal046 -- this is a great question. Ultimately, this was just a design choice we made: folding all preprocessing/formatting logic into the MapDataset classes ensured that prompt formatting and image preprocessing were standardized across the models we evaluated in our paper.

I totally think that we should refactor/update this API to support other types of VLMs that do more non-trivial preprocessing. Do you have a recommendation in mind (beyond making the full switch to the generic IndexDataset)? What sort of issues have you been running into with integrating Qwen-VL (how can I help)?
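
To make the question concrete, the kind of refactor I could imagine is roughly the following; none of these names exist in the repo today, it is purely a sketch of one option:

# Sketch only: a model-agnostic interface the harness could target if it
# leaned on index-style datasets. These names are illustrative, not real.
from typing import Iterable, Protocol, Tuple


class VLMInterface(Protocol):
    def generate_answer(self, image_path: str, question: str) -> str:
        """Each VLM performs its own prompt formatting and image preprocessing."""
        ...


def run_eval(vlm: VLMInterface, index_dataset: Iterable[Tuple[str, str, str]]) -> dict:
    # index_dataset yields raw (image_path, question, question_id) tuples
    predictions = {}
    for image_path, question, question_id in index_dataset:
        predictions[question_id] = vlm.generate_answer(image_path, question)
    return predictions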

Hannibal046 commented 2 months ago

Hi,

Thank you for the detailed explanation! So far, the only idea I have come up with is switching to IndexDataset, primarily for the following reasons:

Regarding the Qwen-VL model, the official demo below shows that integrating it into vlm-evaluation is not very straightforward:


from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (optional)
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# First dialogue turn: the image is referenced by URL/path inside the prompt
# string built by the tokenizer; there is no separate image_processor call
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'What is this?'}
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # Example output in Chinese, showing a woman playing with a dog on the beach.

# Second dialogue turn: ask the model to ground the answer with a bounding box
# (prompt translated from the Chinese original in the official demo)
response, history = model.chat(tokenizer, 'Draw a box around the high-five in the picture', history=history)
print(response)
# Render the predicted bounding box onto the most recent image in the history
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
  image.save('1.jpg')
else:
  print("No bounding box found.")

I fully respect your work and am only offering a suggestion for your consideration. Any further discussion is welcome!