Question about the Dataset Type

Hannibal046 commented 7 months ago

Hi, Teams, Thanks so much for this great work! I am currently trying to add more trained VL models to this repo. And I have a question about the Dataset Type used in evaluation. In textvqa.py, there are two dataset types: https://github.com/TRI-ML/vlm-evaluation/blob/2092905d392e8dbedf01ed4b853df530e3cf9f35/vlm_eval/tasks/harnesses/textvqa.py#L123 https://github.com/TRI-ML/vlm-evaluation/blob/2092905d392e8dbedf01ed4b853df530e3cf9f35/vlm_eval/tasks/harnesses/textvqa.py#L142

And the currently used in the latter. However, I encounter several problems when trying to adapt different VLMs to this framework because different VLMs have different implementation of how image is encoded and how encoded visual features is integrated into the LLM. And some models doesn't even have explicit image_processor (e.g. QWen-VL). So my question is,

Why MapDataset is preferred here? Because I believe a more general solution is to give batched questions and images to VLM's generate_answers methods and let the model itself decide how to process (just like every VLMs demo code shows).

siddk commented 7 months ago

Hey @Hannibal046 -- so this is a great question. Ultimately, this was just a design choice we made, folding all preprocessing/formatting logic into the MapDataset classes to just make sure that prompt formatting and image preprocessing was standardized across the models we evaluated in our paper.

I totally think that we should refactor/update this API to support other types of VLMs that do more non-trivial preprocessing. Do you have a recommendation in mind (beyond making the full switch to the generic IndexDataset)? What sort of issues have you been running into with integrating Qwen-VL (how can I help)?

Hannibal046 commented 7 months ago

Hi,

Thank you for the detailed explanation! Currently I could only come up with the idea to switch to IndexDataset primarily for the following reasons:

Data-Model Separation: This approach further separates data and model. Each task runner would only need to supply a prompt and an image path. This means we won't have to modify every task runner each time a processing method is integrated. https://github.com/TRI-ML/vlm-evaluation/blob/2092905d392e8dbedf01ed4b853df530e3cf9f35/vlm_eval/tasks/harnesses/textvqa.py#L168-L176
Convenience for VLM Demos: Each VLM demo manages scenarios where an image path and a prompt are provided, making it convenient to integrate additional models, even though this might be a less standardized approach. I've successfully integrated several models in this framework, which you can see here.

Regarding the Qwen-VL model, the official demo shows that its integration into the vlm-eval is not very straightforward:


from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# First dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'What is this?'}
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # Example output in Chinese, showing a woman playing with a dog on the beach.

# Second dialogue turn
response, history = model.chat(tokenizer, 'Mark the clapping position in the picture', history=history)
print(response)
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
  image.save('1.jpg')
else:
  print("No bounding box found.")

I fully respect your work and only propose a suggestion for your consideration. Any further discussion is welcomed!

TRI-ML / vlm-evaluation

Question about the Dataset Type #8