Hey @Hannibal046 -- this is a great question. Ultimately, this was a design choice we made: folding all preprocessing/formatting logic into the MapDataset classes ensured that prompt formatting and image preprocessing were standardized across the models we evaluated in our paper.
I do think we should refactor/update this API to support other types of VLMs that do more non-trivial preprocessing. Do you have a recommendation in mind (beyond making the full switch to the generic IndexDataset)? What sort of issues have you been running into with integrating Qwen-VL (how can I help)?
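For reference, the rough distinction we're talking about looks something like this (a simplified sketch; the class names and signatures below are illustrative, not the exact classes in the repo):

```python
# Simplified sketch of the two dataset styles discussed here; names and
# signatures are illustrative, not the repo's exact API.
from pathlib import Path
from typing import Callable, Dict, List, Tuple

from PIL import Image
from torch.utils.data import Dataset


class MapStyleVQADataset(Dataset):
    """Current approach: return fully preprocessed pixel tensors + formatted prompts."""

    def __init__(self, examples: List[Dict], image_processor: Callable, prompt_fn: Callable[[str], str]) -> None:
        self.examples, self.image_processor, self.prompt_fn = examples, image_processor, prompt_fn

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> Tuple:
        ex = self.examples[idx]
        # Image preprocessing and prompt formatting happen here, outside the model.
        pixel_values = self.image_processor(Image.open(ex["img_path"]).convert("RGB"))
        return ex["question_id"], self.prompt_fn(ex["question"]), pixel_values


class IndexStyleVQADataset(Dataset):
    """Generic alternative: return only raw fields; the VLM decides how to preprocess."""

    def __init__(self, examples: List[Dict]) -> None:
        self.examples = examples

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> Tuple[int, str, Path]:
        ex = self.examples[idx]
        return ex["question_id"], ex["question"], Path(ex["img_path"])
```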
Hi,
Thank you for the detailed explanation! Currently, the only idea I can come up with is to switch to IndexDataset, primarily for the following reason: the official Qwen-VL demo shows that integrating it into vlm-eval is not very straightforward:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# First dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'What is this?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # Example output in Chinese, showing a woman playing with a dog on the beach.

# Second dialogue turn
response, history = model.chat(tokenizer, 'Mark the clapping position in the picture', history=history)
print(response)

image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('1.jpg')
else:
    print("No bounding box found.")
```
I fully respect your work and am only offering a suggestion for your consideration. Any further discussion is welcome!
Hi team, thanks so much for this great work! I am currently trying to add more trained VL models to this repo, and I have a question about the dataset type used in evaluation. In textvqa.py, there are two dataset types: https://github.com/TRI-ML/vlm-evaluation/blob/2092905d392e8dbedf01ed4b853df530e3cf9f35/vlm_eval/tasks/harnesses/textvqa.py#L123 and https://github.com/TRI-ML/vlm-evaluation/blob/2092905d392e8dbedf01ed4b853df530e3cf9f35/vlm_eval/tasks/harnesses/textvqa.py#L142, and the one currently used is the latter. However, I run into several problems when trying to adapt different VLMs to this framework, because different VLMs implement image encoding and the integration of visual features into the LLM differently, and some models don't even have an explicit image_processor (e.g. Qwen-VL). So my question is: why is MapDataset preferred here? I believe a more general solution would be to pass batched questions and images to each VLM's generate_answers method and let the model itself decide how to process them (just like every VLM's demo code shows); see the sketch below.
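Concretely, the kind of evaluation loop I have in mind would look roughly like this (a minimal sketch; `vlm` is any model exposing a generate_answers(questions, image_paths) method, `index_dataset` yields raw (question_id, question, image_path) tuples, and all names are only illustrative):

```python
# Minimal sketch of the proposed evaluation loop; names are illustrative only.
from torch.utils.data import DataLoader


def evaluate(vlm, index_dataset, batch_size: int = 8):
    # Keep raw Python objects in each batch; no tensor collation, since
    # preprocessing is model-specific and happens inside generate_answers().
    loader = DataLoader(index_dataset, batch_size=batch_size, collate_fn=lambda batch: batch)
    results = {}
    for batch in loader:
        question_ids, questions, image_paths = zip(*batch)
        answers = vlm.generate_answers(list(questions), [str(p) for p in image_paths])
        results.update(dict(zip(question_ids, answers)))
    return results
```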