OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. A commercially usable open-source multimodal chat model approaching GPT-4o performance.
https://internvl.github.io/
MIT License
4.37k stars, 338 forks

How was InternVL2-26B evaluated? #330

Open Iven2132 opened 2 weeks ago

Iven2132 commented 2 weeks ago

Hi, I'm confused. I tried some visual question answering with the InternVL2-26B model and it performed very badly. The only models that pass that question are Gemini 1.5 Pro/Flash, GPT-4o, and Claude.

So how was InternVL2-26B evaluated, such that it outperforms GPT-4V and Gemini 1.5?

ErfeiCui commented 2 weeks ago

You can refer to our documentation for model evaluation: https://github.com/OpenGVLab/InternVL/blob/main/README.md#documents. Also, if possible, could you provide some examples of errors?

noahdasanaike commented 2 weeks ago

Also getting poor performance here. As an example, when prompted,

Analyze the voting results table image and return a JSON object with this structure: {\"candidates\": [{\"name\": \"Candidate Name\", \"votes\": [{\"polling_station\": number, \"votes\": number}, ...]},...]} Extract votes for each polling station for all candidates. The table may be rotated; use the polling station numbers to determine the correct order of votes.

on the following image,

[image: test]

it returns made-up numbers and candidate names.
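For anyone trying to reproduce this, here is a minimal sketch of how a reply could be checked against the structure requested in the prompt above. The names `validate_votes_payload` and `_is_number` are hypothetical helpers written for this example and are not part of InternVL. The check only covers the shape of the JSON; it cannot tell whether the names and numbers actually match the image, which would need a hand-labelled ground truth.

```python
import json


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_votes_payload(raw: str) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is right."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return ["top level is not an object"]
    candidates = data.get("candidates")
    if not isinstance(candidates, list):
        return ["'candidates' is missing or not a list"]

    problems = []
    for i, cand in enumerate(candidates):
        if not isinstance(cand, dict) or not isinstance(cand.get("name"), str):
            problems.append(f"candidates[{i}]: missing string 'name'")
            continue
        votes = cand.get("votes")
        if not isinstance(votes, list):
            problems.append(f"candidates[{i}]: 'votes' is not a list")
            continue
        for j, row in enumerate(votes):
            ok = (
                isinstance(row, dict)
                and _is_number(row.get("polling_station"))
                and _is_number(row.get("votes"))
            )
            if not ok:
                problems.append(
                    f"candidates[{i}].votes[{j}]: expected numeric "
                    "'polling_station' and 'votes'"
                )
    return problems
```

Passing the model's raw response string to `validate_votes_payload` separates "the model ignored the format" from "the model followed the format but read the table wrong", which are different failure modes to report.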

I'm running with

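# load_image is presumably the preprocessing helper from the InternVL2 model card; its definition is not shown here
pixel_values = load_image('test.png').to(torch.bfloat16).cuda()

# Greedy decoding: a single beam with sampling disabled
generation_config = dict(
    num_beams=1,
    max_new_tokens=8096,
    do_sample=False,
)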