InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

The model's grounding capability is still unsatisfactory #329

Open WeiminLee opened 3 weeks ago

WeiminLee commented 3 weeks ago

The grounding ability of the fine-tuned model still falls short of production requirements and lags significantly behind the CogAgent model.

Examples:

68e2fb4e6c95e66c829f8992aa6fb5a1

{"query": " In the photograph, could you pinpoint the location of \"ACHADOS E PERDIDOS\" and tell me its bounding boxes?", "label": "The bounding box is [475, 12, 578, 28]", "response": "The bounding box is [528, 69, 643, 119]"}

a44b4236091c5d3169ae89a3d4e815a2

{"query": " In, can you guide me to the location of \"THE BIG WAVES JOURNAL\" by providing bounding boxes?", "label": "The bounding box is [522, 0, 628, 82]", "response": "The bounding box is [593, 88, 680, 123]"}

fc90916946817c44ca102f46343a3698

{"query": " Help me to locate \"Vinyl Fencing\" in and give me its bounding boxes, please.", "label": "The bounding box is [329, 803, 375, 819]", "response": "The bounding box is [352, 839, 423, 857]"}
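To put a number on the gap between label and response, one option is to compute the intersection over union (IoU) for each pair. The sketch below is not part of the original report. It assumes the four numbers are corner coordinates in `[x1, y1, x2, y2]` order, which the issue does not state, and `parse_box` and `iou` are hypothetical helper names.

```python
import re

# Matches the first "[a, b, c, d]" list of integers in a string.
BOX_PATTERN = re.compile(
    r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\]"
)


def parse_box(text):
    """Return the first bounding box found in text as a 4-tuple, or None."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def iou(box_a, box_b):
    """Intersection over union of two boxes.

    Assumption: boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Label/response pairs copied from the three examples above.
records = [
    {"label": "The bounding box is [475, 12, 578, 28]",
     "response": "The bounding box is [528, 69, 643, 119]"},
    {"label": "The bounding box is [522, 0, 628, 82]",
     "response": "The bounding box is [593, 88, 680, 123]"},
    {"label": "The bounding box is [329, 803, 375, 819]",
     "response": "The bounding box is [352, 839, 423, 857]"},
]

scores = [
    iou(parse_box(record["label"]), parse_box(record["response"]))
    for record in records
]
for record, score in zip(records, scores):
    print(f"{record['label']} vs {record['response']}: IoU = {score:.3f}")
```

Under that coordinate assumption, none of the three predicted boxes overlaps its label vertically, so each pair scores an IoU of 0. If the boxes use a different convention (for example `[x, y, w, h]`), the helper would need to convert them first.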