The grounding ability of the fine-tuned model still falls short of production requirements and lags significantly behind the CogAgent model.
Examples:
{"query": " In the photograph, could you pinpoint the location of \"ACHADOS E PERDIDOS\" and tell me its bounding boxes?", "label": "The bounding box is [475, 12, 578, 28]", "response": "The bounding box is [528, 69, 643, 119]"}
{"query": " In, can you guide me to the location of \"THE BIG WAVES JOURNAL\" by providing bounding boxes?", "label": "The bounding box is [522, 0, 628, 82]", "response": "The bounding box is [593, 88, 680, 123]"}
{"query": " Help me to locate \"Vinyl Fencing\" in and give me its bounding boxes, please.", "label": "The bounding box is [329, 803, 375, 819]", "response": "The bounding box is [352, 839, 423, 857]"}
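One way to put a number on the gap in the samples above is intersection-over-union (IoU) between each label box and the predicted box. The following is a minimal sketch, not the evaluation actually used here. It assumes the boxes are `[x1, y1, x2, y2]` corners in a shared coordinate space, which the samples do not state explicitly. `parse_box` and `iou` are hypothetical helper names introduced for illustration.

```python
import re

# Matches the first "[a, b, c, d]" integer list in a string.
BOX_PATTERN = re.compile(
    r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\]"
)


def parse_box(text):
    """Extract a bounding box as a 4-tuple of ints from a label/response string."""
    match = BOX_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no bounding box found in: {text!r}")
    return tuple(int(group) for group in match.groups())


def iou(box_a, box_b):
    """Intersection-over-union, ASSUMING both boxes are [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent on each axis, clamped to zero when the boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The label/response pairs from the three samples above.
records = [
    {"label": "The bounding box is [475, 12, 578, 28]",
     "response": "The bounding box is [528, 69, 643, 119]"},
    {"label": "The bounding box is [522, 0, 628, 82]",
     "response": "The bounding box is [593, 88, 680, 123]"},
    {"label": "The bounding box is [329, 803, 375, 819]",
     "response": "The bounding box is [352, 839, 423, 857]"},
]

scores = [iou(parse_box(r["label"]), parse_box(r["response"])) for r in records]
for record, score in zip(records, scores):
    print(parse_box(record["label"]), parse_box(record["response"]), f"IoU={score:.3f}")
```

Under that coordinate assumption, none of the three predictions overlaps its label at all: in each pair the predicted box starts below the bottom edge of the label box, so the IoU is 0 for every sample.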