TRI-ML / vlm-evaluation

VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning

[Question] mismatch between bbox and image in RefCOCO #16

Open WeitaiKang opened 1 month ago

WeitaiKang commented 1 month ago

Hi authors,

Thanks for your great work!

However, for the Visual Grounding evaluation (RefCOCO/+/g), I find that the coordinates of your normalized bbox do not match the image as processed by LLaVA 1.5.

Specifically, your code normalizes the bbox by the original image size. However, the image then goes through LetterBoxPad and is resized to 336px. Therefore, the normalized bbox coordinates don't match the pixel coordinates of the image that LLaVA actually receives.
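To make the mismatch concrete, here is a minimal sketch of the two normalizations. It assumes the letterbox step pads the image to a centered square before resizing, and that boxes are given as (x1, y1, x2, y2) in pixels of the original image. The function names are hypothetical and are not taken from either codebase.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def normalize_by_original(bbox: BBox, width: int, height: int) -> BBox:
    """Normalize a pixel bbox by the original image width and height."""
    x1, y1, x2, y2 = bbox
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def normalize_by_letterbox(bbox: BBox, width: int, height: int) -> BBox:
    """Normalize a pixel bbox relative to the square-padded image.

    Assumes centered padding to a square of side max(width, height).
    """
    side = max(width, height)
    x_off = (side - width) / 2   # padding added on the left
    y_off = (side - height) / 2  # padding added on the top
    x1, y1, x2, y2 = bbox
    return (
        (x1 + x_off) / side,
        (y1 + y_off) / side,
        (x2 + x_off) / side,
        (y2 + y_off) / side,
    )


# Hypothetical 640x480 image with a box at (100, 120, 300, 360).
original = normalize_by_original((100, 120, 300, 360), 640, 480)
letterboxed = normalize_by_letterbox((100, 120, 300, 360), 640, 480)
```

For this landscape example the x coordinates agree, but the y coordinates differ (0.25 and 0.75 versus 0.3125 and 0.6875). The uniform resize to 336px does not change normalized coordinates, so under these assumptions the padding alone causes the mismatch.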

Isn't this a problem? Is this the same way LLaVA generates its training data?

WeitaiKang commented 1 month ago

According to an issue in the LLaVA codebase, the bboxes in their training data account for the LetterBoxPad and the 336px resize. Therefore, I think your preparation of the ground-truth bboxes might not be correct. What do you think?