[Question] mismatch between bbox and image in RefCOCO

Hi authors,

Thanks for your great work!

However, for the Visual Grounding evaluation (RefCOCO/+/g), I find that the coordinates of your normalized bbox do not match the image as processed by LLaVA1.5.

Specifically, your code normalizes the bbox by the original image size. However, the image then goes through LetterBoxPad and is resized to 336px. As a result, the normalized bbox coordinates do not correspond to the pixel coordinates of the image that LLaVA actually receives.
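
To make the mismatch concrete, here is a minimal sketch of the two normalizations. It is not the repository's code: the function names are hypothetical, and I am assuming the letterbox step pads the shorter side equally on both ends to make a square before the resize to 336px. Since the resize is uniform, it does not change the normalized coordinates, only the padding does.

```python
def normalize_by_original(bbox, width, height):
    """Normalize (x1, y1, x2, y2) by the original image size (hypothetical helper)."""
    x1, y1, x2, y2 = bbox
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def normalize_by_letterbox(bbox, width, height):
    """Normalize (x1, y1, x2, y2) relative to the padded square image.

    Assumption: the image is centered on a square canvas whose side is
    max(width, height), then resized uniformly (e.g. to 336px).
    """
    side = max(width, height)
    pad_x = (side - width) / 2   # padding added on the left
    pad_y = (side - height) / 2  # padding added on the top
    x1, y1, x2, y2 = bbox
    return (
        (x1 + pad_x) / side,
        (y1 + pad_y) / side,
        (x2 + pad_x) / side,
        (y2 + pad_y) / side,
    )


# Example: a 640x480 image with a box covering its central region.
box = (160, 120, 480, 360)
print(normalize_by_original(box, 640, 480))   # (0.25, 0.25, 0.75, 0.75)
print(normalize_by_letterbox(box, 640, 480))  # (0.25, 0.3125, 0.75, 0.6875)
```

In this example the x coordinates agree because the width is already the longer side, but the y coordinates differ, and the gap grows with the aspect ratio of the image.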

Isn't this a problem? Is this the same way LLaVA generated its training data?

TRI-ML / vlm-evaluation