For the Visual Grounding evaluation (RefCOCO/+/g), I find that the coordinates of your normalized bboxes do not match the image as processed by LLaVA-1.5.
Specifically, your code normalizes each bbox by the original image size. However, the image goes through LetterBoxPad and is resized to 336px before it reaches the model. As a result, the normalized bbox coordinates do not correspond to the pixel coordinates of the input image in LLaVA.
Isn't this a problem? Is this the same way LLaVA generates its training data?
According to an issue in the LLaVA codebase, the bboxes in their training data account for LetterBoxPad and the 336px resize. Therefore, I think your preparation of the ground-truth bboxes might not be correct. What do you think?
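To make the mismatch concrete, here is a minimal sketch of the two normalizations. It assumes the padding step centers the image on a square canvas whose side is the longer image edge, with integer offsets, and that the 336px resize is uniform. Both function names are hypothetical and are not taken from either codebase.

```python
def normalize_bbox_original(bbox, width, height):
    """Normalize (x1, y1, x2, y2) by the original image size."""
    x1, y1, x2, y2 = bbox
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def normalize_bbox_letterboxed(bbox, width, height):
    """Normalize (x1, y1, x2, y2) relative to the square-padded image.

    Assumption: the image is centered on a square canvas of side
    max(width, height). A uniform resize to 336px afterwards does not
    change normalized coordinates, so it does not appear here.
    """
    side = max(width, height)
    pad_x = (side - width) // 2   # horizontal offset added by padding
    pad_y = (side - height) // 2  # vertical offset added by padding
    x1, y1, x2, y2 = bbox
    return (
        (x1 + pad_x) / side,
        (y1 + pad_y) / side,
        (x2 + pad_x) / side,
        (y2 + pad_y) / side,
    )


# Example: a 640x480 image, so 80px of padding is added above and below.
bbox = (100, 120, 300, 360)
print(normalize_bbox_original(bbox, 640, 480))     # y-values relative to height 480
print(normalize_bbox_letterboxed(bbox, 640, 480))  # y-values relative to padded side 640
```

Under these assumptions the x-coordinates agree for a landscape image, but the y-coordinates differ (0.25 and 0.75 versus 0.3125 and 0.6875 in this example). The two normalizations coincide only for square images.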