FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0

Bounding box format #21

Closed liukc19 closed 4 months ago

liukc19 commented 4 months ago

Thank you for your great work. I found that Groma performs poorly on some object detection tasks, which makes it hard to tell, when using Groma for complex VQA tasks, whether a failure comes from the inference stage or the detection stage. Do you plan to fix the bug of the misaligned bounding-box format when extracting region features?

machuofan commented 4 months ago

Hi there, you can check whether the target boxes are missed by the region proposer by commenting out the following lines, which draw every single box detected by DDETR after NMS and score thresholding.
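To make the suggested check concrete, here is a minimal pure-Python sketch of the filtering step: keep every box that survives score thresholding and greedy NMS, then draw the survivors to see if the target object was dropped by the region proposer. The function names and thresholds are illustrative stand-ins, not Groma's actual code.

```python
def iou(a, b):
    # a, b: [x1, y1, x2, y2] boxes; returns intersection-over-union
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def surviving_boxes(boxes, scores, score_thresh=0.3, iou_thresh=0.5):
    """Score-threshold then greedy-NMS a list of [x1, y1, x2, y2] boxes.

    Returns the boxes you would draw to check for missed targets.
    """
    # drop low-confidence boxes, then sort the rest by score (descending)
    cands = sorted(
        ((b, s) for b, s in zip(boxes, scores) if s > score_thresh),
        key=lambda p: p[1], reverse=True)
    kept = []
    for box, _ in cands:
        # keep a box only if it does not heavily overlap an already-kept one
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept
```

If a ground-truth object never appears among the surviving boxes, the error is in the detection stage rather than the LLM inference stage.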

I have tried re-training the model after changing the box format from [cx, cy, w, h] to [x1, y1, x2, y2] for the roi_align input, as reported in #11. But performance decreased on all benchmarks. I am still working out how this could happen...
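For context on the format change discussed here: DETR-style detectors typically emit boxes as center-size [cx, cy, w, h] (often normalized to [0, 1]), while torchvision's roi_align expects corner-format [x1, y1, x2, y2] boxes, so feeding the former in unconverted crops the wrong region. A minimal sketch of the conversion (the function name and rescaling arguments are illustrative, not Groma's actual code):

```python
def cxcywh_to_xyxy(box, img_w=1.0, img_h=1.0):
    """Convert a [cx, cy, w, h] box to [x1, y1, x2, y2].

    If the input is normalized to [0, 1], pass the image width/height
    to rescale to absolute pixel coordinates; with the defaults the
    coordinates are left in normalized space.
    """
    cx, cy, w, h = box
    return [(cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h]
```

For example, a normalized center box [0.5, 0.5, 0.5, 0.5] on a 100x100 image maps to the pixel-space corners [25.0, 25.0, 75.0, 75.0].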

Ethan-yh commented 3 months ago

Hi, did you find out why the performance of the model decreased after aligning the bounding boxes? That is really weird. @machuofan