FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0
483 stars 55 forks

Clarify the bounding box format #11

Closed. nguyenquivinhquang closed this issue 1 month ago

nguyenquivinhquang commented 1 month ago

Hi there,

Thanks for your wonderful work.

I want to ask about the bbox format. According to this line, I guess the input box format is (c_x, c_y, w, h). However, I have checked and found that the input at this line should be in the format x1,x2,y1,y2.

Could you clarify if I misunderstood anything?

machuofan commented 1 month ago

Thanks for pointing this out. I think you are right: the bboxes need to be transformed from center format to corner format at that line. It is surprising that the model functions well with misaligned bbox input. We are working on debugging this. Please stay tuned for future updates.
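
For readers following the thread, here is a minimal sketch of the center-to-corner conversion being discussed. It is not taken from the Groma codebase: the function name `cxcywh_to_xyxy` is hypothetical, and it assumes boxes arrive as (c_x, c_y, w, h) and that the consuming code expects corners ordered as (x1, y1, x2, y2), which may differ from the ordering the repository actually uses.

```python
from typing import List, Sequence


def cxcywh_to_xyxy(boxes: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert boxes from center format (c_x, c_y, w, h) to corner format.

    Hypothetical helper for illustration only. The output ordering
    (x1, y1, x2, y2) is an assumption, not confirmed by the thread.
    """
    converted = []
    for cx, cy, w, h in boxes:
        half_w, half_h = w / 2.0, h / 2.0
        # Top-left corner is the center minus half the size,
        # bottom-right corner is the center plus half the size.
        converted.append([cx - half_w, cy - half_h, cx + half_w, cy + half_h])
    return converted


# Example: a box centered at (10, 20) that is 4 wide and 6 tall.
boxes_center = [[10.0, 20.0, 4.0, 6.0]]
boxes_corner = cxcywh_to_xyxy(boxes_center)
print(boxes_corner)
```

If the boxes are already PyTorch tensors, `torchvision.ops.box_convert(boxes, in_fmt="cxcywh", out_fmt="xyxy")` performs the same conversion without a hand-written helper.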

liukc19 commented 1 month ago

@nguyenquivinhquang Hello, I noticed that you closed this issue. Has it been resolved?