FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0

Referring multiple regions in the image #35

Open Deepayan137 opened 1 week ago

Deepayan137 commented 1 week ago

Hi,

Thank you for your excellent work. I have been experimenting with the run_grom.py file and was wondering whether it is possible to provide multiple region bounding boxes to the model and ask it to describe them together. In the qualitative examples, only one bounding box is provided as input to the model. Can multiple region bounding boxes be provided as input, and if so, could you share a short example of how to do it?

Thank you

machuofan commented 1 week ago

Yes, in principle this framework supports multiple referring regions as input. For example, you can prompt the model with "Please briefly describe <roi><refer_box></roi> <refer_feat> and <roi><refer_box></roi> <refer_feat>" and set the box coordinates for each referred region.

However, you may get unexpected answers, because the provided model has not been trained on data with multiple referring regions as input. Feel free to give it a try anyway.
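The prompt pattern in the reply above can be generated for any number of regions. The snippet below is a minimal sketch, not code from the Groma repository: the helper name build_multi_region_prompt is hypothetical, and only the placeholder string is copied from the reply. It assumes the model consumes one placeholder per box, in the order the boxes are supplied.

```python
# Placeholder string copied verbatim from the maintainer's reply.
REGION_PLACEHOLDER = "<roi><refer_box></roi> <refer_feat>"


def build_multi_region_prompt(num_regions, instruction="Please briefly describe"):
    """Hypothetical helper: repeat the region placeholder once per box.

    Assumption: the i-th placeholder is matched to the i-th box passed
    to the model. This thread does not confirm that ordering.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    placeholders = [REGION_PLACEHOLDER] * num_regions
    if num_regions == 1:
        joined = placeholders[0]
    elif num_regions == 2:
        joined = " and ".join(placeholders)
    else:
        # Three or more regions: "A, B and C"
        joined = ", ".join(placeholders[:-1]) + " and " + placeholders[-1]
    return f"{instruction} {joined}"


if __name__ == "__main__":
    print(build_multi_region_prompt(2))
```

With two regions this reproduces the prompt quoted in the reply. As noted there, the released model was not trained on multi-region inputs, so the answers may still be unreliable.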

Deepayan137 commented 2 days ago

Thank you for the reply. So if we refer to multiple regions, do we pass a list of tensors (normalized bounding box coordinates)?
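The thread does not confirm the exact input format, so the following is only a sketch of what "normalized bounding box coordinates" could look like. It assumes boxes are given in pixels as (x1, y1, x2, y2) corners and normalized to [0, 1] by image width and height. The function name normalize_boxes is hypothetical, and the result would still need converting to whatever tensor type the script expects.

```python
def normalize_boxes(boxes, image_width, image_height):
    """Hypothetical helper: convert pixel boxes to [0, 1] coordinates.

    Assumption: each box is (x1, y1, x2, y2) in pixels, with (x1, y1)
    the top-left corner and (x2, y2) the bottom-right corner.
    """
    normalized = []
    for x1, y1, x2, y2 in boxes:
        # Reject boxes that are empty, inverted, or outside the image.
        if not (0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height):
            raise ValueError(f"invalid box: {(x1, y1, x2, y2)}")
        normalized.append([
            x1 / image_width,
            y1 / image_height,
            x2 / image_width,
            y2 / image_height,
        ])
    return normalized


if __name__ == "__main__":
    # Two regions in a 100 x 200 image, listed in the order they are
    # referred to in the prompt.
    print(normalize_boxes([(0, 0, 50, 100), (50, 100, 100, 200)], 100, 200))
```

Keeping one row per region, in the same order as the placeholders in the prompt, makes it easy to check that each box lines up with the intended reference.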