FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0
552 stars, 58 forks

Can region proposer discover all regions of interest? #26

Closed 2285443514 closed 1 month ago

2285443514 commented 1 month ago

I noticed that all user input boxes and output boxes are mapped to boxes produced by the region proposer, by picking the proposal with the maximum IoU. What happens if an input or output box is not among the region proposer's outputs? Can the region proposer guarantee that it covers all possible regions?
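To make the concern concrete, here is a minimal sketch of max-IoU matching. It is not taken from the Groma codebase: the function names `iou` and `match_to_proposal` and the `(x1, y1, x2, y2)` box format are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_to_proposal(box, proposals):
    """Return (index, IoU) of the proposal that overlaps `box` the most.

    Hypothetical helper: a proposal is always returned, even when the
    best IoU is 0, which is exactly the failure case asked about.
    """
    scores = [iou(box, p) for p in proposals]
    best = max(range(len(proposals)), key=scores.__getitem__)
    return best, scores[best]


proposals = [(0, 0, 10, 10), (20, 20, 40, 40)]
covered = match_to_proposal((1, 1, 9, 9), proposals)          # good overlap
missed = match_to_proposal((100, 100, 110, 110), proposals)   # no overlap at all
```

In this sketch the second query still gets matched to a proposal, but with an IoU of 0, so the matched region has nothing to do with the requested box.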

machuofan commented 1 month ago

The user input boxes are passed directly to the region encoder. Groma merges the regions detected by the region proposer with the user input boxes before NMS. To ensure that user input boxes are not filtered out, we assign a confidence score of 1 to these boxes during NMS.

The region proposer is designed to discover most of the potential target regions. Still, it is possible that some small objects or object parts are missed by the proposer.
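The merge-then-NMS step described in this reply can be sketched roughly as follows. This is an illustrative approximation, not Groma's actual implementation: the function names `nms` and `merge_regions`, the greedy NMS variant, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than `iou_thresh`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


def merge_regions(proposals, proposal_scores, user_boxes, iou_thresh=0.5):
    """Merge user boxes with proposals before NMS (hypothetical helper).

    User boxes get score 1.0 and are listed first, so the stable sort in
    `nms` visits them before any proposal, including one that also scores 1.0.
    """
    boxes = list(user_boxes) + list(proposals)
    scores = [1.0] * len(user_boxes) + list(proposal_scores)
    return [boxes[i] for i in nms(boxes, scores, iou_thresh)]


proposals = [(0, 0, 10, 10), (20, 20, 40, 40)]
proposal_scores = [0.9, 0.8]
user_boxes = [(1, 1, 9, 9)]
regions = merge_regions(proposals, proposal_scores, user_boxes)
```

Here the user box survives and the overlapping proposal `(0, 0, 10, 10)` is suppressed instead, while the unrelated proposal is kept. One caveat of this simplified version: two user boxes that overlap each other heavily could still suppress one another, since both carry score 1.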