FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0

About grounding output #12

Closed nguyenquivinhquang closed 4 months ago

nguyenquivinhquang commented 4 months ago

Thanks for your wonderful work. I want to ask which part of the code corresponds to the grounded output described in Section 3.2 of the paper. [image attached]

machuofan commented 4 months ago

Thanks for your interest in our work. The code for preparing grounded output can be found in the dataset definitions, e.g., flickr.py.
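
For context, such dataset definitions typically convert annotated (phrase, box) pairs into a caption interleaved with grounding tokens. Below is a minimal sketch of that conversion, assuming phrase spans are sorted and non-overlapping; the token names (`<p>`, `</p>`, `<roi>`, `<rK>`) and the helper `build_grounded_caption` are illustrative assumptions, not the repo's exact implementation.

```python
def build_grounded_caption(caption, phrase_spans, phrase_to_regions):
    """Insert grounding tokens into a plain caption.

    caption:           plain text, e.g. "A man throws a frisbee to his dog."
    phrase_spans:      [(start, end), ...] character spans of grounded phrases,
                       sorted and non-overlapping
    phrase_to_regions: {phrase_index: [region_id, ...]} matched region ids
    """
    pieces, cursor = [], 0
    for i, (start, end) in enumerate(phrase_spans):
        pieces.append(caption[cursor:start])  # untouched text before the phrase
        phrase = caption[start:end]
        regions = "".join(f"<r{k}>" for k in phrase_to_regions.get(i, []))
        # Wrap the grounded phrase and append its matched region token(s).
        pieces.append(f"<p>{phrase}</p><roi>{regions}</roi>")
        cursor = end
    pieces.append(caption[cursor:])  # trailing text after the last phrase
    return "".join(pieces)


print(build_grounded_caption(
    "A man throws a frisbee to his dog.",
    [(0, 5), (13, 22)],
    {0: [3], 1: [7]},
))
# -> "<p>A man</p><roi><r3></roi> throws <p>a frisbee</p><roi><r7></roi> to his dog."
```

The resulting string is what the model is trained to emit as grounded output: free-form text in which each referenced phrase is tied to region tokens produced by the localized visual tokenizer.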