I tested some images, and the grounding ability seems much weaker than the original DINO?

FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization

https://groma-mllm.github.io/

Apache License 2.0

483 stars, 55 forks

Closed: TiantZhang closed this issue 1 month ago

machuofan commented 1 month ago

Could you please provide more details on this issue?