Grounding-DINO occupies the majority of Grounded-SAM's processing time.

IDEA-Research / Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything

https://arxiv.org/abs/2401.14159

Apache License 2.0


Grounding-DINO occupies the majority of Grounded-SAM's processing time. #421

Open xiaobanni opened 9 months ago

xiaobanni commented 9 months ago

Thank you for your excellent work on the Grounded-Segment-Anything project. I've noticed that developers have recently incorporated various advanced SAM models, such as Efficient-SAM and RepViT-SAM. However, it appears that the Grounding-DINO module consumes most of the processing time in Grounded-SAM. As illustrated in the attached picture, while MobileSAM takes only 0.05s, Grounding-DINO requires 1.70s, which is significantly longer. Are there any plans to optimize the Grounding-DINO module, or is there an already available off-the-shelf solution?
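For anyone who wants to reproduce this kind of per-stage breakdown, here is a minimal timing sketch. It is not code from the repository: `stage_timer`, `fake_detector`, and `fake_segmenter` are hypothetical names, and the two stand-in functions only sleep to imitate a slow detector and a fast segmenter. When timing real GPU models, pass a synchronization callable such as `torch.cuda.synchronize` as `sync`, otherwise asynchronously queued work can be attributed to the wrong stage.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name, timings, sync=None):
    """Accumulate wall-clock seconds for one pipeline stage into `timings`.

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) that is called before and after the stage.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


def fake_detector():
    # Hypothetical stand-in for the Grounding-DINO call (the slow stage).
    time.sleep(0.05)
    return [(10, 20, 110, 220)]


def fake_segmenter(boxes):
    # Hypothetical stand-in for the SAM / MobileSAM call (the fast stage).
    time.sleep(0.005)
    return [f"mask for {box}" for box in boxes]


def profile_once():
    """Run both stages once and return (timings, masks)."""
    timings = {}
    with stage_timer("detector", timings):
        boxes = fake_detector()
    with stage_timer("segmenter", timings):
        masks = fake_segmenter(boxes)
    return timings, masks


if __name__ == "__main__":
    timings, _ = profile_once()
    total = sum(timings.values())
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.3f}s ({100 * seconds / total:.0f}% of total)")
```

Averaging over several images after a warm-up run usually gives more stable numbers than a single call, since the first inference often includes one-time setup cost.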

rentainhe commented 9 months ago

Hello! For now, we do not have a smaller version of Grounding-DINO. You may replace Grounding-DINO with another lightweight open-world detection model from the community as the box prompt generator.
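One way to make such a swap easy is to treat the detector as a plain callable that returns boxes, so the segmenter never needs to know which model produced them. The sketch below is only an illustration under that assumption: `BoxPromptPipeline`, `heavy_detector`, `light_detector`, and `toy_segmenter` are hypothetical names, not APIs of Grounded-SAM, Grounding-DINO, or SAM, and the stand-ins return fixed values instead of running a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

# Hypothetical signatures: any detector maps (image, text prompt) to boxes,
# and any segmenter maps (image, boxes) to one mask per box.
Detector = Callable[[object, str], List[Box]]
Segmenter = Callable[[object, Sequence[Box]], List[str]]


@dataclass
class BoxPromptPipeline:
    """Detect boxes from text, then segment inside each box."""

    detector: Detector
    segmenter: Segmenter

    def run(self, image, text_prompt):
        boxes = self.detector(image, text_prompt)
        if not boxes:
            # Nothing was detected, so there is nothing to segment.
            return [], []
        return boxes, self.segmenter(image, boxes)


def heavy_detector(image, text_prompt):
    # Stand-in for a large open-set detector.
    return [(0.0, 0.0, 50.0, 50.0)]


def light_detector(image, text_prompt):
    # Stand-in for a lighter replacement with the same input and output.
    return [(1.0, 1.0, 49.0, 49.0)]


def toy_segmenter(image, boxes):
    # Stand-in for SAM: one placeholder mask per box.
    return [f"mask{i}" for i, _ in enumerate(boxes)]


if __name__ == "__main__":
    for detector in (heavy_detector, light_detector):
        pipeline = BoxPromptPipeline(detector=detector, segmenter=toy_segmenter)
        boxes, masks = pipeline.run(image=None, text_prompt="dog")
        print(detector.__name__, boxes, masks)
```

With this shape, trying a different detector means writing one adapter function that converts its output to the `Box` format, while the segmentation side stays unchanged.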

xiaobanni commented 9 months ago

@rentainhe Thank you for your quick and friendly response. I am not a professional in the field of image segmentation; I just want to use this technology in downstream applications. After some research, I did not find any clearly usable alternatives to Grounding-DINO. Could you recommend some potential solutions for me to try? I also found that this need might be common, as evidenced by the widespread discussion in the following link.

HaoqianSong commented 3 months ago

Does GLIP offer the same functionality and performance? Compared with Grounding-DINO, can GLIP be seen as a combination of the Grounding-DINO detector and BLIP? GLIP seems to support arbitrary text retrieval and object localization. Can it also output a text description of the image?