IDEA-Research / Grounded-SAM-2

Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2
https://arxiv.org/abs/2401.14159
Apache License 2.0

Florence-2 vs Grounding DINO + SAM2 #34

Open radames opened 2 weeks ago

radames commented 2 weeks ago

Hello, thanks for the awesome collection of demos and code. Do you have benchmarks or comparisons of the text-grounded segmentation capabilities of Grounding DINO vs. Florence-2? I've been testing both with SAM 2, and my qualitative impression is that Florence-2 is more precise, matching more tokens to object boundaries, and that its base model (not yet fine-tuned) can also detect a more diverse set of objects. On the other hand, I wasn't able to extract confidence scores for the individual bounding boxes generated by Florence-2.
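A side-by-side comparison like the one described above can be made a little less subjective by matching the boxes the two detectors return for the same image and prompt. The following is a minimal sketch, assuming both models' outputs have already been converted to `(x1, y1, x2, y2)` pixel boxes; the helper names `iou` and `match_boxes` and the 0.5 threshold are illustrative choices, not part of this repo.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(boxes_a: List[Box], boxes_b: List[Box], thresh: float = 0.5):
    """Greedily pair each box in boxes_a with its best unused box in boxes_b.

    Returns (pairs, only_a, only_b): matched (i, j, iou) triples plus the
    indices of boxes that only one of the two detectors produced.
    """
    used = set()
    pairs = []
    for i, a in enumerate(boxes_a):
        best_j, best_iou = -1, thresh
        for j, b in enumerate(boxes_b):
            if j in used:
                continue
            v = iou(a, b)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            used.add(best_j)
            pairs.append((i, best_j, best_iou))
    matched_a = {p[0] for p in pairs}
    only_a = [i for i in range(len(boxes_a)) if i not in matched_a]
    only_b = [j for j in range(len(boxes_b)) if j not in used]
    return pairs, only_a, only_b
```

Boxes that land in `only_a` or `only_b` are the ones worth inspecting by eye, since they show where one model found an object the other missed. The matching is greedy and order-dependent, which is adequate for a quick look but is not an optimal assignment.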

rentainhe commented 2 weeks ago

Hi @radames

Your observation is very thorough, and the questions you've raised are highly valuable.

We haven't benchmarked the two approaches implemented in this repo ourselves, but I believe each of these models currently has its own strengths.

For zero-shot detection, Grounding DINO 1.5 appears stronger than Florence-2: Grounding DINO 1.5 achieves 54.3 AP on COCO and 55.7 AP on LVIS minival in the zero-shot setting, while Florence-2 achieves 43.4 AP on the COCO zero-shot benchmark.
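For context on the AP figures, average precision ranks detections by confidence before accumulating precision and recall, so a detector that returns boxes without scores cannot be evaluated this way directly. The sketch below is a simplified single-class, single-IoU-threshold version using all-point interpolation; it is not the official COCO or LVIS protocol, which averages over several IoU thresholds and categories. The function name and input format are assumptions made for illustration.

```python
from typing import List, Tuple


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """Simplified AP for one class at one IoU threshold.

    detections: (confidence, is_true_positive) pairs, where the TP flag
                was decided beforehand by matching against ground truth.
    num_gt:     number of ground-truth boxes for this class.
    """
    if num_gt == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for k, (_, is_tp) in enumerate(ranked, start=1):
        tp += int(is_tp)
        precisions.append(tp / k)
        recalls.append(tp / num_gt)
    # Precision envelope: each value becomes the best precision
    # achievable at that recall level or any higher one.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

Because the ranking step depends entirely on the confidence values, giving every box the same placeholder score would make the result depend on an arbitrary ordering of the detections.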

However, after training on the FLD-5B dataset, Florence-2 can not only localize the main phrases in a caption but also has strong referring capability; see the following table:

[image: table not reproduced]

It can also serve as a foundation model that users fine-tune for their specific scenarios.