Florence-2 vs Grounding DINO + SAM2

IDEA-Research / Grounded-SAM-2

Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2

Apache License 2.0

701 stars 48 forks

Hi @radames

Your observation is very thorough, and the questions you've raised are highly valuable.

We haven't benchmarked the two approaches implemented in this repo ourselves, but I believe each of these models currently has its own strengths.

Grounding DINO 1.5 has a stronger zero-shot detection capability than Florence-2: it achieves 54.3 AP on COCO and 55.7 AP on LVIS minival in the zero-shot setting, while Florence-2 achieves 43.4 AP on the COCO zero-shot benchmark.

However, after training on the FLD-5B dataset, Florence-2 can not only localize the main phrases in a caption but also has a strong referring capability. You can refer to the following table:

It can also serve as a foundation model that users can fine-tune for their specific scenarios.
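
To make the caption grounding and referring capabilities mentioned above more concrete, here is a minimal sketch of how Florence-2 could be prompted for those tasks through Hugging Face `transformers`. The task tokens and the generate/post-process calls follow the `microsoft/Florence-2-large` model card as far as I know; the helper names `build_prompt` and `run_florence2` are hypothetical and are not part of this repo or of the library.

```python
# Florence-2 task tokens for tasks that take extra text input
# (taken from the microsoft/Florence-2-large model card; treat as an assumption).
TEXT_TASKS = {
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "open_vocab_detection": "<OPEN_VOCABULARY_DETECTION>",
}


def build_prompt(task: str, text: str) -> str:
    """Hypothetical helper: prepend the task token to the user's text."""
    if task not in TEXT_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TEXT_TASKS)}")
    return TEXT_TASKS[task] + text.strip()


def run_florence2(image, task: str, text: str,
                  model_id: str = "microsoft/Florence-2-large"):
    """Hypothetical helper: run one Florence-2 task on a PIL image.

    Imports are local so that build_prompt stays usable without
    transformers installed. The model weights are downloaded on first use.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    prompt = build_prompt(task, text)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Post-processing is keyed by the bare task token, not the full prompt.
    return processor.post_process_generation(
        generated_text,
        task=TEXT_TASKS[task],
        image_size=(image.width, image.height),
    )
```

For example, `run_florence2(image, "phrase_grounding", "a dog on a couch")` should return the grounded phrases with their boxes, which could then be passed to SAM 2 as box prompts. How the boxes are handed over depends on the demo script you use, so that step is left out here.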

Florence-2 vs Grounding DINO + SAM2 #34