IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

About text encoder #194

Open · DianCh opened this issue 1 year ago

DianCh commented 1 year ago

Hi! May I ask why you chose BERT as your text encoder? Why didn't you use the text encoder from CLIP? Thank you!

rentainhe commented 1 year ago

> Hi! May I ask why you chose BERT as your text encoder? Why didn't you use the text encoder from CLIP? Thank you!

The BERT encoder is better suited to the grounding task, which needs encodings for multiple text phrases in a single caption, while the CLIP text encoder is usually used for referring tasks.
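
For context, this is how multiple category phrases are packed into one grounding caption at inference time. The sketch below follows the demo-style `predict` API from this repo; the config/checkpoint paths, image path, and thresholds are illustrative placeholders:

```python
from groundingdino.util.inference import load_model, load_image, predict

# Placeholder paths: point these at your local config and checkpoint.
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)

# Grounding prompt: several class phrases joined into one caption,
# separated by " . ", so the BERT encoder sees all of them at once.
caption = "chair . person . dog ."

image_source, image = load_image("example.jpg")  # placeholder image path

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=caption,
    box_threshold=0.35,
    text_threshold=0.25,
)
print(phrases)  # each detected box is matched back to one of the phrases
```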

DianCh commented 1 year ago

@rentainhe Thanks! You mentioned in the paper that the BERT encoder has a sequence length limit - so how do you evaluate on datasets with many classes, where concatenating all the class names would exceed that limit, for example LVIS?

rentainhe commented 1 year ago

> @rentainhe Thanks! You mentioned in the paper that the BERT encoder has a sequence length limit - so how do you evaluate on datasets with many classes, where concatenating all the class names would exceed that limit, for example LVIS?

We split the LVIS classes into multiple parts, run inference once per part, and then combine the outputs to compute the final results.

So evaluating on LVIS takes much longer than evaluating on COCO.
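
For anyone wanting to reproduce this kind of chunked evaluation, here is a rough sketch. This is not the repo's actual evaluation script; the chunk size, thresholds, and merging step are illustrative assumptions, and only the `predict` call mirrors the demo API:

```python
from groundingdino.util.inference import predict

def chunk_class_names(class_names, max_names_per_chunk=80):
    """Split the category list into chunks small enough that the
    concatenated caption stays within BERT's sequence length limit.
    The chunk size is an illustrative placeholder, not a repo constant."""
    for i in range(0, len(class_names), max_names_per_chunk):
        yield class_names[i:i + max_names_per_chunk]

def detect_all_classes(model, image, class_names):
    """Run inference once per chunk of class names and merge the outputs."""
    all_boxes, all_scores, all_phrases = [], [], []
    for chunk in chunk_class_names(class_names):
        caption = " . ".join(chunk) + " ."
        boxes, logits, phrases = predict(
            model=model,
            image=image,
            caption=caption,
            box_threshold=0.35,
            text_threshold=0.25,
        )
        all_boxes.append(boxes)
        all_scores.append(logits)
        all_phrases.extend(phrases)
    # Detections from all chunks are pooled here; standard LVIS-style
    # scoring (e.g. keeping the top-k detections per image) would follow.
    return all_boxes, all_scores, all_phrases
```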