Open DianCh opened 1 year ago
Hi! May I ask why you chose BERT as your text encoder? Why didn't you use the text encoder from CLIP? Thank you!
The BERT encoder is better suited for the grounding task (which needs multiple text encodings), while the CLIP text encoder is usually used for referring tasks.
@rentainhe Thanks! You mentioned in the paper that the BERT encoder has a sequence length limit. How, then, do you evaluate on datasets with so many classes that concatenating all the class names would exceed that limit, for example LVIS?
We split the LVIS classes into multiple chunks and run inference once per chunk, then merge the per-chunk outputs to compute the final results.
Because of this, evaluating on LVIS takes much longer than evaluating on COCO.
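For reference, here is a minimal sketch of what that chunked evaluation could look like. The names `chunk_classes`, `model.predict`, and the chunk size are hypothetical illustrations, not the repo's actual API; the constraint being worked around is BERT's 512-token input limit.

```python
# Minimal sketch of chunked inference over a long class list.
# `model.predict(image, caption)` is a hypothetical call standing in for
# one forward pass of a grounding model; names here are illustrative.

def chunk_classes(class_names, chunk_size=80):
    """Split a long class list into chunks small enough that the joined
    caption stays under the text encoder's sequence length limit."""
    for i in range(0, len(class_names), chunk_size):
        yield class_names[i:i + chunk_size]

def evaluate_chunked(model, image, class_names):
    all_detections = []
    for chunk in chunk_classes(class_names):
        # Join the class names of this chunk into a single caption,
        # following the common " . "-separated convention for grounding.
        caption = " . ".join(chunk) + " ."
        # One forward pass per chunk, so total cost scales with the
        # number of chunks -- hence LVIS evaluation is slower than COCO.
        all_detections.extend(model.predict(image, caption))
    # Merge per-chunk detections into one result set for this image.
    return all_detections
```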