AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Can large-scale pretraining achieve truly open-vocabulary detection? #484

Open wangzishuo029 opened 1 month ago

wangzishuo029 commented 1 month ago

Recent works like YOLO-World and GroundingDINO are mainly pretrained on Objects365 and GoldG. These methods do not use a CLIP image encoder as the backbone (unlike open-vocabulary detection methods such as CORA and F-VLM, which do). But the vocabulary of the Objects365 dataset is still limited. So can YOLO-World detect objects beyond its pretraining data? Is YOLO-World a truly open-vocabulary detector?

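For context on why these detectors are called "open-vocabulary" at all: the detection head scores region embeddings against text embeddings of the class names, so any name the text encoder can embed is in principle a candidate class. A toy sketch of that matching step (plain NumPy with made-up shapes; this is an illustration, not YOLO-World's actual code):

```python
import numpy as np

def region_text_scores(region_embs, text_embs):
    """Score every region against every class-name embedding.

    region_embs: (num_regions, d) visual embeddings from the detection head
    text_embs:   (num_classes, d) text-encoder embeddings of the class names
    Returns a (num_regions, num_classes) cosine-similarity matrix; the
    per-region argmax gives the predicted class.
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return r @ t.T

# Toy numbers: 2 regions, 3 class prompts, 4-dim embeddings.
rng = np.random.default_rng(0)
scores = region_text_scores(rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
predicted = scores.argmax(axis=1)  # one class index per region
```

The mechanism itself places no limit on the vocabulary; whether the scores are *reliable* for names absent from Objects365/GoldG is exactly the question raised above, since the quality of the region embeddings depends on pretraining coverage.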

YonghaoHe commented 1 month ago

No, the performance is limited.

FantasticZihao commented 3 days ago

> No, the performance is limited.

I am new to this area, so I would like to ask: when testing on the LVIS dataset, how should the classes be set? Should I keep the training vocabulary, or change it to the LVIS classes?
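On the LVIS question: in the usual zero-shot evaluation protocol, the test vocabulary is set to the LVIS class names at inference time; only the text embeddings change, and the detector is not retrained. A minimal sketch of that vocabulary swap, with a toy stand-in text encoder (all names, shapes, and the encoder here are assumptions for illustration, not the repo's API):

```python
import numpy as np

EMB_DIM = 4  # toy embedding size; real models use e.g. 512

def encode_text(names):
    """Stand-in for the frozen text encoder (e.g. CLIP's text tower).
    Seeds a generator from each name so embeddings are deterministic."""
    def emb(name):
        seed = sum(ord(c) for c in name)
        return np.random.default_rng(seed).normal(size=EMB_DIM)
    return np.stack([emb(n) for n in names])

def classify_regions(region_embs, class_names):
    """Re-embed the current vocabulary and pick the best class per region.
    Swapping class_names changes the vocabulary with no retraining."""
    t = encode_text(class_names)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    return (r @ t.T).argmax(axis=1)

# The same regions can be scored against the training vocabulary ...
train_vocab = ["person", "car", "dog"]
# ... or against a test vocabulary (e.g. the 1203 LVIS class names).
test_vocab = ["zebra", "birdhouse", "person"]

# Pretend one region embedding matches "person" exactly.
region = encode_text(["person"])
idx_train = classify_regions(region, train_vocab)[0]  # index into train_vocab
idx_test = classify_regions(region, test_vocab)[0]    # index into test_vocab
```

So for LVIS evaluation you would change the class list to the LVIS categories; the predicted indices then refer to whichever vocabulary is active at test time.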