AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Can large-scale pretraining achieve truly open-vocabulary detection? #484

Open wangzishuo029 opened 1 month ago

wangzishuo029 commented 1 month ago

Recent works like YOLO-World and GroundingDINO are mainly pretrained on Objects365 and GoldG. These methods do not use a CLIP image encoder as the backbone (unlike open-vocabulary detection methods such as CORA and F-VLM, which do). But the vocabulary of the Objects365 dataset is still limited. So can YOLO-World detect objects beyond its pretraining data? Is YOLO-World a truly open-vocabulary detector?

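For context on why these detectors are called "open-vocabulary" at all: the detection head scores region embeddings against text embeddings of the class names, so any name the text encoder can embed is in principle a candidate class. A toy sketch of that matching step (plain NumPy with made-up shapes; this is an illustration, not YOLO-World's actual code):

```python
import numpy as np

def region_text_scores(region_embs, text_embs):
    """Score every region against every class-name embedding.

    region_embs: (num_regions, d) visual embeddings from the detection head
    text_embs:   (num_classes, d) text-encoder embeddings of the class names
    Returns a (num_regions, num_classes) cosine-similarity matrix; the
    per-region argmax gives the predicted class.
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return r @ t.T

# Toy numbers: 2 regions, 3 class prompts, 4-dim embeddings.
rng = np.random.default_rng(0)
scores = region_text_scores(rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
predicted = scores.argmax(axis=1)  # one class index per region
```

The mechanism itself places no limit on the vocabulary; whether the scores are *reliable* for names absent from Objects365/GoldG is exactly the question raised above, since the quality of the region embeddings depends on pretraining coverage.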

YonghaoHe commented 1 month ago

No, the performance is limited.

FantasticZihao commented 3 days ago

> No, the performance is limited.

I am new to this area, so I would like to ask: when testing on the LVIS dataset, how should the classes be set? Should I keep the training vocabulary, or change it to the LVIS classes?
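On the LVIS question: in the usual zero-shot evaluation protocol, the test vocabulary is set to the LVIS class names at inference time; only the text embeddings change, and the detector is not retrained. A minimal sketch of that vocabulary swap, with a toy stand-in text encoder (all names, shapes, and the encoder here are assumptions for illustration, not the repo's API):

```python
import numpy as np

EMB_DIM = 4  # toy embedding size; real models use e.g. 512

def encode_text(names):
    """Stand-in for the frozen text encoder (e.g. CLIP's text tower).
    Seeds a generator from each name so embeddings are deterministic."""
    def emb(name):
        seed = sum(ord(c) for c in name)
        return np.random.default_rng(seed).normal(size=EMB_DIM)
    return np.stack([emb(n) for n in names])

def classify_regions(region_embs, class_names):
    """Re-embed the current vocabulary and pick the best class per region.
    Swapping class_names changes the vocabulary with no retraining."""
    t = encode_text(class_names)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    return (r @ t.T).argmax(axis=1)

# The same regions can be scored against the training vocabulary ...
train_vocab = ["person", "car", "dog"]
# ... or against a test vocabulary (e.g. the 1203 LVIS class names).
test_vocab = ["zebra", "birdhouse", "person"]

# Pretend one region embedding matches "person" exactly.
region = encode_text(["person"])
idx_train = classify_regions(region, train_vocab)[0]  # index into train_vocab
idx_test = classify_regions(region, test_vocab)[0]    # index into test_vocab
```

So for LVIS evaluation you would change the class list to the LVIS categories; the predicted indices then refer to whichever vocabulary is active at test time.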