AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

How can I find out which text encoder and image encoder YOLO-World uses? #474

Open 2hosun opened 2 weeks ago

2hosun commented 2 weeks ago

Hi, I've been looking at YOLO-World and I think it's a really creative idea. Reading the code, I'd like to understand how the text encoder and image encoder are used in YOLO-World. How can I check this?

If my question is too broad, you could use configs/pretrain/yolo_world_v2_l_clip_large_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py as an example.

I'd appreciate it if you could give me an answer.
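What I have tried so far is loading the config and printing its model section to see which encoders it wires up. A rough sketch, assuming mmengine is installed and the config path matches the repo layout:

```python
# Sketch: inspect which encoders a YOLO-World config wires up.
from mmengine.config import Config

cfg = Config.fromfile(
    "configs/pretrain/"
    "yolo_world_v2_l_clip_large_vlpan_bn_2e-3_100e_4x8gpus_"
    "obj365v1_goldg_train_lvis_minival.py"
)

# The model.backbone dict should name both branches (the YOLO image
# backbone and the text backbone); exact field names may differ by version.
print(cfg.model.backbone)
```

But I am not sure how to map what it prints back to the encoders in the paper's figure.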

2hosun commented 2 weeks ago

I think my explanation was insufficient, so to put it another way:

The paper's "Overall Architecture of YOLO-World" figure shows a text encoder, and I want to know where it is implemented in the model. I also understand that, as in CLIP, the text encoder and image encoder are trained contrastively so that they extract matching embedding vectors, so I would like to know about this image encoder as well.
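From the paper, my understanding of the contrastive part is that each region (object) embedding from the image branch is scored against the per-class text embeddings by cosine similarity. Roughly like this sketch, where the shapes, tensors, and the sigmoid scoring are my own assumptions for illustration:

```python
import torch

# Hypothetical shapes for illustration only.
num_regions, num_classes, dim = 100, 3, 512
region_feats = torch.randn(num_regions, dim)  # from the image/detection branch
text_feats = torch.randn(num_classes, dim)    # from the text encoder

# L2-normalize, so the dot product becomes cosine similarity
# (region-text matching).
region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

logits = region_feats @ text_feats.t()  # (num_regions, num_classes)
scores = logits.sigmoid()               # per-region, per-class scores
```

Is this the right mental model for how the two encoders interact?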

ZC0102-shu commented 9 hours ago

See yolo_world/models/backbones/mm_backbone.py: the text encoder there uses a pretrained CLIP model, and the image branch (the YOLO backbone) is wired up in the same multi-modal backbone.
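If it helps, here is a minimal, self-contained sketch of how a pretrained CLIP text encoder can produce per-class text embeddings via HuggingFace Transformers. The model name and the normalization are illustrative assumptions, not the exact code in mm_backbone.py:

```python
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection

# Assumption: a common public CLIP checkpoint; the repo's configs
# choose the actual checkpoint.
model_name = "openai/clip-vit-base-patch32"
tokenizer = AutoTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModelWithProjection.from_pretrained(model_name)
text_encoder.eval()

class_names = ["person", "dog", "backpack"]  # the open-vocabulary prompt
inputs = tokenizer(class_names, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = text_encoder(**inputs)
    # text_embeds: (num_classes, embed_dim) projected text features
    text_feats = outputs.text_embeds
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

print(text_feats.shape)  # e.g. torch.Size([3, 512])
```

These per-class text embeddings are what the detector's region features are matched against, which is what makes the vocabulary open.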