Open 2hosun opened 2 weeks ago
I think my explanation was insufficient, so to put it another way:
The Overall Architecture figure in the YOLO-World paper shows a text encoder, and I want to know where it is implemented in the model's code. I understand that, as in CLIP, the text encoder and image encoder are trained with contrastive learning to extract embedding vectors, so I would also like to know about this image encoder.
The text encoder is implemented in yolo_world/models/backbones/mm_backbone.py, which uses a pretrained CLIP model.
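For intuition, here is a minimal sketch of the CLIP-style matching idea the question refers to: class-name prompts are embedded by the text encoder, region features come from the image side, and detections are labeled by cosine similarity. The embeddings below are random stand-ins, not the real CLIP outputs, and the array shapes are illustrative assumptions only.

```python
import numpy as np

# Hypothetical illustration of CLIP-style open-vocabulary matching:
# text embeddings (one per class prompt) vs. visual region embeddings.
# In YOLO-World the real text embeddings come from the pretrained CLIP
# text encoder wrapped in yolo_world/models/backbones/mm_backbone.py.

rng = np.random.default_rng(0)
dim = 8
class_names = ["person", "dog", "car"]

# Stand-in for CLIP text embeddings, L2-normalized row-wise.
text_emb = rng.normal(size=(len(class_names), dim))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Stand-in for visual features of two detected regions; region 0 is
# nudged toward the "dog" prompt so the matching step is predictable.
region_emb = rng.normal(size=(2, dim))
region_emb[0] = text_emb[1] + 0.05 * rng.normal(size=dim)
region_emb /= np.linalg.norm(region_emb, axis=1, keepdims=True)

# Cosine similarity between every region and every class prompt;
# each region is labeled with its best-matching prompt.
sim = region_emb @ text_emb.T
labels = [class_names[i] for i in sim.argmax(axis=1)]
print(labels[0])  # region 0 matches "dog"
```

This is only the contrastive-matching idea in miniature; the actual model fuses text and image features inside its VL-PAN neck rather than matching once at the end.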
Hi, I'm looking at your YOLO-World and it feels like a really creative idea. Reading the code, I want to understand how the text encoder and image encoder are used in YOLO-World — how can I check this?
If my question is too broad, could you explain configs/pretrain/yolo_world_v2_l_clip_large_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py as an example?
I'd appreciate it if you could give me an answer.