AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0
4.67k stars 453 forks

Weights for SimpleYOLOWorldDetector #338

Open pelinsuacar opened 6 months ago

pelinsuacar commented 6 months ago

Hello,

Could you provide weights for trying SimpleYOLOWorldDetector with an image embedding as input instead of text? When I load from "yolow-v8_l_clipv2_frozen_t2iv2_bn_o365_goldg_pretrain.pth", I get an error:

Loads checkpoint by local backend from path: yolow-v8_l_clipv2_frozen_t2iv2_bn_o365_goldg_pretrain.pth

The model and loaded state dict do not match exactly

unexpected key in source state_dict: backbone.text_model.model.text_model.embeddings.token_embedding.weight, backbone.text_model.model.text_model.embeddings.position_embedding.weight, backbone.text_model.model.text_model.encoder.layers.0.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.0.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.0.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.0.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.0.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.0.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.0.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.0.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.0.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.0.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.0.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.0.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.0.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.0.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.0.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.0.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.1.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.1.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.1.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.1.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.1.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.1.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.1.self_attn.out_proj.weight, 
backbone.text_model.model.text_model.encoder.layers.1.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.1.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.1.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.1.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.1.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.1.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.1.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.1.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.1.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.2.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.2.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.2.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.2.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.2.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.2.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.2.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.2.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.2.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.2.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.2.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.2.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.2.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.2.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.2.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.2.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.3.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.3.self_attn.k_proj.bias, 
backbone.text_model.model.text_model.encoder.layers.3.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.3.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.3.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.3.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.3.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.3.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.3.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.3.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.3.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.3.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.3.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.3.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.3.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.3.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.4.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.4.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.4.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.4.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.4.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.4.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.4.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.4.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.4.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.4.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.4.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.4.mlp.fc1.bias, 
backbone.text_model.model.text_model.encoder.layers.4.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.4.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.4.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.4.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.5.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.5.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.5.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.5.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.5.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.5.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.5.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.5.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.5.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.5.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.5.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.5.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.5.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.5.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.5.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.5.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.6.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.6.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.6.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.6.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.6.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.6.self_attn.q_proj.bias, 
backbone.text_model.model.text_model.encoder.layers.6.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.6.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.6.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.6.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.6.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.6.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.6.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.6.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.6.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.6.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.7.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.7.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.7.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.7.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.7.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.7.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.7.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.7.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.7.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.7.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.7.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.7.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.7.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.7.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.7.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.7.layer_norm2.bias, 
backbone.text_model.model.text_model.encoder.layers.8.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.8.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.8.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.8.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.8.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.8.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.8.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.8.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.8.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.8.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.8.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.8.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.8.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.8.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.8.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.8.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.9.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.9.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.9.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.9.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.9.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.9.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.9.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.9.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.9.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.9.layer_norm1.bias, 
backbone.text_model.model.text_model.encoder.layers.9.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.9.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.9.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.9.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.9.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.9.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.10.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.10.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.10.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.10.self_attn.v_proj.bias, backbone.text_model.model.text_model.encoder.layers.10.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.10.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.10.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.10.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.10.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.10.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.10.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.10.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.10.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.10.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.10.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.10.layer_norm2.bias, backbone.text_model.model.text_model.encoder.layers.11.self_attn.k_proj.weight, backbone.text_model.model.text_model.encoder.layers.11.self_attn.k_proj.bias, backbone.text_model.model.text_model.encoder.layers.11.self_attn.v_proj.weight, backbone.text_model.model.text_model.encoder.layers.11.self_attn.v_proj.bias, 
backbone.text_model.model.text_model.encoder.layers.11.self_attn.q_proj.weight, backbone.text_model.model.text_model.encoder.layers.11.self_attn.q_proj.bias, backbone.text_model.model.text_model.encoder.layers.11.self_attn.out_proj.weight, backbone.text_model.model.text_model.encoder.layers.11.self_attn.out_proj.bias, backbone.text_model.model.text_model.encoder.layers.11.layer_norm1.weight, backbone.text_model.model.text_model.encoder.layers.11.layer_norm1.bias, backbone.text_model.model.text_model.encoder.layers.11.mlp.fc1.weight, backbone.text_model.model.text_model.encoder.layers.11.mlp.fc1.bias, backbone.text_model.model.text_model.encoder.layers.11.mlp.fc2.weight, backbone.text_model.model.text_model.encoder.layers.11.mlp.fc2.bias, backbone.text_model.model.text_model.encoder.layers.11.layer_norm2.weight, backbone.text_model.model.text_model.encoder.layers.11.layer_norm2.bias, backbone.text_model.model.text_model.final_layer_norm.weight, backbone.text_model.model.text_model.final_layer_norm.bias, backbone.text_model.model.text_projection.weight

missing keys in source state_dict: embeddings

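Every unexpected key in the log above starts with backbone.text_model., and the only missing key is embeddings. As a rough sketch of how one might inspect that mismatch before loading, the snippet below works on plain dictionaries, so it needs no particular framework. The helper names split_by_prefix and diff_keys are hypothetical, and the image-backbone key is a placeholder that does not come from the log. Dropping the text-model entries only clears the "unexpected key" report: embeddings would still be missing and would have to be supplied separately, and filtering does nothing about any differences between checkpoint versions.

```python
from typing import Dict, Iterable, List, Tuple

# Prefix shared by every "unexpected key" reported in the log above.
TEXT_MODEL_PREFIX = "backbone.text_model."


def split_by_prefix(
    state_dict: Dict[str, object], prefix: str = TEXT_MODEL_PREFIX
) -> Tuple[Dict[str, object], List[str]]:
    """Return the entries not under `prefix`, plus the names that were dropped."""
    kept = {k: v for k, v in state_dict.items() if not k.startswith(prefix)}
    dropped = [k for k in state_dict if k.startswith(prefix)]
    return kept, dropped


def diff_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Keys only in the checkpoint (unexpected) and keys only in the model (missing)."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return sorted(ckpt_set - model_set), sorted(model_set - ckpt_set)


# Toy stand-ins: values are dummies, only the key names matter here.
checkpoint = {
    "backbone.text_model.model.text_projection.weight": 0,  # name taken from the log
    "backbone.image_model.stem.conv.weight": 0,  # placeholder name, not from the log
}
model_keys = ["embeddings", "backbone.image_model.stem.conv.weight"]

kept, dropped = split_by_prefix(checkpoint)
unexpected, missing = diff_keys(model_keys, kept)

print(dropped)     # ['backbone.text_model.model.text_projection.weight']
print(unexpected)  # []
print(missing)     # ['embeddings']
```

With a real checkpoint, the same two helpers could be applied to the key names of the loaded state dict and of the model, assuming both are available as ordinary mappings.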

wondervictor commented 6 months ago

Firstly, please stop using yolow-v8_l_clipv2_frozen_t2iv2_bn_o365_goldg_pretrain.pth from the HuggingFace demo. If you use the code in this repo, please use the pre-trained weights from this repo. The two versions differ slightly.

Secondly, SimpleYOLOWorldDetector is designed for prompt tuning and the re-parameterized version; please see docs/reparameterize and docs/prompt_yolo_world for details.

pelinsuacar commented 6 months ago

So, should I use YOLOWorldPromptDetector instead? But there is no such detector in YOLO-World/yolo_world/models/detectors/yolo_world.py?

wondervictor commented 6 months ago

YOLOWorldPromptDetector has been deprecated. Please use SimpleYOLOWorldDetector instead. However, you may need to refer to fine-tuning-yolo-world to determine how to set it up.

pelinsuacar commented 5 months ago

Okay, I am a bit confused. Is it possible to test zero-shot inference with an embedding instead of text as a first step? I just want to give the embedding of an object as input, to get rid of the language model, and see whether the model can detect that object in my target image. For this, I need to use SimpleYOLOWorldDetector, if I understand correctly, because YoloWorldDetector includes the language model. But I couldn't find a way to initialize my SimpleYOLOWorldDetector with appropriate weights. Could you please explain whether that's possible? Thank you!

pelinsuacar commented 5 months ago

So my question is: where can I find 'pretrained_models/yolo_world_l_clip_t2i_bn_2e-3adamw_32xb16-100e_obj365v1_goldg_cc3mlite_train-ca93cd1f.pth', which is specified in one of the SimpleYOLOWorldDetector config files?

wondervictor commented 5 months ago

Please check the following model zoo for the pre-trained weights: https://github.com/AILab-CVC/YOLO-World?tab=readme-ov-file#zero-shot-inference-on-lvis-dataset

wondervictor commented 5 months ago

BTW, SimpleYOLOWorldDetector is a general detector class and does not have a specific pre-trained weight.

pelinsuacar commented 5 months ago

After fine-tuning the reparameterized SimpleYOLOWorldDetector, how can I test zero-shot inference? Is it possible to give the model both a target image and an embedding of the object to be detected in it as inputs at inference time? Since the precomputed text embeddings are converted into the weights of certain layers, there are no dynamic text embeddings during inference; the model relies on the precomputed, integrated embeddings. So how can we say that "Reparameterized YOLO-World still has zero-shot ability" in this case? @wondervictor
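To make the premise of the question above concrete, here is a toy sketch in plain Python. It is not YOLO-World's code. It assumes a classification head that scores a region feature by cosine similarity against class embeddings, and it leaves out any scaling, bias, or normalization layers a real head may have. The names similarity_logits and FrozenHead are hypothetical. Under those assumptions, a head with the embeddings baked in as fixed weights gives the same scores as the dynamic path, but its vocabulary is fixed when the weights are built, so a different set of embeddings means building a new head instead of passing them at inference time.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def similarity_logits(feature: Vector, embeddings: List[Vector]) -> List[float]:
    """Dynamic path: embeddings are passed in at inference time."""
    f = l2_normalize(feature)
    return [dot(f, l2_normalize(e)) for e in embeddings]


class FrozenHead:
    """Re-parameterized path: embeddings become fixed weights at build time."""

    def __init__(self, embeddings: List[Vector]) -> None:
        self.weight = [l2_normalize(e) for e in embeddings]

    def __call__(self, feature: Vector) -> List[float]:
        f = l2_normalize(feature)
        return [dot(f, w) for w in self.weight]


# Toy numbers, chosen only so the arithmetic is easy to follow.
embeddings = [[1.0, 0.0], [0.0, 2.0]]  # two "classes"
feature = [3.0, 4.0]                   # one region feature

dynamic = similarity_logits(feature, embeddings)
frozen = FrozenHead(embeddings)(feature)
print(dynamic, frozen)  # both paths give the same two scores

# A new vocabulary means building a new head from the new embeddings.
new_head = FrozenHead([[1.0, 1.0]])
print(new_head(feature))
```

In this sketch the embeddings could come from any encoder that shares the feature space, text or image; whether that carries over to a fine-tuned, re-parameterized model is exactly what the question asks.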