about Fig. 4, ImagePoolingAttentionModule

AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection

https://www.yoloworld.cc

GNU General Public License v3.0


about Fig. 4, ImagePoolingAttentionModule #85

Closed by Outlying3720 6 months ago

Outlying3720 commented 6 months ago

Hello! I noticed that Fig. 4 draws two ImagePoolingAttentionModule blocks, but in the code, https://github.com/AILab-CVC/YOLO-World/blob/1bb36a8e6cab76e190cb466eef160aa3f26a49cf/yolo_world/models/necks/yolo_world_pafpn.py#L219, ImagePoolingAttentionModule seems to run only once.

Since the image-aware embeddings are never used later in inference (you use the original text features for the cosine-similarity matching, right?), is the second ImagePoolingAttentionModule in Fig. 4 unnecessary?
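To make the question concrete, here is a minimal pure-Python sketch of the situation as I understand it. It is not the repository's code: `toy_text_update` is a hypothetical stand-in for a final ImagePoolingAttentionModule (a simple residual add rather than real attention), and `cosine_scores` assumes classification is a cosine similarity between region features and the original text features.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_scores(region_feats, text_feats):
    """Return a [num_regions][num_classes] matrix of cosine similarities."""
    regions = [l2_normalize(r) for r in region_feats]
    texts = [l2_normalize(t) for t in text_feats]
    return [[sum(a * b for a, b in zip(r, t)) for t in texts] for r in regions]


def toy_text_update(text_feats, image_feats):
    """Hypothetical stand-in for an image-to-text update (NOT real attention):
    adds the mean image feature to every text embedding as a residual."""
    dim = len(text_feats[0])
    mean_img = [sum(f[d] for f in image_feats) / len(image_feats) for d in range(dim)]
    return [[t[d] + mean_img[d] for d in range(dim)] for t in text_feats]


region_feats = [[1.0, 0.0], [0.0, 2.0]]  # toy region (image) features
text_feats = [[3.0, 0.0], [0.0, 1.0]]    # toy original text embeddings

# The final update produces image-aware embeddings ...
image_aware = toy_text_update(text_feats, region_feats)

# ... but matching (under the assumption above) reads only the original text features,
# so `image_aware` has no effect on the scores.
scores = cosine_scores(region_feats, text_feats)
```

In this toy setup `scores` is the same whether or not `toy_text_update` runs, which is why a second module whose output is never consumed looks redundant to me.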

wondervictor commented 6 months ago

Hi @Outlying3720, thanks for your question, it's very helpful! We had already found this issue and have optimized it in the next version (YOLO-World-v2 is coming soon). The next version achieves better performance at higher speed.

Outlying3720 commented 6 months ago

Wow, looking forward to YOLO-World-v2!