AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

How is ImagePoolingAttentionModule invoked? #255

Open H1NATA111 opened 5 months ago

H1NATA111 commented 5 months ago

Thank you for your excellent work! Using the config file configs/pretrain/yolo_world_v2_l_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py, I tried to step through your model with breakpoints, but during training the model never calls the ImagePoolingAttentionModule, i.e., the "Image-Pooling Attention" module described in the paper. Under what circumstances does the model call this module to update the text features?
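For readers unfamiliar with the module being asked about: the paper describes Image-Pooling Attention as max-pooling the multi-scale image features into 3×3 grids of patch tokens and letting the text embeddings attend to them. The sketch below is a minimal, self-contained illustration of that idea, not the repo's implementation; the class name, channel dimensions, and head count are assumptions chosen for the example.

```python
# Minimal sketch of Image-Pooling Attention as described in the paper:
# pool each FPN level to a 3x3 grid of patch tokens, then update the text
# embeddings with multi-head attention over those tokens (residual update).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImagePoolingAttentionSketch(nn.Module):
    def __init__(self, text_dim=512, image_dims=(256, 512, 1024), num_heads=8, pool_size=3):
        super().__init__()
        # Project each feature level to the text embedding dimension.
        self.projections = nn.ModuleList(nn.Conv2d(c, text_dim, 1) for c in image_dims)
        self.pool_size = pool_size
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_embeds, image_feats):
        # text_embeds: (B, num_classes, text_dim); image_feats: list of (B, C_l, H_l, W_l)
        patch_tokens = []
        for proj, feat in zip(self.projections, image_feats):
            # Max-pool each scale down to a pool_size x pool_size grid of tokens.
            pooled = F.adaptive_max_pool2d(proj(feat), self.pool_size)
            patch_tokens.append(pooled.flatten(2).transpose(1, 2))   # (B, 9, text_dim)
        patch_tokens = torch.cat(patch_tokens, dim=1)                # (B, 27, text_dim)
        # Text embeddings query the pooled image tokens; residual update of text features.
        updated, _ = self.attn(text_embeds, patch_tokens, patch_tokens)
        return text_embeds + updated


if __name__ == "__main__":
    feats = [torch.randn(2, c, s, s) for c, s in zip((256, 512, 1024), (80, 40, 20))]
    texts = torch.randn(2, 80, 512)
    print(ImagePoolingAttentionSketch()(texts, feats).shape)  # torch.Size([2, 80, 512])
```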

wondervictor commented 5 months ago

Hi @H1NATA111, since the I-PoolingAttention and the L2-norm are not efficient enough for TensorRT deployment, we have removed them in the new version (YOLO-World-V2). All pre-trained checkpoints have been released, and we suggest moving to the latest version of YOLO-World.
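If you want to confirm empirically whether the module exists and is actually invoked in the checkpoint/config you are running, a generic forward-hook check works without touching the repo's internals. In this sketch, `model` stands for any already-built YOLO-World model, `inputs` is whatever tuple its forward expects, and the substring "ImagePoolingAttention" is an assumption about the class name in the version you use.

```python
# Count how many times any module whose class name matches a substring is called
# during one forward pass. Prints nothing found if the module is not instantiated.
import torch


def report_module_calls(model, inputs, name_substring="ImagePoolingAttention"):
    hits, handles = {}, []
    for name, module in model.named_modules():
        if name_substring.lower() in type(module).__name__.lower():
            hits[name] = 0
            handles.append(module.register_forward_hook(
                lambda m, i, o, key=name: hits.__setitem__(key, hits[key] + 1)))
    if not hits:
        print(f"No module whose class name contains '{name_substring}' was found.")
    else:
        with torch.no_grad():
            model(*inputs)
        for name, count in hits.items():
            print(f"{name}: called {count} time(s)")
    for h in handles:
        h.remove()
```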

ljj7975 commented 4 months ago

Q1. How much does I-Pooling contribute to the final performance? Wouldn't performance degrade if I-Pooling is dropped?
Q2. Am I right that, without I-Pooling, the text features are left untouched in YOLO-World, i.e. they are the same as the CLIP text embeddings?
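To make Q2 concrete: if the text features really are untouched, the text path is just the frozen CLIP text encoder followed by normalization. The sketch below reproduces that path with Hugging Face's CLIP; the checkpoint name and the L2 normalization step are assumptions about what YOLO-World does, not code from the repo.

```python
# Produce "untouched" text embeddings: frozen CLIP text encoder + L2 normalization.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model_name = "openai/clip-vit-base-patch32"  # assumed checkpoint, for illustration only
tokenizer = CLIPTokenizer.from_pretrained(model_name)
clip = CLIPModel.from_pretrained(model_name).eval()

prompts = ["person", "bicycle", "traffic light"]
tokens = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = clip.get_text_features(**tokens)   # (num_classes, 512)
text_embeds = F.normalize(text_embeds, dim=-1)
print(text_embeds.shape)
```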

ljj7975 commented 4 months ago

Based on the performance reported in the README, I-Pooling doesn't seem to help. Please correct me if I am missing anything.

[screenshot: YOLO-World-V2 results table from the README]

[screenshot: YOLO-World-V1 results table from the README]

wondervictor commented 4 months ago

Hi @ljj7975, adding I-PoolingAttention is effective for pre-training with large-scale region-text pairs; it brings a 0.5~1.5 AP improvement on the LVIS minival evaluation. The motivation for removing I-PoolingAttention is that we found it hard to use in some deployment cases, especially edge applications, even though its latency is small. V2 and V1 differ in several ways: the I-PoolingAttention, the BatchNorm in the contrastive head, and the training strategies. The V2 version targets practical applications and has been evaluated in different deployment scenarios.
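To make the contrastive-head difference mentioned above concrete, here is a rough sketch of the two variants: a V1-style head that L2-normalizes both object and text embeddings (cosine similarity), and a V2-style head that applies BatchNorm to the image embeddings instead, which is friendlier to TensorRT export. Layer names, shapes, and the placement of the scale and bias are assumptions for illustration, not the repo's code.

```python
# Sketch of the two contrastive-head variants: L2-norm (V1-style) vs BatchNorm (V2-style).
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2NormContrastiveHead(nn.Module):
    """V1-style: cosine similarity between normalized object and text embeddings."""
    def __init__(self):
        super().__init__()
        self.logit_scale = nn.Parameter(torch.tensor(1.0))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, obj_embeds, text_embeds):
        # obj_embeds: (B, D, H, W); text_embeds: (B, num_classes, D)
        obj = F.normalize(obj_embeds, dim=1)
        txt = F.normalize(text_embeds, dim=-1)
        return torch.einsum("bdhw,bkd->bkhw", obj, txt) * self.logit_scale.exp() + self.bias


class BNContrastiveHead(nn.Module):
    """V2-style: BatchNorm on the image embeddings instead of L2 normalization."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.norm = nn.BatchNorm2d(embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1.0))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, obj_embeds, text_embeds):
        obj = self.norm(obj_embeds)
        txt = F.normalize(text_embeds, dim=-1)
        return torch.einsum("bdhw,bkd->bkhw", obj, txt) * self.logit_scale.exp() + self.bias
```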