cjw2021 / QAHOI

Apache License 2.0

Question about the performance influence of the resolution of pretrained swin transformer on the HICO-DET #10

Closed truetone2022 closed 1 year ago

truetone2022 commented 2 years ago

Have you tried training QAHOI with a Swin Transformer backbone pre-trained on ImageNet-22K at 224x224 resolution?

cjw2021 commented 2 years ago

image

Here are some other experiments we ran. The Swin-Base* in this table is pre-trained on ImageNet-22K at 224x224 resolution.

truetone2022 commented 2 years ago

Thanks! I also wonder why QAHOI with the R50 backbone cannot surpass QPIC with the R50 backbone. Intuitively, deformable-detr-r50 should perform better than detr-r50 on the HICO-DET dataset.

truetone2022 commented 2 years ago
image

Both of them are fine-tuned on COCO, but QAHOI does not get a large performance boost the way QPIC does, which is strange.

cjw2021 commented 2 years ago
| Model | Epochs | AP | AP-S | AP-M | AP-L |
| --- | --- | --- | --- | --- | --- |
| Deformable DETR | 50 | 44.5 | 27.1 | 47.6 | 59.6 |
| DETR | 500 | 42.0 | 20.5 | 45.8 | 61.1 |

We use the deformable-detr-r50 trained for 50 epochs on COCO.
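To make the size-wise gap in the table above concrete, here is a small sketch that computes the per-metric difference between the two detectors. The numbers are copied from the table; the dictionary and function names are hypothetical and chosen only for illustration.

```python
# COCO detection results copied from the table above.
COCO_AP = {
    "Deformable DETR": {"AP": 44.5, "AP-S": 27.1, "AP-M": 47.6, "AP-L": 59.6},
    "DETR": {"AP": 42.0, "AP-S": 20.5, "AP-M": 45.8, "AP-L": 61.1},
}


def ap_gap(results, model_a, model_b):
    """Return model_a minus model_b for every metric, rounded to one decimal."""
    return {
        metric: round(results[model_a][metric] - results[model_b][metric], 1)
        for metric in results[model_a]
    }


gap = ap_gap(COCO_AP, "Deformable DETR", "DETR")
for metric, delta in gap.items():
    # A positive delta means Deformable DETR is ahead on that metric.
    print(f"{metric}: {delta:+.1f}")
```

With these numbers, Deformable DETR is ahead on AP, AP-S, and AP-M, and behind only on AP-L.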

I think there are three main reasons why QAHOI-r50-fine is lower than QPIC-r50-fine:

  1. The AP-L of deformable-detr-r50 is lower than that of detr-r50.
    • QPIC uses only the low-resolution, high-level feature map, which is beneficial for large-object detection. Although we have not evaluated the spatial distribution of each of the 600 HOI categories, large targets may be more important than medium or small ones.
    • The deformable-detr-r50 trained for 150 epochs performs better. Starting from a deformable-detr with a higher AP-L may give better results on the HOI task.
  2. The deformable-detr-r50 uses four feature maps.
    • As the multi-scale experiments with QAHOI-swin-tiny show, using three feature maps works better.
  3. The attention mechanism.
    • The global attention used by DETR may work better here than the deformable attention used by Deformable-DETR. For reference, you can check this paper: Vision Transformer with Deformable Attention.
    • Although QAHOI uses multi-scale feature maps to improve the detection part, the recognition part, which requires global semantic information, may be weaker than in QPIC.
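
The scale and attention points above can be illustrated with a rough count of how many positions a single query can look at. This is only a sketch under stated assumptions: a hypothetical 800x1200 input, DETR attending over a single stride-32 feature map, and Deformable DETR with its default four feature levels (strides 8, 16, 32, 64) and four sampling points per level per head. The function names are made up for illustration and are not part of either codebase.

```python
import math


def feature_map_size(height, width, stride):
    """Spatial size of a feature map at the given stride (rounded up)."""
    return math.ceil(height / stride), math.ceil(width / stride)


def global_attention_keys(height, width, stride=32):
    """Global attention: every query attends to every position of one map."""
    h, w = feature_map_size(height, width, stride)
    return h * w


def deformable_attention_keys(n_levels=4, n_points=4):
    """Deformable attention: each head samples a few points per feature level."""
    return n_levels * n_points


height, width = 800, 1200  # assumed input size, for illustration only
for stride in (8, 16, 32, 64):
    print(f"stride {stride}: {feature_map_size(height, width, stride)}")

print("global keys per query:", global_attention_keys(height, width))
print("deformable samples per query per head:", deformable_attention_keys())
```

Under these assumptions a DETR query can weigh every position of the low-resolution map, while a deformable query sees only a small set of sampled points per head, which is one way to read the concern about global semantic information in point 3.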

If you have any ideas, feel free to point them out.

truetone2022 commented 2 years ago

Thank you very much for your helpful reply!