IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Questions about the hw-modulated attention in DAB-DETR #193

[Open] Artificial-Inability opened this issue 1 year ago

Artificial-Inability commented 1 year ago

I have two questions about the hw-modulated attention equation (Eq. (6) in DAB-DETR):

  1. Why use 1/wq and 1/hq instead of wq and hq? Does that mean an anchor with a larger width results in a narrower attention map in the x direction?
  2. DAB-DETR already updates the 4D anchor in each decoder layer using the embedding of the previous layer through an MLP, so why do we still need wref and href, which are also generated from the embedding of the previous layer through an MLP? Is that necessary?
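
For reference, the modulated positional attention (Eq. (6)), as far as I can reconstruct it from the paper (so treat this as a sketch rather than a verbatim copy), is roughly:

```latex
% Eq. (6) of DAB-DETR, reconstructed from memory -- a sketch, not a verbatim copy.
% wref, href (the paper's w_{q,ref}, h_{q,ref}) are predicted by an MLP from the
% decoder embedding; wq, hq come from the 4D anchor box; D is the embedding dimension.
\mathrm{ModulateAttn}\big((x, y), (x_{ref}, y_{ref})\big) =
  \frac{\mathrm{PE}(x)\cdot\mathrm{PE}(x_{ref})\,\frac{w_{ref}}{w_q}
      + \mathrm{PE}(y)\cdot\mathrm{PE}(y_{ref})\,\frac{h_{ref}}{h_q}}{\sqrt{D}}
```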
SlongLiu commented 1 year ago
  1. 1/wq makes sure that the attention maps have a shape similar to the anchor boxes. For example, a large w results in a flatter attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper.

  2. href and wref are designed to keep the same dimension as hq and wq. It helps the final performance.
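
To make the flattening effect concrete, here is a small numerical sketch. It is not the DAB-DETR implementation: it isolates the positional x-term of the equation sketched above, assumes a standard sinusoidal PE, wref = 1, and measures "width" as the number of columns holding 80% of the softmax-normalized attention mass; all of these are illustrative choices.

```python
import numpy as np

# Rough illustration (not DAB-DETR code) of how dividing the positional
# similarity by w_q flattens the attention profile along x after softmax.

def sinusoidal_pe(pos, dim=128, temperature=10000.0):
    """Standard 1D sinusoidal positional encoding (assumed form, for illustration)."""
    i = np.arange(dim // 2)
    freqs = temperature ** (2.0 * i / dim)
    angles = pos[..., None] / freqs                      # (..., dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def x_attention_profile(w_q, w_ref=1.0, n=200, x_ref=100, dim=128):
    """Softmax over x of the modulated x-term PE(x)·PE(x_ref) * w_ref / w_q."""
    xs = np.arange(n, dtype=float)                       # feature-map columns
    pe_x = sinusoidal_pe(xs, dim)                        # (n, dim)
    pe_ref = sinusoidal_pe(np.array([float(x_ref)]), dim)[0]
    scores = pe_x @ pe_ref * (w_ref / w_q) / np.sqrt(dim)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

for w_q in (1.0, 3.0):                                   # "narrow" vs. "wide" anchor
    probs = x_attention_profile(w_q)
    order = np.argsort(probs)[::-1]
    k = int(np.searchsorted(np.cumsum(probs[order]), 0.8)) + 1
    print(f"w_q = {w_q}: ~{k}/{len(probs)} columns hold 80% of the attention mass")
```

Scaling the x-term down (larger w_q) lowers the contrast of the scores, so the softmax-normalized profile spreads over more columns, i.e. the map is wider in x.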

Artificial-Inability commented 1 year ago
> 1/wq makes sure that the attention maps have a shape similar to the anchor boxes. For example, a large w results in a flatter attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper.

Could you give a more detailed explanation of how this works? My understanding of "H=1, W=3" in Figure 6 of the DAB-DETR paper is that href/hq = 1 and wref/wq = 3, in which case a larger wq leads to a smaller W. If I have misunderstood something, what is the definition of H and W in Figure 6? Thanks.

SlongLiu commented 1 year ago
> > 1/wq makes sure that the attention maps have a shape similar to the anchor boxes. For example, a large w results in a flatter attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper.
> >
> > Could you give a more detailed explanation of how this works? My understanding of "H=1, W=3" in Figure 6 of the DAB-DETR paper is that href/hq = 1 and wref/wq = 3, in which case a larger wq leads to a smaller W. If I have misunderstood something, what is the definition of H and W in Figure 6? Thanks.

The results in Fig. 6 are examples. "H=1, W=3" means hq = 1 and wq = 3. We assume href and wref are 1.

Artificial-Inability commented 1 year ago
> > > 1/wq makes sure that the attention maps have a shape similar to the anchor boxes. For example, a large w results in a flatter attention map in the x direction under the 1/wq formulation. We provide some visualizations in our paper.
> > >
> > > Could you give a more detailed explanation of how this works? My understanding of "H=1, W=3" in Figure 6 of the DAB-DETR paper is that href/hq = 1 and wref/wq = 3, in which case a larger wq leads to a smaller W. If I have misunderstood something, what is the definition of H and W in Figure 6? Thanks.
>
> The results in Fig. 6 are examples. "H=1, W=3" means hq = 1 and wq = 3. We assume href and wref are 1.

I still can't understand this phenomenon theoretically. If the original value of the attention map at a fixed point is calculated as PE(x)·PE(xref)·wref/wq + ..., then when we increase wq to wq' = 3*wq, the new value should decrease, which would result in a narrower attention map. Could you explain theoretically why a larger wq leads to a wider attention map under this formulation? Thanks.
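
To spell out the comparison I am making (this just restates the x-term of the sketch above in my own notation; it is simple algebra, not a new claim from the paper):

```latex
% Raw (pre-softmax) x-term at a fixed point x, for anchor widths w_q and w_q' = 3 w_q
% (same assumed notation as the Eq. (6) sketch above).
S(x; w_q) = \mathrm{PE}(x)\cdot\mathrm{PE}(x_{ref})\,\frac{w_{ref}}{w_q},
\qquad
S(x; 3 w_q) = \tfrac{1}{3}\, S(x; w_q)
```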