Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence". (https://arxiv.org/abs/2108.06152)
Apache License 2.0

The diagonal matrix meaning? #16

Closed JosonChan1998 closed 2 years ago

JosonChan1998 commented 2 years ago

Hi, thanks for your nice work on Transformers for object detection. I have some questions from reading the paper and code, and I hope you can answer them.

  1. What is the insight behind the `pos_transformation` T in Section 3.3?

  2. What is the meaning of the diagonal vector \lambda_q described in Section 3.3? I can't find any diagonal operator in this repo; the `pos_transformation` is just generated by learnable weights: https://github.com/Atten4Vis/ConditionalDETR/blob/0b04a859c7fac33a866fcdea06f338610ba6e9d8/models/transformer.py#L151

  3. I can't figure out the difference between "Block", "Full", and "Diagonal" in Fig. 5.

The above are all my questions. I sincerely hope I can get your help. Thanks!

DeppMeng commented 2 years ago

Sorry for the late reply.

  1. About T. T is a learnable linear projection, obtained by applying an FFN to the decoder embedding f. Since f contains displacement information of the distinct regions w.r.t. the reference point, we expect T to act as a displacement transformation in the positional embedding space. T could be a full matrix, a block matrix, or a diagonal matrix; we empirically studied these variants and chose the diagonal option.
  2. \lambda_q is the vector of diagonal elements of the matrix T. It is `pos_transformation` in our code.
  3. For details, please refer to the paragraph "The effect of linear projections T forming the transformation." in our paper.
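To make the answer concrete: since T is diagonal, applying it to the positional embedding p reduces to an elementwise product with \lambda_q. A minimal PyTorch sketch of this pipeline (the FFN here and all tensor names are illustrative, not the repo's exact code):

```python
import torch
import torch.nn as nn

d_model = 256
num_queries = 4

# Hypothetical FFN mapping the decoder embedding f to the diagonal
# entries lambda_q of T (the repo uses its own MLP definition).
ffn = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.ReLU(),
    nn.Linear(d_model, d_model),
)

f = torch.randn(num_queries, d_model)  # decoder embeddings
p = torch.randn(num_queries, d_model)  # sine embeddings of the reference points

lambda_q = ffn(f)          # diagonal entries of T, per query
p_transformed = lambda_q * p  # equals diag(lambda_q) @ p, per query
```

Because only the diagonal is learned, the transformation costs O(d) per query instead of O(d^2) for a full matrix.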
JosonChan1998 commented 2 years ago

Thanks for your reply!

WYHZQ commented 1 year ago

> 1. About T. T is a learnable linear projection, obtained by applying an FFN to the decoder embedding f. Since f contains displacement information of the distinct regions w.r.t. the reference point, we expect T to act as a displacement transformation in the positional embedding space. T could be a full matrix, a block matrix, or a diagonal matrix; we empirically studied these variants and chose the diagonal option.
>
> 2. \lambda_q is the vector of diagonal elements of the matrix T. It is `pos_transformation` in our code.
>
> 3. For details, please refer to the paragraph "The effect of linear projections T forming the transformation." in our paper.

Thank you for your reply. You said \lambda_q is the diagonal elements of matrix T, but the `pos_transformation` obtained from the FFN is never turned into a diagonal matrix; it is multiplied elementwise with `query_sine_embed` directly, i.e. `query_sine_embed = query_sine_embed * pos_transformation`. Can you explain the principle?
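For readers with the same confusion: no diagonal matrix needs to be materialized, because multiplying a vector by diag(v) is identical to an elementwise product with v. A quick numerical check (a sketch, not the repo's code):

```python
import torch

lambda_q = torch.randn(256)  # would-be diagonal entries of T
p = torch.randn(256)         # a positional (sine) embedding

full = torch.diag(lambda_q) @ p  # explicit diagonal matrix multiply
fast = lambda_q * p              # elementwise product, as in the repo

assert torch.allclose(full, fast, atol=1e-6)
```

So the line `query_sine_embed * pos_transformation` is exactly the diagonal transformation described in the paper, just written in its cheaper elementwise form.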

Vincent-luo commented 10 months ago

@WYHZQ Have you figured it out? I have the same confusion.