Thanks for your reply!
Sorry for the late reply.
1. About T: T is a learnable linear projection, obtained by applying an FFN to the decoder embedding f. Since f contains displacement information of the distinct regions w.r.t. the reference point, we expect T to act as a displacement transformation in the embedding space of p. T could be a full matrix, a block matrix, or a diagonal matrix; we empirically studied these variants and chose the diagonal option.
2. \lambda_q consists of the diagonal elements of the matrix T. It is `pos_transformation` in our code.
3. For details, please refer to the paragraph "The effect of linear projections T forming the transformation." in our paper.
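For reference, the relevant logic boils down to something like the sketch below. This is a simplified paraphrase, not the exact code from the repo: the two-layer FFN here is a stand-in for the repo's `query_scale` MLP, and the shapes are illustrative.

```python
import torch
import torch.nn as nn

d_model = 256

# Stand-in for the repo's `query_scale` FFN: it maps the decoder embedding f
# to the d_model diagonal entries lambda_q of T.
query_scale = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.ReLU(),
    nn.Linear(d_model, d_model),
)

f = torch.randn(300, 1, d_model)    # decoder embedding f (num_queries, batch, d_model)
p_s = torch.randn(300, 1, d_model)  # sinusoidal embedding of the reference point s

lambda_q = query_scale(f)           # `pos_transformation` in the code
p_q = p_s * lambda_q                # elementwise product == applying diag(lambda_q)
```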
Thank you for your reply. You said \lambda_q consists of the diagonal elements of the matrix T, but the `pos_transformation` obtained from the FFN is never used to extract or build a diagonal; it is directly multiplied elementwise with `query_sine_embed`, i.e. `query_sine_embed = query_sine_embed * pos_transformation`. Can you explain the principle?
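For anyone else hitting this: multiplying a vector by a diagonal matrix is exactly the same as taking an elementwise product with the diagonal entries, so the code never needs to materialize T. A minimal check of the equivalence (illustrative only, with an arbitrary dimension of 256):

```python
import torch

lambda_q = torch.randn(256)  # diagonal entries of T, as predicted by the FFN
p_s = torch.randn(256)       # sinusoidal positional embedding

out_diag = torch.diag(lambda_q) @ p_s  # explicit diagonal-matrix multiplication
out_elem = lambda_q * p_s              # elementwise product used in the code

assert torch.allclose(out_diag, out_elem)
```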
@WYHZQ Have you figured it out? I have the same confusion.
Hi, thanks for your nice work on Transformers for object detection. I have some questions after reading the paper and the code, and I hope you can give me some answers.
1. What's the insight behind the `pos_transformation` T in Section 3.3?
2. What's the meaning of the diagonal vector \lambda_q described in Section 3.3? I can't find any code for a diagonal operator in this repo; I only find that the `pos_transformation` is generated by learnable weights: https://github.com/Atten4Vis/ConditionalDETR/blob/0b04a859c7fac33a866fcdea06f338610ba6e9d8/models/transformer.py#L151
3. I can't figure out the difference between "Block", "Full", and "Diagonal" in Fig. 5.
The above are all my questions. I sincerely hope I can get your help. Thanks!