Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence" (https://arxiv.org/abs/2108.06152).
Apache License 2.0

What is the main purpose of concatenating (pos_y, pos_x) instead of (pos_x, pos_y)? #19

Closed: kuixu closed this issue 2 years ago

kuixu commented 2 years ago

Great work on improving the training convergence of DETR! One minor question: what is the main purpose of concatenating (pos_y, pos_x) rather than (pos_x, pos_y) in the decoder? https://github.com/Atten4Vis/ConditionalDETR/blob/0b04a859c7fac33a866fcdea06f338610ba6e9d8/models/transformer.py#L45
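To make the question concrete, here is a minimal NumPy sketch of a sine embedding for a normalized 2-D reference point, modeled on what the linked decoder line appears to compute. It is not the repository's code: the function name `sine_embed_point`, `num_feats=128` (256 channels in total), `temperature=10000`, and the 2π scale are assumptions made for illustration. The only thing it is meant to show is the final concatenation, where the y-half comes before the x-half.

```python
import math
import numpy as np


def sine_embed_point(xy, num_feats=128, temperature=10000.0):
    """Sine/cosine embedding of normalized (x, y) points (illustrative sketch).

    xy: array-like of shape (..., 2), with x and y in [0, 1].
    Returns an array of shape (..., 2 * num_feats) laid out as [pos_y, pos_x].
    """
    xy = np.asarray(xy, dtype=np.float64)
    # Channels 2k and 2k+1 share the frequency 1 / temperature**(2k / num_feats).
    dim_t = np.arange(num_feats, dtype=np.float64)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)

    def embed_axis(v):
        # Scale the coordinate to [0, 2*pi] and divide by every frequency.
        p = np.asarray(v * 2 * math.pi)[..., None] / dim_t
        # sin on even channels, cos on odd channels, interleaved pairwise.
        out = np.stack((np.sin(p[..., 0::2]), np.cos(p[..., 1::2])), axis=-1)
        return out.reshape(*p.shape[:-1], -1)

    pos_x = embed_axis(xy[..., 0])
    pos_y = embed_axis(xy[..., 1])
    # y-half first, then x-half: the (pos_y, pos_x) order asked about.
    return np.concatenate((pos_y, pos_x), axis=-1)
```

For a single hypothetical point (x=0.25, y=0.75) the result has 256 entries; entry 0 is sin(2π·0.75) = -1 from the y-half, and entry 128 is sin(2π·0.25) = 1 from the x-half.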

DeppMeng commented 2 years ago

Thanks for your interest. Regarding the order of pos_y and pos_x, we simply follow the default order used in DETR's positional encoding; please refer to https://github.com/Atten4Vis/ConditionalDETR/blob/0b04a859c7fac33a866fcdea06f338610ba6e9d8/models/position_encoding.py#L55
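One way to see why following DETR's order is sufficient: ignoring any learned projections, the dot product of two such sine embeddings reduces per axis to a sum of cos(ω·Δ) terms (since sin a·sin b + cos a·cos b = cos(a − b)), so it peaks where the query and key positions coincide, provided both sides use the same layout. The sketch below is a simplified illustration under stated assumptions: the helpers `sine_embed` and `best_match` are hypothetical, it uses NumPy rather than PyTorch, and the keys are placed at grid coordinates normalized to (0, 1] as a rough stand-in for DETR's normalized positional encoding. It shows that switching to (pos_x, pos_y) on both sides changes nothing, whereas switching only one side makes the query match the transposed location.

```python
import math
import numpy as np


def sine_embed(xy, num_feats=128, temperature=10000.0, order="yx"):
    """Sine embedding of normalized (x, y) points: (..., 2) -> (..., 2 * num_feats).

    order="yx" concatenates (pos_y, pos_x) as DETR does; "xy" swaps the halves.
    """
    xy = np.asarray(xy, dtype=np.float64)
    dim_t = np.arange(num_feats, dtype=np.float64)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)

    def embed_axis(v):
        p = np.asarray(v * 2 * math.pi)[..., None] / dim_t
        out = np.stack((np.sin(p[..., 0::2]), np.cos(p[..., 1::2])), axis=-1)
        return out.reshape(*p.shape[:-1], -1)

    pos_x = embed_axis(xy[..., 0])
    pos_y = embed_axis(xy[..., 1])
    halves = (pos_y, pos_x) if order == "yx" else (pos_x, pos_y)
    return np.concatenate(halves, axis=-1)


def best_match(query_xy, H=8, W=8, q_order="yx", k_order="yx"):
    """Return the (row, col) grid cell whose key embedding scores highest, plus all scores."""
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Hypothetical key coordinates: grid positions normalized to (0, 1].
    grid_xy = np.stack(((cols + 1) / W, (rows + 1) / H), axis=-1)  # (H, W, 2)
    keys = sine_embed(grid_xy, order=k_order)                       # (H, W, 256)
    query = sine_embed(query_xy, order=q_order)                     # (256,)
    scores = keys @ query                                           # raw dot products, (H, W)
    cell = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
    return cell, scores
```

With the default 8×8 grid and a query at x=0.75, y=0.375, both matching layouts ("yx"/"yx" and "xy"/"xy") give identical scores and recover cell (row 2, column 5), while the mixed layout ("yx" query, "xy" keys) picks (row 5, column 2), because it compares the query's y with the key's x.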