Closed Andy1621 closed 2 years ago
Hi! In fact, the performance is already good enough when we simply preserve the placeholder tokens in DeiT. Following Occam's razor, we therefore do not use representative tokens for further updating. Further analyses are in Tab. 3 and the 'effectiveness of each module' paragraph. In addition, simply preserving the placeholder tokens can be regarded as the fastest updating strategy. We do use representative tokens for fast updating in LeViT; please refer to the corresponding code. Hope this helps!
In the original paper, there is a special token called the representative token, which is aggregated from the placeholder tokens. However, there is no corresponding implementation in your code.
In fact, you simply use `argsort` and select the top-k informative tokens, which is non-differentiable. I'm curious about the performance of using aggregated tokens and the differentiable top-k used in other papers. Hoping for your reply.
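
For readers following this thread, the two variants under discussion can be sketched roughly as follows. This is a framework-free illustration in plain Python, not the repository's code: the function name `select_tokens`, the `use_representative` flag, and the score-weighted mean used for aggregation are all assumptions made for the example, and the scores are assumed to be positive.

```python
def select_tokens(tokens, scores, k, use_representative=False):
    """Split tokens into the top-k informative ones and the placeholders.

    tokens: list of feature vectors (lists of floats), one per token.
    scores: list of positive importance scores, one per token
            (hypothetical input, e.g. class-attention values).
    """
    # "argsort" by descending score. The resulting indices are discrete,
    # so no gradient would flow through the selection step itself.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep_idx, rest_idx = order[:k], order[k:]

    informative = [tokens[i] for i in keep_idx]
    placeholders = [tokens[i] for i in rest_idx]

    # Variant 1: just preserve the placeholder tokens unchanged.
    if not use_representative or not placeholders:
        return informative, placeholders

    # Variant 2 (assumed aggregation rule): collapse the placeholders
    # into one representative token via a score-weighted mean.
    total = sum(scores[i] for i in rest_idx)
    dim = len(tokens[0])
    representative = [
        sum(scores[i] * tokens[i][d] for i in rest_idx) / total
        for d in range(dim)
    ]
    return informative, [representative]
```

With `use_representative=False` the placeholders are returned untouched, which corresponds to the "just preserve the placeholder tokens" behaviour described in the reply; with `use_representative=True` they are replaced by a single aggregated token. A differentiable top-k would replace the hard `sorted` step with a relaxed selection and is not shown here.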