YifanXu74 / Evo-ViT

Official implementation of Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
MIT License

The code does not match the pipeline in your paper #3

Closed Andy1621 closed 2 years ago

Andy1621 commented 2 years ago

In the original paper, there is a special token named representative token, which is aggregated by the placeholder tokens. However, there is no corresponding implementation in your code.

In fact, you simply use argsort and select the top-k informative tokens, which is non-differentiable.

# topk for slow update
x = x_[:, :N_ + 1] # L438
# simply copy for fast update
x = torch.cat((x, x_[:, N_ + 1:]), dim=1) # L473

I'm curious about the performance of aggregating tokens and of the differentiable top-k used in other papers. Looking forward to your reply.
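For clarity, the argsort-based selection being questioned can be sketched as follows. This is a minimal, hypothetical rewrite of the quoted lines, not the repository's actual code; the function and variable names are invented, and the hard `gather` on sorted indices is what makes the selection non-differentiable with respect to the scores.

```python
import torch

def split_tokens_by_score(x, scores, k):
    """Partition tokens into top-k informative (slow path) and the
    remaining placeholder tokens (fast path) by sorting on a per-token
    score. Hypothetical sketch: the hard index selection below carries
    no gradient back to `scores`."""
    B, N, C = x.shape
    order = scores.argsort(dim=1, descending=True)       # (B, N) token ranking
    idx = order.unsqueeze(-1).expand(-1, -1, C)          # (B, N, C) gather indices
    x_sorted = x.gather(dim=1, index=idx)                # tokens sorted by score
    slow, fast = x_sorted[:, :k], x_sorted[:, k:]        # top-k vs. the rest
    return slow, fast

# toy usage
x = torch.randn(2, 8, 16)
scores = torch.rand(2, 8)
slow, fast = split_tokens_by_score(x, scores, k=3)
```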

YifanXu74 commented 2 years ago

Hi! In fact, the performance is good enough when just preserving the placeholder tokens in DeiT. Thus, following the principle of Occam's razor, we do not use representative tokens for further updating there. Further analyses are in Tab. 3 and the 'effectiveness of each module' paragraph. In addition, just preserving the placeholder tokens can be treated as the fastest updating manner. We do use representative tokens for fast updating in LeViT; please refer to the corresponding code. Hope it is helpful to you!
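The representative-token idea mentioned for the LeViT variant can be illustrated with a short sketch. This is an illustrative assumption of the general technique (collapsing the fast-path placeholder tokens into one summary token via a score-weighted average), not a copy of the repository's LeViT implementation; all names here are hypothetical.

```python
import torch

def aggregate_representative(fast_tokens, fast_scores):
    """Hypothetical sketch of a representative token: collapse the
    placeholder (fast-path) tokens into a single token using a
    softmax-score-weighted average, so the slow path still sees a
    differentiable summary of the pruned tokens."""
    w = torch.softmax(fast_scores, dim=1).unsqueeze(-1)  # (B, M, 1) weights
    rep = (w * fast_tokens).sum(dim=1, keepdim=True)     # (B, 1, C) summary token
    return rep

# toy usage: append the representative token to the slow tokens
slow = torch.randn(2, 3, 16)
fast = torch.randn(2, 5, 16)
fast_scores = torch.rand(2, 5)
x_slow_path = torch.cat((slow, aggregate_representative(fast, fast_scores)), dim=1)
```

Because the weighted average keeps gradients flowing into both the fast tokens and their scores, this path stays differentiable, in contrast to the hard top-k selection discussed above.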