Thank you for such interesting work!

The Token Selection Transformer adopts the differentiable Top-k proposed in the paper "Differentiable Patch Selection for Image Recognition". However, in their paper, hard Top-k is applied at inference time. Have you tried that? Or is there a reason you don't apply hard Top-k as they do?

I ask because, theoretically, hard Top-k should be used during inference, just as Gumbel-Softmax is replaced by a hard argmax for Top-1 selection.
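For concreteness, here is a minimal PyTorch sketch of the pattern I mean (illustrative only, not this repo's code): a relaxed or straight-through sample during training, and a plain hard argmax at inference.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, requires_grad=True)

# Training: relaxed sample (hard=False), or straight-through sample
# (hard=True: one-hot in the forward pass, soft gradients in the backward pass).
y_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)
y_st = F.gumbel_softmax(logits, tau=1.0, hard=True)

# Inference: deterministic hard argmax; no relaxation or noise needed.
with torch.no_grad():
    y_infer = F.one_hot(logits.argmax(), num_classes=logits.numel()).float()
```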
Thanks for your attention. In our method we do not use hard Top-k because we think a soft Top-k works like an attention mechanism: it aggregates information from all tokens while producing fewer of them. However, we have not compared the performance of the two variants, so you are welcome to try both and see which works better.
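For illustration, here is a rough sketch of the distinction (the soft variant below is a crude iterative relaxation for demonstration only, not the exact perturbed-optimizer Top-k from the cited paper): hard Top-k returns the k best tokens exactly, while soft Top-k returns k softmax-weighted mixtures of all tokens, which is why it behaves like attention.

```python
import torch

def hard_topk(tokens, scores, k):
    # Hard Top-k: pick the k highest-scoring tokens exactly.
    # The ranking step is non-differentiable, hence typically inference-only.
    idx = scores.topk(k).indices        # (k,)
    return tokens[idx]                  # (k, D)

def soft_topk(tokens, scores, k, tau=1.0):
    # Soft Top-k: each of the k outputs is a softmax-weighted mixture of
    # ALL tokens, so it pools N tokens down to k like a small attention
    # layer, and stays fully differentiable.
    scores = scores.clone()
    outs = []
    for _ in range(k):
        w = torch.softmax(scores / tau, dim=0)   # (N,) attention-like weights
        outs.append(w @ tokens)                  # (D,) weighted mixture
        scores = scores - 1e4 * w                # suppress already-selected mass
    return torch.stack(outs)                     # (k, D)

tokens = torch.randn(16, 64)
scores = torch.randn(16)
print(hard_topk(tokens, scores, 4).shape)   # torch.Size([4, 64])
print(soft_topk(tokens, scores, 4).shape)   # torch.Size([4, 64])
```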