Thank you for such interesting work!

The Token Selection Transformer adopts the differentiable Top-k proposed in the paper "Differentiable Patch Selection for Image Recognition". However, in their paper, hard Top-k is applied at inference time. Have you tried that? Or is there a reason you don't apply hard Top-k as they do?

I ask because, theoretically, hard Top-k should be used during inference, just as Gumbel-Softmax is replaced by a hard argmax for Top-1 selection.
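For concreteness, here is a minimal PyTorch sketch of the pattern I mean (illustrative only, not this repo's code): a relaxed or straight-through sample during training, and a plain hard argmax at inference.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, requires_grad=True)

# Training: relaxed sample (hard=False), or straight-through sample
# (hard=True: one-hot in the forward pass, soft gradients in the backward pass).
y_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)
y_st = F.gumbel_softmax(logits, tau=1.0, hard=True)

# Inference: deterministic hard argmax; no relaxation or noise needed.
with torch.no_grad():
    y_infer = F.one_hot(logits.argmax(), num_classes=logits.numel()).float()
```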
Thanks for your attention. In our method we do not use hard Top-k because we think a soft Top-k works like an attention mechanism: it aggregates information from all tokens while producing fewer of them. However, we have not compared the performance of the two variants, so you are welcome to try both and see which works better.
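For illustration, here is a rough sketch of the distinction (the soft variant below is a crude iterative relaxation for demonstration only, not the exact perturbed-optimizer Top-k from the cited paper): hard Top-k returns the k best tokens exactly, while soft Top-k returns k softmax-weighted mixtures of all tokens, which is why it behaves like attention.

```python
import torch

def hard_topk(tokens, scores, k):
    # Hard Top-k: pick the k highest-scoring tokens exactly.
    # The ranking step is non-differentiable, hence typically inference-only.
    idx = scores.topk(k).indices        # (k,)
    return tokens[idx]                  # (k, D)

def soft_topk(tokens, scores, k, tau=1.0):
    # Soft Top-k: each of the k outputs is a softmax-weighted mixture of
    # ALL tokens, so it pools N tokens down to k like a small attention
    # layer, and stays fully differentiable.
    scores = scores.clone()
    outs = []
    for _ in range(k):
        w = torch.softmax(scores / tau, dim=0)   # (N,) attention-like weights
        outs.append(w @ tokens)                  # (D,) weighted mixture
        scores = scores - 1e4 * w                # suppress already-selected mass
    return torch.stack(outs)                     # (k, D)

tokens = torch.randn(16, 64)
scores = torch.randn(16)
print(hard_topk(tokens, scores, 4).shape)   # torch.Size([4, 64])
print(soft_topk(tokens, scores, 4).shape)   # torch.Size([4, 64])
```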