5g4s / paper


HIVIT: A SIMPLER AND MORE EFFICIENT DESIGN OF HIERARCHICAL VISION TRANSFORMER #25

Open 5g4s opened 1 year ago

5g4s commented 1 year ago

https://openreview.net/pdf?id=3F6I-0-57SC

5g4s commented 1 year ago

The paper makes three observations about Swin: (i) its performance gain comes mainly from a deepened backbone and relative positional encoding; (ii) its hierarchical design can be simplified into hierarchical patch embedding; (iii) other designs, such as shifted-window attention, can be removed. Removing these unnecessary operations yields a new architecture, HiViT, which is simpler and more efficient than Swin yet further improves on its performance.
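To make "hierarchical patch embedding" a bit more concrete, here is a rough sketch of how the token grid could shrink across stages when small patches are merged 2x2. The numbers (224-pixel input, 4-pixel patches, two merges) and the helper names `token_grid_sizes` and `merge_2x2` are assumptions for illustration only; they are not taken from this comment and should be checked against the paper.

```python
def token_grid_sizes(image_size=224, patch_size=4, num_merges=2):
    """Side length of the square token grid at each stage.

    All default values are illustrative assumptions, not figures
    quoted from the notes above.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    sizes = [side]
    for _ in range(num_merges):
        if side % 2 != 0:
            raise ValueError("grid side must be even to merge 2x2")
        side //= 2  # each merge halves the grid in both directions
        sizes.append(side)
    return sizes


def merge_2x2(grid):
    """Concatenate the features of every 2x2 block of tokens into one token.

    `grid` is a list of rows; each token is a list of floats.
    """
    height, width = len(grid), len(grid[0])
    if height % 2 != 0 or width % 2 != 0:
        raise ValueError("grid height and width must be even")
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # Order: top-left, top-right, bottom-left, bottom-right.
            row.append(grid[r][c] + grid[r][c + 1]
                       + grid[r + 1][c] + grid[r + 1][c + 1])
        merged.append(row)
    return merged


if __name__ == "__main__":
    print(token_grid_sizes())
    print(merge_2x2([[[1.0], [2.0]], [[3.0], [4.0]]]))
```

In a real model the concatenated features would normally pass through a learned linear projection after each merge; that step is left out here to keep the sketch dependency-free.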