Open 5g4s opened 1 year ago
(i) The performance gain of Swin is mainly brought by a deepened backbone and relative positional encoding. (ii) The hierarchical design of Swin can be simplified into hierarchical patch embedding. (iii) Other designs such as shifted-window attention can be removed. By removing the unnecessary operations, we come up with a new architecture named HiViT, which is simpler and more efficient than Swin yet further improves its performance.
https://openreview.net/pdf?id=3F6I-0-57SC
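To make point (ii) concrete, here is a minimal pure-Python sketch of what "hierarchical patch embedding" could look like in terms of token counts. It is not the authors' implementation. The 224-pixel input, the 4x4 base patch, and the two 2x2 merges are assumptions chosen for illustration, and the names `hierarchical_token_grid` and `merge_2x2` are hypothetical. The merge only concatenates neighbouring features; a real model would normally follow it with a learned linear projection, which is omitted here.

```python
def hierarchical_token_grid(image_size=224, base_patch=4, num_merges=2):
    """Return [(tokens_per_side, effective_patch_size), ...] for each stage.

    Assumed setup (not taken from the quoted text): small patches are
    embedded first and then merged 2x2 between stages.
    """
    final_patch = base_patch * 2 ** num_merges
    if image_size % final_patch != 0:
        raise ValueError("image_size must be divisible by the final patch size")
    stages = []
    patch = base_patch
    for _ in range(num_merges + 1):
        stages.append((image_size // patch, patch))
        patch *= 2  # each 2x2 merge doubles the effective patch size
    return stages


def merge_2x2(grid):
    """Merge each 2x2 block of tokens by concatenating their feature lists.

    `grid` is a list of rows; each token is a list of floats.
    The output has half the height and width and 4x the feature length.
    """
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("grid height and width must be even")
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # Concatenate top-left, top-right, bottom-left, bottom-right.
            row.append(grid[r][c] + grid[r][c + 1]
                       + grid[r + 1][c] + grid[r + 1][c + 1])
        merged.append(row)
    return merged


if __name__ == "__main__":
    for i, (side, patch) in enumerate(hierarchical_token_grid(), start=1):
        print(f"stage {i}: {side}x{side} tokens, effective patch {patch}x{patch}")
```

Under these assumptions the last stage has a 14x14 token grid, the same as a plain ViT with 16x16 patches on a 224-pixel image, which is one way to see why global attention could be affordable there without shifted windows.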