Traffic-X / ViT-CoMer

Official implementation of the CVPR 2024 paper ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions.
Apache License 2.0

How to combine Comer and Swin Transformer? #11

Closed by wujians122 5 months ago

fyting commented 5 months ago

ViT-CoMer uses features at multiple scales from the CNN branch: C3 = /8, C4 = /16, and C5 = /32 (that is, 1/8, 1/16, and 1/32 of the input resolution). In a plain ViT, the feature map produced at every stage is at a scale of /16, so in CTI it is added to C4 from the CNN. Swin differs from ViT in that its feature maps are not all at /16. Like the CNN, Swin produces S3 = /8, S4 = /16, and S5 = /32. Accordingly, in CTI, S3 is added to C3, S4 is added to C4, and S5 is added to C5.
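The scale matching can be sketched at the level of strides and shapes. This is only an illustration, not code from the ViT-CoMer repository: the helper names `feature_shape` and `pair_by_stride`, the stage labels, and the 512x512 input size are all assumptions made for the example.

```python
# Hypothetical sketch: pair transformer feature maps with CNN feature maps
# that share the same stride. Names are illustrative, not from ViT-CoMer.

# Strides of the CNN features as described above.
CNN_STRIDES = {"C3": 8, "C4": 16, "C5": 32}


def feature_shape(height, width, stride):
    """Spatial size of a feature map at a given stride.

    Assumes the input size is divisible by the stride.
    """
    return (height // stride, width // stride)


def pair_by_stride(transformer_strides, cnn_strides=CNN_STRIDES):
    """Map each transformer feature to the CNN feature with the same stride."""
    by_stride = {stride: name for name, stride in cnn_strides.items()}
    return {name: by_stride[stride] for name, stride in transformer_strides.items()}


# Plain ViT: every stage is at /16, so every stage pairs with C4.
# (The stage labels here are placeholders.)
vit_pairs = pair_by_stride({"stage_a": 16, "stage_b": 16})

# Swin: each of S3, S4, S5 pairs with the CNN feature at the same scale.
swin_pairs = pair_by_stride({"S3": 8, "S4": 16, "S5": 32})

# Matching strides means matching spatial sizes, so the two maps can be
# added element-wise (assuming channel dimensions are also aligned).
cnn_shapes = {name: feature_shape(512, 512, s) for name, s in CNN_STRIDES.items()}

print(vit_pairs)
print(swin_pairs)
print(cnn_shapes)
```

Because paired features share a stride, their spatial sizes agree for any input size divisible by 32. Channel widths may still differ between the two branches, and how to align them is not covered in this thread.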