Traffic-X / ViT-CoMer

Official implementation of the CVPR 2024 paper ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions.
Apache License 2.0

About Feature Outputs #12

Closed by Flash-Alita 6 months ago

Flash-Alita commented 6 months ago

Hi there. Could I ask why the model returns four features, [f1, f2, f3, f4], and how these features are used in the end? For example, do you plan to flatten them and apply a SoftMax, as ViTs do? I was also looking for the network architecture but couldn't find it. Could you share it? Many thanks.

clxia12 commented 6 months ago

ViT-CoMer is a multi-scale backbone, so it is used the same way as other multi-scale backbones such as Swin. You can find the network structure in vit_comer.py.
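
To make "multi-scale backbone" concrete, here is a minimal sketch of the spatial sizes that four such feature maps would have. It assumes the usual Swin-style pyramid with strides 4, 8, 16, and 32 relative to the input. The strides and the helper name pyramid_shapes are assumptions for illustration, not values read from vit_comer.py.

```python
from typing import List, Sequence, Tuple

# Assumed strides of a Swin-style feature pyramid (hypothetical here;
# check vit_comer.py for the values the repository actually uses).
ASSUMED_STRIDES = (4, 8, 16, 32)


def pyramid_shapes(
    height: int, width: int, strides: Sequence[int] = ASSUMED_STRIDES
) -> List[Tuple[int, int]]:
    """Return the (H, W) of each feature map for a height x width input."""
    return [(height // s, width // s) for s in strides]


if __name__ == "__main__":
    # One spatial size per feature map, from finest (f1) to coarsest (f4).
    for name, shape in zip(["f1", "f2", "f3", "f4"], pyramid_shapes(512, 512)):
        print(name, shape)
```

In dense prediction setups, maps like these are typically handed to a detection or segmentation neck/head (for example an FPN-style module) rather than flattened into a single SoftMax classifier. This is consistent with the paper's focus on dense predictions.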