baaivision / EVA

EVA Series: Visual Representation Fantasies from BAAI
MIT License

Why no ViT-Adapter for semantic segmentation on ADE20K for EVA02? #96

Open tommiekerssies opened 1 year ago

tommiekerssies commented 1 year ago

Title says it all

Yuxin-CV commented 1 year ago

We found that using ViT-Adapter degrades performance on EVA-02.

tommiekerssies commented 1 year ago

Very interesting, do you have any intuition why that may be?

function2-llx commented 1 year ago

@Yuxin-CV Hello, I have a question related to the application of EVA-02 for semantic segmentation. Since ViT-Adapter is not used, does this imply that all feature maps received by the task layer are at 1/16 of the original resolution? Similarly, is the output of the task layer (prior to final interpolation) also at 1/16 of the original resolution? Or is there any technique employed to obtain hierarchical feature maps from the backbone for semantic segmentation? I couldn't find explicit details in the EVA-02 paper. Thank you.

UPDATE: I found the answer in the code: https://github.com/baaivision/EVA/blob/7389aeeec97c056fc8424fa6b78f35c6f1b07d0d/EVA-02/seg/backbone/eva2.py#L610-L623
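
For readers who do not follow the link, the resolutions asked about above can be worked out with a little arithmetic. The sketch below assumes a pattern that is common for plain ViT backbones in segmentation codebases (for example BEiT's): the single 1/16 feature map is upsampled with stride-2 transposed convolutions and downsampled with stride-2 max pooling to give 1/4, 1/8, 1/16 and 1/32 maps. Whether the linked lines do exactly this should be checked against the code itself. The function names are hypothetical; the size formulas are the standard ones for these layer types.

```python
# Assumption: a simple four-level pyramid built from one 1/16 feature map.
# This is a shape calculation only, not the EVA-02 implementation.

def conv_transpose_out(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def max_pool_out(size, kernel=2, stride=2, padding=0):
    """Output length of a max-pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def pyramid_sizes(image_size, patch_size=16):
    """Map each assumed output stride to the side length of its feature map."""
    base = image_size // patch_size   # plain ViT output: 1/16 of the input
    up2 = conv_transpose_out(base)    # one transposed conv: 1/8
    up4 = conv_transpose_out(up2)     # two transposed convs: 1/4
    down2 = max_pool_out(base)        # one max pool: 1/32
    return {4: up4, 8: up2, 16: base, 32: down2}


if __name__ == "__main__":
    for stride, side in pyramid_sizes(512).items():
        print(f"stride {stride:>2}: {side} x {side}")
```

Under these assumptions the task layer would receive hierarchical maps rather than only 1/16-resolution features, which is the situation the question above was asking about.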

tommiekerssies commented 1 year ago

@function2-llx Great, thank you for sharing!

tommiekerssies commented 1 year ago

I wonder if a LayerNorm would also work here or if it has to be a BatchNorm. Is there literature on this?