KU-CVLAB / CAT-Seg

Official Implementation of "CAT-Seg🐱: Cost Aggregation for Open-Vocabulary Semantic Segmentation"
https://ku-cvlab.github.io/CAT-Seg/
MIT License

More details about Feature Agg #21

Closed lchen1019 closed 4 months ago

lchen1019 commented 5 months ago

I noticed that Feature Agg and Cost Agg were compared in the paper, and Cost Agg performed better. The paper says that the difference between them lies in the features used during aggregation. Does that mean Feature Agg only uses the highest-level features, while Cost Agg also uses an additional cost?

"For both of baseline architectures, we simply apply the upsampling decoder and note that both methods share most of the architecture, but differ in whether they aggregate the concatenated features or aggregate the cosine similarity between image and text embeddings of CLIP." from your papers.

Thanks in advance!

hsshin98 commented 4 months ago

Hi, both feature and cost aggregation mainly use the highest-layer features. The difference is whether we aggregate the "feature volume", which is obtained by concatenating the image and text features of CLIP (512+512=1024-dim), or the "cost volume", which is obtained from the exact same features but with a different operation, namely cosine similarity (1-dim). In addition, both baselines use intermediate CLIP features for the upsampling decoder, so they do not differ in what level of features they use.
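To make the distinction concrete, here is a minimal PyTorch sketch of the two volumes described above. The shapes (24x24 spatial grid, 171 classes) and tensor names are illustrative assumptions, not the repository's actual code; the point is only that concatenation yields a 2D-channel feature volume while cosine similarity yields a 1-channel cost volume over the same features.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration: CLIP ViT-B/16 gives 512-dim embeddings.
B, H, W, D = 2, 24, 24, 512   # batch, spatial grid, embedding dim
T = 171                       # number of class (text) prompts

img_feat = torch.randn(B, H, W, D)   # dense CLIP image features
txt_feat = torch.randn(T, D)         # CLIP text embeddings, one per class

# "Feature volume": concatenate image and text features,
# giving 512+512 = 1024 channels per (pixel, class) pair.
feat_vol = torch.cat(
    [img_feat.unsqueeze(3).expand(B, H, W, T, D),
     txt_feat.view(1, 1, 1, T, D).expand(B, H, W, T, D)],
    dim=-1)                          # shape (B, H, W, T, 2*D)

# "Cost volume": cosine similarity between the exact same features,
# giving a single scalar per (pixel, class) pair.
cost_vol = torch.einsum(
    'bhwd,td->bhwt',
    F.normalize(img_feat, dim=-1),
    F.normalize(txt_feat, dim=-1))   # shape (B, H, W, T)

print(feat_vol.shape)  # torch.Size([2, 24, 24, 171, 1024])
print(cost_vol.shape)  # torch.Size([2, 24, 24, 171])
```

Either volume can then be fed to the same aggregation/upsampling decoder; only the per-(pixel, class) channel count differs.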