MendelXu / SAN

Open-vocabulary Semantic Segmentation
https://mendelxu.github.io/SAN/
MIT License
295 stars 27 forks

How should the FeatureExtractor class be modified to integrate the SAN method with a CNN backbone? #38

Closed APeiZou closed 9 months ago

APeiZou commented 9 months ago

@MendelXu Hello, how should the FeatureExtractor class be modified to integrate the SAN method with a CNN backbone?

APeiZou commented 9 months ago

@MendelXu The SAN method probably can't use a CNN-based backbone, can it?

MendelXu commented 9 months ago

With a CNN it is even simpler: just use the bias directly as spatial attention: X = X.dot(Bias).

hollow-503 commented 9 months ago

With a CNN it is even simpler: just use the bias directly as spatial attention: X = X.dot(Bias).

Have you ever tried using a CNN-based backbone like ResNet50? Will there be a precipitous drop in performance when replacing ViT-B/16 with ResNet50?

Many methods use ResNet50 pre-trained by CLIP for zero-shot detection or segmentation, but some of them do not release a ViT version. I'm curious about a fair comparison between SAN and those methods with a ResNet50 backbone.

MendelXu commented 9 months ago

No, we didn't try it, because there are not as many CNN-based models as there are ViTs (from Base to Huge), which makes it harder to test scaling laws.

APeiZou commented 9 months ago

With a CNN it is even simpler: just use the bias directly as spatial attention: X = X.dot(Bias).

@MendelXu Hello, could you explain this in more detail? I haven't quite worked it out.

MendelXu commented 9 months ago
  1. As with ViT, extract features from different stages of the CNN (e.g., res2, res3, res4, res5 of a ResNet);
  2. Design a reasonable side network;
  3. Generate an attn bias from the side network for the CNN's head (e.g., the global average pooling of a ResNet), as in the sketch below.
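
Put together, the three steps could look roughly like the following toy sketch. The `TinySideNetwork` is a placeholder of this sketch, not released SAN code, and all shapes are illustrative assumptions; a real side network would fuse all four stages rather than only res5.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Step 1: multi-stage features from a ResNet, analogous to ViT layers.
backbone = resnet50()
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "res2", "layer2": "res3",
                  "layer3": "res4", "layer4": "res5"},
)

# Step 2: a placeholder side network (assumed design) that maps backbone
# features to per-query spatial attention logits.
class TinySideNetwork(torch.nn.Module):
    def __init__(self, in_ch: int = 2048, num_queries: int = 100):
        super().__init__()
        self.proj = torch.nn.Conv2d(in_ch, num_queries, kernel_size=1)

    def forward(self, feats: dict) -> torch.Tensor:
        # A real side network would fuse res2..res5; this uses res5 only.
        return self.proj(feats["res5"])  # (B, Q, H, W)

side = TinySideNetwork()
img = torch.randn(2, 3, 224, 224)
feats = extractor(img)
bias = side(feats)

# Step 3: use the bias as spatial attention in place of plain global
# average pooling over res5, i.e. X = X.dot(Bias).
w = bias.flatten(2).softmax(dim=-1)          # (B, Q, H*W)
x = feats["res5"].flatten(2)                 # (B, C, H*W)
per_query = torch.bmm(w, x.transpose(1, 2))  # (B, Q, C): one feature per query
```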
APeiZou commented 9 months ago

@MendelXu, thanks a lot

hollow-503 commented 9 months ago
  1. As with ViT, extract features from different stages of the CNN (e.g., res2, res3, res4, res5 of a ResNet);
  2. Design a reasonable side network;
  3. Generate an attn bias from the side network for the CNN's head (e.g., the global average pooling of a ResNet).

I understand that in SAN, with the ViT-B config, you use the first 9 layers of the ViT as the feature extractor and the last three layers as the head. In your experience, how should the head be designed when the feature extractor is replaced with a CNN? Should we use a traditional CNN head like DeepLabV3Head, or keep your previous design, with the first N layers of the ResNet as the feature extractor and the last M layers as the head?

hollow-503 commented 9 months ago

Given the differences between ResNet and ViT, I suspect that using the last M layers of the ResNet directly as the head would not produce sensible enough masks.

MendelXu commented 9 months ago

Given the differences between ResNet and ViT, I suspect that using the last M layers of the ResNet directly as the head would not produce sensible enough masks.

I haven't thought about it deeply, but a plain baseline could be to use only the classifier head (in CLIP it is an attentional pooling layer; in other models it could be global average pooling + a linear projection). Adding a more complex module like ASPP to the head might improve its fitting ability, but I am not sure whether it would degrade the generalization ability.
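
A sketch of that classifier-head-only baseline, with the side network's bias replacing the uniform spatial average; the `BiasedPoolingHead` name and all shapes are assumptions of this sketch, not the released code:

```python
import torch

class BiasedPoolingHead(torch.nn.Module):
    """Global average pooling + linear projection; an attention bias,
    if given, replaces the uniform spatial average."""
    def __init__(self, in_ch: int = 2048, embed_dim: int = 512):
        super().__init__()
        self.proj = torch.nn.Linear(in_ch, embed_dim)

    def forward(self, feat, attn_bias=None):
        x = feat.flatten(2)                                # (B, C, H*W)
        if attn_bias is None:                              # plain GAP
            pooled = x.mean(dim=-1, keepdim=True).transpose(1, 2)
        else:                                              # biased pooling
            w = attn_bias.flatten(2).softmax(dim=-1)       # (B, Q, H*W)
            pooled = torch.bmm(w, x.transpose(1, 2))       # (B, Q, C)
        return self.proj(pooled)                           # per-query embeddings

head = BiasedPoolingHead()
feat = torch.randn(2, 2048, 7, 7)    # res5-like feature map
bias = torch.randn(2, 100, 7, 7)     # side-network attention logits
print(head(feat, bias).shape)        # torch.Size([2, 100, 512])
```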

hollow-503 commented 9 months ago
  1. As with ViT, extract features from different stages of the CNN (e.g., res2, res3, res4, res5 of a ResNet);
  2. Design a reasonable side network;
  3. Generate an attn bias from the side network for the CNN's head (e.g., the global average pooling of a ResNet).

Thank you for your quick reply. When you mention using the attn bias in the avg pooling, do you mean the attentional avg pooling after layer4 in CLIP? I plan to use layer4 plus the attentional avg pooling of CLIP's ModifiedResNet as the AttnbiasHead in SAN. However, layer4 in the CLIP implementation does not use attentional avg pooling, only naive avg pooling. If we use layer4 plus the attentional avg pooling as the head, then only one layer (the attentional avg pooling) can make use of the attn bias from the side adapter network; would that be inappropriate for SAN? Or would it be better to use the single attentional pooling layer alone as the AttnbiasHead?

MendelXu commented 9 months ago

I am not sure whether it is enough to use only the last layer, but I think you could give it a try. In fact, even in the transformer-based CLIP model, using only the last layer as the AttnbiasHead does not create a very big performance gap (I forget the exact number, but it should be less than 2 mIoU on ADE150).
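
For reference, the usual way to inject such a bias into an attentional pooling layer is to add it to the attention logits before the softmax. A simplified sketch (requires PyTorch 2.0+ for `scaled_dot_product_attention`; CLIP's actual AttentionPool2d also has learned projections and positional embeddings, omitted here, and all shapes below are assumptions):

```python
import torch
import torch.nn.functional as F

B, C, H, W = 2, 2048, 7, 7               # res5-like feature map
Q = 100                                   # number of mask queries (assumed)

feat = torch.randn(B, C, H, W)
tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values
queries = torch.randn(B, Q, C)            # per-mask query tokens (assumed)
attn_bias = torch.randn(B, Q, H * W)      # logits from the side network

# A float attn_mask is added to the attention scores before the softmax,
# so each query attends mostly inside its own mask region.
pooled = F.scaled_dot_product_attention(
    queries, tokens, tokens, attn_mask=attn_bias)  # (B, Q, C)
```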

hollow-503 commented 9 months ago

Thanks for your insights. I will try using the attn bias generated from the last layer of the side adapter network in the attentional avg pooling.

xiaollu commented 8 months ago

Thanks for the great work! I was wondering whether it is possible to generate an attn_bias or attn_mask from the GT masks. Or, how can I extract dense image features based on GT masks? Thank you!

MendelXu commented 8 months ago

Do you mean extracting the features of the region within the GT mask, or just pixel-level features?

xiaollu commented 8 months ago

Sorry for the confusing phrasing. Actually, I want to get the features of the [SLS] token for a GT binary mask. Given a binary mask, can I get the classification score for the foreground? I was thinking of generating an attn_bias from the binary mask and thereby obtaining the corresponding [SLS] token, but I am not sure whether that is possible.

MendelXu commented 8 months ago

It is possible. But the results may be slightly worse.
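
One straightforward construction (an assumed recipe for this sketch, not necessarily what the authors used) is the usual masked-attention bias: zero inside the mask and a large negative value outside, so that after the softmax the attention is confined to the region:

```python
import torch

def mask_to_attn_bias(mask: torch.Tensor, neg: float = -100.0) -> torch.Tensor:
    """mask: (B, Q, H, W) binary, 1 = foreground. Returns (B, Q, H*W),
    to be added to the attention logits before the softmax."""
    bias = torch.zeros(mask.shape, dtype=torch.float32)
    bias[mask == 0] = neg
    return bias.flatten(2)

gt_masks = (torch.rand(2, 10, 7, 7) > 0.5).long()  # toy GT binary masks
attn_bias = mask_to_attn_bias(gt_masks)            # shape (2, 10, 49)
```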

xiaollu commented 7 months ago

Thanks for your kind response. Do you have any insights about generating an attn_bias from ground-truth masks? I was thinking of an ideal scenario with perfect mask proposals and would like to see what the classification performance would be.

MendelXu commented 7 months ago

Sorry, I have no idea what the perfect attention bias for each mask would be, but I can share a minimal case I have tried. GT mask used for masked attention: [image]

Learned attention bias: [image]

xiaollu commented 7 months ago

Alright, thanks for your kind response :) All the best!