@MendelXu The SAN method probably can't use a CNN-based backbone, right?
A CNN is actually simpler: just use the bias as spatial attention, i.e. X = X.dot(Bias).
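One minimal reading of this suggestion (a sketch, not SAN's actual code; the function name and shapes are illustrative assumptions) is to softmax the side-network bias over spatial positions and use it to pool the CNN feature map:

```python
import torch

def apply_spatial_bias(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """x: CNN feature map (B, C, H, W); bias: logits (B, H, W), e.g. from a side network."""
    b, c, h, w = x.shape
    attn = torch.softmax(bias.view(b, -1), dim=-1)    # normalize over spatial positions
    x_flat = x.view(b, c, -1)                         # (B, C, H*W)
    pooled = torch.einsum("bcn,bn->bc", x_flat, attn) # bias-weighted average pooling
    return pooled

feat = torch.randn(2, 256, 7, 7)
bias = torch.randn(2, 7, 7)
print(apply_spatial_bias(feat, bias).shape)  # torch.Size([2, 256])
```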
Have you ever tried using a CNN-based backbone like ResNet50? Will there be a precipitous drop in performance when replacing ViT-B/16 with ResNet50? Many methods use a CLIP-pretrained ResNet50 for zero-shot detection or segmentation, but some of them do not release a ViT version, so I'm curious about a fair comparison between SAN and other methods on the ResNet50 version.
No. We didn't try it, because there are not as many CNN-based models as ViT models (from Base to Huge), which makes it harder to test the scaling law.
A CNN is actually simpler: just use the bias as spatial attention, i.e. X = X.dot(Bias).
@MendelXu Hi, could you explain this in more detail? I haven't quite figured it out.
@MendelXu, thanks a lot
- Similar to the ViT case, extract features from different CNN stages (e.g. ResNet's res2, res3, res4, res5);
- design a reasonable side network;
- generate an attn bias from the side network and feed it to the CNN's head (e.g. ResNet's global avg pooling).
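The three steps above could be sketched roughly as follows. This is purely illustrative: plain conv blocks stand in for ResNet's res2..res5, and the side network and bias head are placeholder modules I made up, not SAN's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideBiasSketch(nn.Module):
    """Toy sketch: multi-stage CNN features -> tiny side network -> attn bias."""
    def __init__(self, widths=(64, 128, 256, 512), num_queries=4):
        super().__init__()
        chans = [3] + list(widths)
        # stand-ins for res2..res5: each stage halves resolution, widens channels
        self.stages = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1)
            for i in range(4)
        )
        # side network: project every stage to a common width, then fuse
        self.proj = nn.ModuleList(nn.Conv2d(c, 32, 1) for c in widths)
        self.bias_head = nn.Conv2d(32, num_queries, 1)  # one bias map per query

    def forward(self, x):
        feats = []
        for stage in self.stages:                 # step 1: multi-stage features
            x = F.relu(stage(x))
            feats.append(x)
        h, w = feats[-1].shape[-2:]
        fused = sum(F.adaptive_avg_pool2d(p(f), (h, w))  # step 2: side network
                    for p, f in zip(self.proj, feats))
        attn_bias = self.bias_head(fused)         # step 3: bias for the CNN head
        return feats[-1], attn_bias

feat, bias = SideBiasSketch()(torch.randn(1, 3, 64, 64))
print(feat.shape, bias.shape)  # torch.Size([1, 512, 4, 4]) torch.Size([1, 4, 4, 4])
```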
I understand that in SAN, in the ViT-B config, you use the first 9 layers of the ViT as the feature extractor and the last three layers as the head. In your experience, how should the head be designed when the feature extractor is replaced with a CNN? Should we use a traditional CNN head like DeepLabV3Head, or is it the same as your previous design: the first N layers of the ResNet as the feature extractor and the last M layers as the head?
Due to the differences between ResNet and ViT, I suspect that using the last M layers of ResNet directly as the head would not be able to output a sensible enough mask.
I haven't thought about it deeply, but I think a plain baseline could be using only the classifier head (in CLIP, it is an attentional pooling layer; in other models, it could be global average pooling + a linear projection) as the head. Adding a more complex module like ASPP to the head might improve its fitting ability, but I am not sure whether it would degrade the generalization ability.
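A rough sketch of this "classifier head only" baseline, assuming a single attentional-pooling layer whose logits are shifted by an external attention bias. This mirrors CLIP-style attentional pooling in spirit only; the class name and single-head simplification are my own assumptions.

```python
import torch
import torch.nn as nn

class BiasedAttnPool(nn.Module):
    """Single-query attentional pooling that accepts an additive attn bias."""
    def __init__(self, dim, embed_dim):
        super().__init__()
        self.q = nn.Parameter(torch.randn(1, 1, dim))   # learned pooling query
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, embed_dim)

    def forward(self, x, attn_bias=None):
        # x: (B, N, C) flattened spatial features; attn_bias: (B, 1, N) logits
        logits = self.q @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        if attn_bias is not None:
            logits = logits + attn_bias                 # inject side-network bias
        attn = logits.softmax(dim=-1)
        return (attn @ self.v(x)).squeeze(1)            # (B, embed_dim)

pool = BiasedAttnPool(dim=256, embed_dim=512)
out = pool(torch.randn(2, 49, 256), attn_bias=torch.randn(2, 1, 49))
print(out.shape)  # torch.Size([2, 512])
```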
- Similar to the ViT case, extract features from different CNN stages (e.g. ResNet's res2, res3, res4, res5);
- design a reasonable side network;
- generate an attn bias from the side network and feed it to the CNN's head (e.g. ResNet's global avg pooling).
Thank you for your quick reply. Also, when you mention using the attn bias in avg pooling here, do you mean the attn avg pooling after layer4 in CLIP? I plan to use layer4 and the attn avg pooling of CLIP's ModifiedResNet as the AttnbiasHead in SAN. However, layer4 in the CLIP implementation doesn't use attn avg pooling, but naive avg pooling. If we use layer4 and attn avg pooling as the head, then only one layer (the attn avg pooling) can make use of the attn bias from the side adapter network; would this be inappropriate in SAN? Or would just using the single attentional pooling layer as the AttnbiasHead be better?
I am not sure whether it is enough to only use the last layer, but I think you could give it a try. In fact, even in the transformer-based CLIP model, using only the last layer as the AttnbiasHead does not bring a very big performance gap (I forgot the concrete number, but it should be less than 2 mIoU on ADE150).
Thanks for your insights. I will try using the attn bias, generated from the last layer of the side adapter network, in the attn avg pooling.
Thanks for the great work! I was wondering if it is possible to generate an attn_bias or attn_mask based on the GT masks. Or how can I extract the dense features of images based on GT masks? Thank you!
Do you mean extracting the features of the region within the GT mask, or just pixel-level features?
Sorry for my confusing wording. Actually, I hope to get the features of the [SLS] token for a GT binary mask. If I have a binary mask, can I get the classification score for the foreground? I was thinking of generating an attn_bias from the binary mask and thus getting the corresponding [SLS] token, but I am not sure if it is possible.
It is possible. But the results may be slightly worse.
Thanks for your kind response. Do you have some insights about generating attn_bias from ground truth masks? I was thinking of an ideal scenario with perfect mask proposals and would like to see how the classification performance would be.
Sorry, I have no idea what the perfect attention bias for each mask would be, but I can provide a minimal case I have tried. GT Mask for Masked Attention: [image]
Learned Attention Bias: [image]
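The masked-attention case (an attention bias derived directly from a GT binary mask) can be sketched as follows: positions outside the mask get a large negative logit so the softmax ignores them. The function name, the -1e4 constant, and the shapes are my own illustrative assumptions.

```python
import torch

def mask_to_attn_bias(mask: torch.Tensor, neg: float = -1e4) -> torch.Tensor:
    """mask: (B, H, W) binary GT mask -> (B, 1, H*W) additive attention bias."""
    bias = torch.full_like(mask, neg, dtype=torch.float)
    bias[mask.bool()] = 0.0            # leave logits unchanged inside the mask
    return bias.flatten(1).unsqueeze(1)

mask = torch.zeros(1, 4, 4)
mask[0, 1:3, 1:3] = 1                  # foreground region
attn = mask_to_attn_bias(mask).softmax(dim=-1)
# after softmax, attention mass is uniform inside the mask and ~0 outside
```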
Alright, thanks for your kind response :) All the best!
@MendelXu Hi, how should the FeatureExtractor class be modified to integrate a CNN backbone into the SAN method?