HVision-NKU / CamoFormer

MIT License

about B-TA #9

Open Qiublack opened 1 year ago

Qiublack commented 1 year ago

Thank you for your excellent work. I have some confusion about B-TA: the background map, generated in the paper by subtraction, is multiplied with Q and K. The object region is 0 in the background map but 1 in Q and K, so wouldn't the multiplication make the generated feature map all 0?

Lwt-diamond commented 1 year ago

Hello, in Section 3.2 of the paper, which introduces Mask Generation, the authors say that the elements of the mask lie in the range 0 to 1. I hope this answers your question.

Qiublack commented 1 year ago

Hello, in Section 3.2 of the paper, which introduces Mask Generation, the authors say that the elements of the mask lie in the range 0 to 1. I hope this answers your question. (In Mask Generation the authors say a 3x3 convolution is used to generate the masks, but I don't seem to see it in the code.)

The point of confusion for me is this: in the background mask the object is 0 and the background is 1, which is the opposite of the feature map. The multiplication therefore yields all zeros and does nothing to enhance the features.
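The concern can be made concrete with a toy example (a hypothetical sketch in NumPy, not the repo's code): if the background mask were the exact binary complement of the feature's foreground support, their element-wise product would vanish everywhere.

```python
import numpy as np

# Toy 1-D "feature map": foreground pixels carry signal, background is 0.
feature = np.array([0.0, 0.8, 0.9, 0.0])   # foreground at indices 1, 2

# Hard binary background mask: 1 on background, 0 on foreground.
bg_mask = np.array([1.0, 0.0, 0.0, 1.0])

# Element-wise product: every position is zeroed by one factor or the other.
print(bg_mask * feature)   # -> [0. 0. 0. 0.]
```

In practice the mask comes from a sigmoid, so its elements are soft values in (0, 1) rather than hard 0/1, and the product attenuates rather than exactly zeroes the features; that is the point the reply below leans on.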

Lwt-diamond commented 1 year ago

The authors take the self-generated feature maps (E5, D4, D3, D2) and convert the number of channels from 128 to 1 with a 3x3 convolution. The elements of the resulting mask map lie in the range 0 to 1. You can see this in the code in decoder_p.py.
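A minimal sketch of what such a mask head might look like (pure NumPy; the function names, kernel values, and feature sizes are mine, not taken from decoder_p.py): a naive 3x3 convolution collapses 128 channels to 1, and a sigmoid squashes the logits into (0, 1).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_to_1(feat, weight, bias=0.0):
    """Naive 3x3 convolution collapsing C channels to 1 (stride 1, zero pad).

    feat:   (C, H, W) feature map
    weight: (C, 3, 3) kernel
    """
    c, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)))
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + 3, j:j + 3] * weight) + bias
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((128, 8, 8))       # stand-in for a decoder feature
weight = rng.standard_normal((128, 3, 3)) * 0.01

mask = sigmoid(conv3x3_to_1(feat, weight))    # every element lies in (0, 1)
print(mask.shape)                             # (8, 8): one-channel mask map
```

Because the sigmoid never outputs exactly 0 or 1, the "background" mask is soft, which is why the element-wise products discussed above do not literally vanish.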

Qiublack commented 1 year ago

The authors take the self-generated feature maps (E5, D4, D3, D2) and convert the number of channels from 128 to 1 with a 3x3 convolution. The elements of the resulting mask map lie in the range 0 to 1. You can see this in the code in decoder_p.py.

I know how the mask is generated. What I want to know is why the background mask can be multiplied with Q and K; that still doesn't quite make sense to me.

SilentWhiteRabbit commented 5 months ago

I think your question is a valuable one. Judging from the authors' model diagram, the mask here should be 1 on the foreground and 0 on the background, so the QK matrix multiplication attends only to the correlations among foreground pixels (if the foreground were 0 and the background 1, it would attend only to the interrelationships of the background; either way the effect is about the same). The important point, however, is that in the QK product either the foreground pixels or the background pixels are all 0. This is fatal, because when QK is multiplied by V, either the foreground or the background of the result is all 0. The output mask then merely computes relations among pixels within the foreground of the input mask, which is illogical for a coarse-to-fine process.
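The effect described above can be sketched with a toy masked attention over four "pixels" (a hypothetical NumPy illustration of the argument, not the CamoFormer implementation): with a hard binary foreground mask, the rows and columns of the QK score matrix belonging to background pixels are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                                  # 4 pixels, 8-dim embeddings
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Hard foreground mask: foreground = 1 at pixels 1 and 2, background = 0.
m = np.array([0.0, 1.0, 1.0, 0.0])[:, None]

# Masked attention scores: background rows AND columns become exactly 0.
scores = (m * Q) @ (m * K).T / np.sqrt(d)

attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = attn @ V
# Background rows of `scores` are all-zero, so after softmax those rows
# collapse to uniform weights: the attention only encodes relations among
# foreground pixels, as argued above.
```

With the soft sigmoid mask used in practice, the background entries are attenuated rather than exactly zeroed, but the qualitative behavior is the same.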

SilentWhiteRabbit commented 5 months ago


Also, supposing this logic does hold: with a ResNet-50 backbone at 384^2 resolution, it reaches a result of 0.29 on COD10K, a score that puts a clear gap between it and the other models on this metric. I very much look forward to the full code being released.