Closed — iumyx2612 closed this issue 1 year ago
Thanks for the question. Personally speaking, this is correct. However, the model you describe seems difficult to optimize in my experience, and hence yields worse performance than using convolutions only. But I am not sure about that.
The model I described was almost the same as yours in the paper. So does that mean the conv modulation module in Conv2Former does indeed "adapt to the input"? The statement "The difference is that the convolutional kernels are static while the attention matrix generated by self-attention can adapt to the input" feels like you're denying it. Sorry if I misinterpreted your statement.
@houqb Hello, any updates on this?
What I said about the kernels being static is true. If you view the feature maps generated by the convs as the kernels of conv modulation, then that might be true as well.
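To make this view concrete, here is a minimal PyTorch sketch of a Conv2Former-style conv modulation block. This is not the authors' code; the layer sizes (an 11×11 depthwise convolution and 1×1 projections) are assumptions based on the design described in the paper. The point is that the modulation map `A` is itself a feature map computed from the input, which is what "viewing the feature maps as kernels" refers to:

```python
import torch
import torch.nn as nn

class ConvMod(nn.Module):
    """Sketch of convolutional modulation (hypothetical, not the official code)."""
    def __init__(self, dim, kernel_size=11):
        super().__init__()
        # Branch that produces the modulation map A from the input.
        self.a = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.GELU(),
            # Depthwise large-kernel conv (groups=dim makes it depthwise).
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
        )
        self.v = nn.Conv2d(dim, dim, 1)     # value branch
        self.proj = nn.Conv2d(dim, dim, 1)  # output projection

    def forward(self, x):
        A = self.a(x)   # modulation weights, computed from the input x
        V = self.v(x)
        # Hadamard product takes the place of attn @ V in self-attention.
        return self.proj(A * V)
```

Since `A` depends on `x`, the effective per-position weighting changes with the input, even though the parameters of the convolutions themselves are static.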
Thank you
Hello, in the paper the authors stated: "The difference is that the convolutional kernels are static while the attention matrix generated by self-attention can adapt to the input"
Yes, the statement is indeed correct. However, I still don't quite get why the authors wrote it like that, and please correct me if I'm wrong on this.
Consider self-attention first. During inference, the attention matrix is generated by a Linear layer applied to the input, so it can adapt to the input. Next, consider conv modulation. Can't we treat the whole conv modulation as one convolution whose kernel is generated by a conv layer from the input, so that the kernel of this conv modulation can adapt to the input as well?
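The contrast in the question above can be demonstrated numerically. The following NumPy sketch (hypothetical, with the convolutions simplified to 1×1 projections for brevity) shows that a static conv kernel is identical for every input, while the modulation weights `A` in conv modulation differ from input to input:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8  # channels, height, width

# Static convolution: a 1x1 conv whose kernel never changes with the input.
W_static = rng.standard_normal((C, C))

def static_conv(x):
    # x: (C, H, W) -> (C, H, W); the kernel W_static is fixed at inference.
    return np.einsum("oc,chw->ohw", W_static, x)

# Conv modulation, simplified: A = conv(x) acts as per-position weights
# that modulate V = conv(x) via a Hadamard product.
W_a = rng.standard_normal((C, C))
W_v = rng.standard_normal((C, C))

def conv_modulation(x):
    A = np.einsum("oc,chw->ohw", W_a, x)  # weights computed from the input
    V = np.einsum("oc,chw->ohw", W_v, x)
    return A * V

x1 = rng.standard_normal((C, H, W))
x2 = rng.standard_normal((C, H, W))

# The static kernel is the same for both inputs, but the modulation
# weights A differ, i.e. they "adapt to the input".
A1 = np.einsum("oc,chw->ohw", W_a, x1)
A2 = np.einsum("oc,chw->ohw", W_a, x2)
print(np.allclose(A1, A2))  # the modulation weights change with the input
```

So the parameters (`W_a`, `W_v`, `W_static`) are static in both cases; what differs is whether the quantity playing the role of the attention matrix is a fixed parameter or a function of the input.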