HVision-NKU / Conv2Former

MIT License

Trying to understand the Conv modulation #2

Closed iumyx2612 closed 1 year ago

iumyx2612 commented 1 year ago

Hello, in the paper authors stated that: "The difference is that the convolutional kernels are static while the attention matrix generated by self-attention can adapt to the input"

Yes, the statement is indeed correct. However, I still don't quite get why the authors wrote it like that; please correct me if I'm wrong here.

Consider self-attention first. During inference, the attention matrix is generated by a linear layer, so it can adapt to the input. Now consider conv modulation. Can't we treat the whole conv modulation as one convolution whose kernel is generated by a conv layer? Then the kernel of this conv modulation can adapt to the input as well.
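For concreteness, here is a minimal NumPy sketch of the modulation being discussed (1-D, no normalization or nonlinearity; all function and variable names are mine, not the repo's). The depthwise kernel is static after training, but the map A it produces, and hence the modulation applied to V, depends on the input:

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Per-channel 1-D convolution with zero padding.
    x: (C, L) feature map, kernel: (C, K) static per-channel weights."""
    C, L = x.shape
    K = kernel.shape[1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(L):
            out[c, i] = xp[c, i:i + K] @ kernel[c]
    return out

def conv_modulation(x, dw_kernel, w_v):
    """Simplified conv modulation: A = DWConv(X), V = Linear(X), Y = A * V."""
    a = depthwise_conv1d(x, dw_kernel)  # modulation weights, input-dependent
    v = w_v @ x                         # value branch
    return a * v                        # Hadamard product
```

Because both A and V are linear in X, doubling the input quadruples the output, which is one way the Hadamard product behaves differently from a single plain convolution.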

houqb commented 1 year ago

Thanks for the question. Personally speaking, this is correct. However, the model you describe seems difficult to optimize in my experience and hence yields worse performance than using convolutions alone. But I am not sure about that.

iumyx2612 commented 1 year ago

> Thanks for the question. Personally speaking, this is correct. However, the model you describe seems difficult to optimize in my experience and hence yields worse performance than using convolutions alone. But I am not sure about that.

The model I described is almost the same as yours in the paper. So the conv modulation module in Conv2Former does indeed "adapt to the input", right? The statement "The difference is that the convolutional kernels are static while the attention matrix generated by self-attention can adapt to the input" feels like it denies that. Sorry if I misinterpreted your statement.

iumyx2612 commented 1 year ago

@houqb Hello any updates on the situation?

houqb commented 1 year ago

My statement that the kernels are static is true. If you instead view the feature maps generated by the convs as the kernels of the conv modulation, then your reading might also be true.
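This distinction can be checked numerically. In the sketch below (NumPy, my own illustration, not code from the repo), the trained depthwise kernel W never changes between inputs, while the feature map A = DWConv(X) that acts as the modulation weights does:

```python
import numpy as np

rng = np.random.default_rng(0)
C, L, K = 2, 6, 3
W = rng.normal(size=(C, K))  # trained depthwise kernel: static

def dwconv(x, w):
    """Per-channel 1-D convolution with zero padding."""
    pad = w.shape[1] // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    return np.array([
        [xp[c, i:i + w.shape[1]] @ w[c] for i in range(x.shape[1])]
        for c in range(x.shape[0])
    ])

x1 = rng.normal(size=(C, L))
x2 = rng.normal(size=(C, L))

# The kernel W is identical for both inputs (static) ...
a1, a2 = dwconv(x1, W), dwconv(x2, W)
# ... but the resulting modulation weights A differ (input-adaptive).
```

So both readings are consistent: the learned kernels are static, yet the weights actually multiplied with V are input-dependent.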

iumyx2612 commented 1 year ago

> My statement that the kernels are static is true. If you instead view the feature maps generated by the convs as the kernels of the conv modulation, then your reading might also be true.

Thank you