I'm trying to understand how to use your attention module, based on the figure above and the code below.

From what I understand from the non-local paper, if I have an input feature of `m_batchsize, channels, height, width = input_.size()`, then `n = m_batchsize * height * width` and `d = channels`. So in the code below, I should set `in_channels`, `key_channels`, and `value_channels` all equal to `channels`. But what should `head_count` be? Should it be divisible by the number of `channels`?
Hi Chandler,

The channel counts correspond as follows: `d = in_channels`, `d_k = key_channels`, `d_v = value_channels`. `head_count` must divide `key_channels` and `value_channels`, but not necessarily `in_channels`. If you don't know what value to set it to, `8` is usually a good default value.
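For concreteness, here is a minimal usage sketch of that constraint (assuming the `EfficientAttention` class from this repo is importable as below; the channel numbers are illustrative):

```python
import torch
from efficient_attention import EfficientAttention  # assumed import path

# head_count (8) must divide key_channels (64) and value_channels (64),
# but in_channels (48) need not be divisible by head_count.
attention = EfficientAttention(
    in_channels=48, key_channels=64, head_count=8, value_channels=64)

x = torch.randn(1, 48, 28, 28)  # (batch, channels, height, width)
y = attention(x)
print(y.shape)  # torch.Size([1, 48, 28, 28]) -- the module returns in_channels
```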
@cmsflash, what should the values of `d = in_channels`, `d_k = key_channels`, `d_v = value_channels` be, or what are good defaults? Should I set them all equal to `channels`? My input is shaped like `m_batchsize, channels, height, width = input_.size()`, e.g. `1, 256, 28, 28 = x.size()`:

```python
x = torch.randn(1, 256, 28, 28)  # m_batchsize, channels, height, width
attention = EfficientAttention(
    in_channels=256, key_channels=256, head_count=8, value_channels=256)
attention(x)
```
Hi Chandler,

`d = in_channels, d_k = key_channels, d_v = value_channels` is just the correspondence of the variables between the figure and the code: we used more mathematical notation in the figure and more programmatic notation in the code. `in_channels` is obviously the channel count of your input, so it is equal to `channels`. `value_channels` decides the channel count of the output, so if you want to keep it the same as the input, it is also `channels`. You are free to set `key_channels` to tune the computational cost of the module: the higher it is, the more costly the module, which usually also leads to better performance. If you don't have an idea, `key_channels = in_channels` or `key_channels = in_channels // 2` are good default values.
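Putting that advice together, a hedged example configuration (the numbers are illustrative, not prescribed by the repo):

```python
import torch
from efficient_attention import EfficientAttention  # assumed import path

channels = 256  # channel count of the input feature map

attention = EfficientAttention(
    in_channels=channels,        # must match the input's channel count
    key_channels=channels // 2,  # cheaper default; raise toward `channels` for more capacity
    head_count=8,
    value_channels=channels,     # keep the output channel count equal to the input's
)

x = torch.randn(1, channels, 28, 28)
y = attention(x)  # shape (1, 256, 28, 28)
```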
Thank you very much, @cmsflash. Now I understand how to use the attention blocks.
Hi, I am trying to use the efficient attention module, but I found that `key_channels` and `value_channels` have no impact on the output dimension.
Hi @feimadada, sorry for the misunderstanding. `value_channels` only controls the channel count for the output of the core attention step. The default EA module, as I implemented it here, has a residual connection around the attention step. Therefore, it has an additional `self.reprojection` module to project the output back to the same number of channels as the input, and then adds them up before returning.

If you want the output dimension to differ from the input, I'd suggest you either add an additional linear layer after the EA module or modify the code to remove the residual connection and the reprojection.
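As a sketch of the first suggestion (a 1×1 convolution plays the role of the per-position linear layer for a 4D feature map; the channel numbers are illustrative):

```python
import torch
import torch.nn as nn
from efficient_attention import EfficientAttention  # assumed import path

# The EA module's residual + reprojection keep the output at in_channels,
# so follow it with a 1x1 convolution to change the channel count.
widen = nn.Sequential(
    EfficientAttention(in_channels=256, key_channels=128,
                       head_count=8, value_channels=256),
    nn.Conv2d(256, 512, kernel_size=1),  # project 256 -> 512 channels
)

x = torch.randn(1, 256, 28, 28)
y = widen(x)
print(y.shape)  # torch.Size([1, 512, 28, 28])
```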
For follow-up discussion on the issue @feimadada raised, refer to the dedicated issue #10.