Hi author, thank you for the great work; I am very interested in this research. I have a question about the attention mechanism in the paper. As I understand it, the group attention described in the paper splits the query, key, and value into multiple groups and computes attention within each group in parallel; each group is then max-pooled, and attention is computed again between the groups. In the code, however, I can only find the attention within groups. The max pooling and the attention between groups do not seem to be implemented, which confuses me.
The code seems to compute the attention for each group and then concatenate the results directly.
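To make the question concrete, here is a minimal sketch of the two behaviors as I understand them. It is not taken from the repository or the paper: the function names are hypothetical, it uses plain Python lists instead of tensors, it assumes the groups are formed by splitting along the token axis into equal-sized chunks, and the way the inter-group result is added back to each token is my own guess, since I could not find that step in the code.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention; q, k, v are lists of equal-length vectors.
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def split_groups(x, num_groups):
    # Assumption: groups are contiguous, equal-sized chunks of tokens.
    size = len(x) // num_groups
    return [x[i * size:(i + 1) * size] for i in range(num_groups)]

def intra_group_attention(q, k, v, num_groups):
    # Attention inside each group only; returns one output list per group.
    groups = zip(split_groups(q, num_groups),
                 split_groups(k, num_groups),
                 split_groups(v, num_groups))
    return [attention(qg, kg, vg) for qg, kg, vg in groups]

def intra_only(q, k, v, num_groups):
    # What the released code appears to do: per-group attention, then concatenate.
    return [tok for g in intra_group_attention(q, k, v, num_groups) for tok in g]

def intra_plus_inter(q, k, v, num_groups):
    # What the paper seems to describe: intra-group attention, max pooling per
    # group, then attention between the pooled group vectors.
    intra = intra_group_attention(q, k, v, num_groups)
    # Max-pool over the tokens of each group: one summary vector per group.
    pooled = [[max(col) for col in zip(*g)] for g in intra]
    # Attention between groups, using the pooled vectors as q, k, and v.
    inter = attention(pooled, pooled, pooled)
    # Assumption: each group's inter-group vector is added to all of its tokens.
    return [[t + s for t, s in zip(tok, summary)]
            for g, summary in zip(intra, inter) for tok in g]
```

With the same inputs, `intra_only` and `intra_plus_inter` return outputs of the same shape but with different values, which is the gap between the code and the paper that I am asking about.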
Looking forward to your answers and replies!
I have the same question. I also found that the code doesn't even divide the input into groups. What you describe as "computes attention within groups" looks just like ordinary multi-head attention.