Little-Podi / GRM

[CVPR'23] The official PyTorch implementation of our CVPR 2023 paper: "Generalized Relation Modeling for Transformer Tracking".

about appendix C3 #3

Closed alistlhks closed 1 year ago

alistlhks commented 1 year ago

Hi, congratulations, this is a nice piece of work! I have a question about Appendix C3. The paper mentions that a continuous estimation can be used to scale the raw attention weights, which bypasses the non-differentiability obstacle. But how can this be done? Can you give me some references? Looking forward to your reply!

Little-Podi commented 1 year ago

Hi, thanks for your interest! This is indeed one of the possible directions to explore, and I would say there is no definitive answer for this intuition yet.

A straightforward modification would be to discard the Gumbel-Softmax discretization after the additional token division module and to directly multiply the raw attention matrix by the normalized (e.g., sigmoid) prediction output as a scaling factor (see the sketch below). However, since the current token division module is a local transformation applied to each token independently, just like the linear projections of the query, key and value, the performance gain from this kind of modification may be limited.

That is why I am more optimistic about incorporating hierarchical weight designs to complement the token-level attention weights produced by the queries and keys. For example, you can imagine that tokens with similar semantic meanings share the same scaling factor, which is then applied to their individual attention weights. The semantic groups here can be obtained through clustering techniques, and there have already been plenty of such practices in NLP and in CV detection tasks over the past two years. I am sure you can easily find several related works.
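For reference, here is a minimal sketch of the first, straightforward modification described above. It is not part of the released GRM code; the module name `ContinuousScaledAttention` and the `score_head` prediction head are made up for illustration. The per-token score is passed through a sigmoid and rescales the softmax attention weights continuously, instead of going through the Gumbel-Softmax discretization:

```python
# A minimal sketch (not from the GRM codebase) of continuous attention scaling:
# a sigmoid-normalized score per token rescales its raw attention weights,
# avoiding the non-differentiable hard token division.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContinuousScaledAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical token-level prediction head replacing the Gumbel-Softmax
        # division: one score per token, squashed to (0, 1) by a sigmoid.
        self.score_head = nn.Linear(dim, 1)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale   # raw attention logits

        # Continuous token scores in (0, 1); no hard, non-differentiable selection.
        token_score = torch.sigmoid(self.score_head(x))      # (B, N, 1)
        scaling = token_score.transpose(1, 2).unsqueeze(1)    # (B, 1, 1, N)

        # Scale the attention paid to each key token by its continuous score,
        # then renormalize so each query's weights still sum to one.
        attn = F.softmax(attn, dim=-1) * scaling
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Quick shape check with dummy tokens.
x = torch.randn(2, 320, 256)
y = ContinuousScaledAttention(256)(x)
print(y.shape)  # torch.Size([2, 320, 256])
```

Since the score head still operates on each token independently, this keeps the limitation mentioned above; the hierarchical/group-wise scaling idea would instead compute one shared factor per semantic cluster of tokens.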