Closed: alistlhks closed this issue 1 year ago
Hi, thanks for your interest! In fact, this is one of the possible directions to explore, and I would say there is no definitive answer for this intuition yet. A straightforward modification would be to discard the Gumbel-Softmax discretization after the additional token division module and directly multiply the raw attention matrix by the normalized (e.g., sigmoid) prediction output as a scaling factor (a rough sketch is given below). However, since the current token division module is also a local transformation that operates on each token independently, just like the linear projections of the query, key, and value, the performance gain from this kind of modification may be limited.

That is why I am more optimistic about incorporating some hierarchical weight designs to complement the token-level attention weights produced by the queries and keys. For example, you can imagine tokens with similar semantic meanings sharing the same scaling factor, which is then applied to their individual attention weights. The semantic groups here can be obtained through clustering techniques; there have already been plenty of such practices in NLP and in CV detection tasks over the past two years, so you should be able to find several related works easily.
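To make the first modification concrete, here is a minimal sketch of an attention layer where a per-token sigmoid score replaces the Gumbel-Softmax discretization and rescales the attention weights. This is only an illustration of the idea under my own assumptions, not the paper's implementation: the module and parameter names (`SoftScaledAttention`, `token_score`) are hypothetical, and I apply the scaling to the post-softmax weights and renormalize; scaling the pre-softmax logits is another reasonable option.

```python
import torch
import torch.nn as nn

class SoftScaledAttention(nn.Module):
    """Attention where a continuous per-token score (sigmoid instead of a
    Gumbel-Softmax hard decision) rescales the attention weights."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # hypothetical token "division" head: one scalar logit per token
        self.token_score = nn.Linear(dim, 1)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # raw attention logits
        attn = attn.softmax(dim=-1)

        # continuous, differentiable scaling: sigmoid score of each *key* token,
        # broadcast over heads and query positions (no straight-through trick needed)
        s = torch.sigmoid(self.token_score(x))         # (B, N, 1)
        scores = s.transpose(1, 2).unsqueeze(1)        # (B, 1, 1, N)
        attn = attn * scores
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # renormalize

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Since `token_score` still looks at each token in isolation, the scaling adds little information beyond what the queries and keys already encode, which is why I expect the benefit to be limited compared with a group-level (clustered) scaling design.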
Hi, congratulations, this is nice work! I have a question about Appendix C3. The paper says that a continuous estimation can be used to scale the raw attention weights, which bypasses the non-differentiability obstacle. But how exactly is that done? Could you give me some references? Looking forward to your reply!