Closed. 123456789asdfjkl closed this issue 1 year ago.
Hi @123456789asdfjkl, the Multi-head Modular Modification is motivated by our empirical studies. Our experiments in Figure 4 demonstrate the benefits of Multi-head Modular Modification over Single-head Modular Modification. If you are interested in delving deeper into the benefits of multiple heads, I think you can gain valuable insights from existing analyses of the multi-head attention mechanism in the Transformer.
OK, thank you for your answer.
Thanks for your issue. I think we can close this issue now :)
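Hello! Thank you very much for your outstanding work! By the distributive law, the multi-head mechanism you designed is mathematically the same as directly using $\mathbf{W} \in \mathbb{R}^{d \times r}$. Is this modification related to optimization? Perhaps it is better suited to gradient descent?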
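For readers following the equivalence raised in the question, here is a minimal sketch of one way to read it. It assumes each head contributes a product of two small matrices and that the head outputs are summed; that structure, and every name below (`multi_head_sum`, `single_head_product`, `r_head`, `k`), is hypothetical and not taken from the paper or its code. Under that assumption, summing the per-head products gives the same matrix as one product with the concatenated $\mathbf{W} \in \mathbb{R}^{d \times r}$.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_sum(heads):
    """Assumed multi-head form: sum over heads of W_h @ V_h."""
    total = None
    for w_h, v_h in heads:
        prod = matmul(w_h, v_h)
        total = prod if total is None else add(total, prod)
    return total


def single_head_product(heads):
    """Single-head form: concatenate the W_h column-wise into a d x r matrix,
    stack the V_h row-wise into an r x k matrix, and multiply once."""
    d = len(heads[0][0])
    w = [sum((w_h[i] for w_h, _ in heads), []) for i in range(d)]
    v = [row for _, v_h in heads for row in v_h]
    return matmul(w, v)


def max_abs_diff(a, b):
    """Largest absolute entry-wise difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical sizes: d = 4, two heads of rank 2 each, so r = 4.
rng = random.Random(0)
d, k, num_heads, r_head = 4, 3, 2, 2
heads = [
    (rand_matrix(d, r_head, rng), rand_matrix(r_head, k, rng))
    for _ in range(num_heads)
]

diff = max_abs_diff(multi_head_sum(heads), single_head_product(heads))
print(diff < 1e-12)
```

The check only covers the function that the two forms represent. It says nothing about whether they behave differently during training, which is the part the question asks about.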