Journey7331 opened this issue 1 month ago
We say that S6 has a gating mechanism inside because we can rewrite the SSM into an attention-like form: $$(((Q \odot W)(\frac{K}{W})^T) \odot M)V$$ So $W$ acts like a gate added onto a "linear transform", and together they make the SSM a special kind of attention. For details, you can refer to the visualization section of the paper.
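Here is a minimal NumPy sketch of that rewrite (not the repo's actual code; the shapes of $Q$, $K$, $V$, $W$ and their mapping to the SSM's $C$, $B$, $x$, and cumulative decay are assumptions for illustration):

```python
import numpy as np

# Illustrative sketch of the attention-like form
#   y = (((Q ⊙ W) (K / W)^T) ⊙ M) V
# All names and shapes below are assumptions, not the repo's code:
#   L = sequence length, d = state/head dimension
L, d = 8, 4
rng = np.random.default_rng(0)

Q = rng.standard_normal((L, d))   # query-like term (e.g. C in the SSM)
K = rng.standard_normal((L, d))   # key-like term   (e.g. B in the SSM)
V = rng.standard_normal((L, d))   # value-like term (e.g. the input x)
# W: per-position gate, e.g. a cumulative decay; kept positive so K / W is safe
W = np.exp(rng.standard_normal((L, d)))

M = np.tril(np.ones((L, L)))      # causal mask, as in the formula above

# Plain (ungated) linear attention for comparison: ((Q K^T) ⊙ M) V
y_plain = ((Q @ K.T) * M) @ V

# Gated form: W enters Q multiplicatively and K divisively, so the score
# between positions i and j is scaled per channel by W[i] / W[j]
y_gated = (((Q * W) @ (K / W).T) * M) @ V

print(y_plain.shape, y_gated.shape)  # both (L, d)
```

Note that the division by $W$ on the key side means the score between positions $i$ and $j$ is scaled by $W_i / W_j$ per channel, so the gate acts as a relative, data-dependent decay rather than a static weight.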
Thanks for your quick reply :)
But are there any other ablation experiments or attention map comparisons that can support this? I think that would make the point easier to grasp. 👀 If there are any such studies, that would be greatly appreciated. ❤️
Hi @MzeroMiko, I see the following in your paper, and I also see `noz` in your model yaml. How should I understand this statement:

> the gating mechanism has already been implemented by the selectivity of SS2D

Are there any lines of code that could explain this?