Closed philgzl closed 6 months ago
- Is the `state` argument of the different `forward` methods always `None` in your experiments? If not, when should it be set to something different from `None`?
It's always `None` in our experiments.
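For readers wondering what a non-`None` state could be used for in general, here is a minimal sketch of a scalar retention-style recurrence. It is not taken from this repository: the function `retention_recurrent`, the decay `gamma`, and the interpretation of `state` as the carried recurrent state are all assumptions made for illustration. It only shows that starting from `state=None` (a zero state) on the full sequence matches processing the sequence in two chunks while passing the state along.

```python
def retention_recurrent(q, k, v, gamma=0.9, state=None):
    """Hypothetical scalar recurrence: s_t = gamma * s_{t-1} + k_t * v_t, o_t = q_t * s_t."""
    # Assumption: state=None means "start from a zero recurrent state".
    s = 0.0 if state is None else state
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        s = gamma * s + k_t * v_t   # decay the old state, add the new key-value product
        outputs.append(q_t * s)     # read the state out with the query
    return outputs, s


# Illustrative inputs (made up for this sketch).
q = [1.0, 0.5, 2.0, 1.5]
k = [0.2, 1.0, 0.3, 0.7]
v = [1.0, 2.0, 3.0, 4.0]

# Whole sequence in one call, state left as None.
full_out, full_state = retention_recurrent(q, k, v)

# Same sequence in two chunks, carrying the state from the first call into the second.
first_out, mid_state = retention_recurrent(q[:2], k[:2], v[:2])
second_out, end_state = retention_recurrent(q[2:], k[2:], v[2:], state=mid_state)
chunked_out = first_out + second_out
```

Under these assumptions, a non-`None` state would only matter for chunked or streaming processing; when each sequence is processed in a single call, leaving it as `None` is sufficient.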
When using Retention, I can see you are sharing the query and key projection layers when RoPE is disabled here. Can you explain why? This does not seem to be explained in the paper.
Sharing the query and key projection layers did not degrade performance in our experiments, and it reduces the number of parameters and the computational cost.
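To make the parameter saving concrete, here is a rough sketch that counts projection weights with and without a shared query/key layer. The helper `projection_params` and the dimensions below are hypothetical and not taken from the repository; the sketch assumes queries and keys have the same dimension, which sharing one layer requires.

```python
def projection_params(d_model, d_head, n_heads, share_qk, bias=False):
    """Count weights of the input projections of one attention-like layer (illustrative)."""
    # Each projection maps d_model -> n_heads * d_head.
    per_proj = d_model * d_head * n_heads
    if bias:
        per_proj += d_head * n_heads
    # Separate layers: q, k, v. Shared layer: one qk projection plus v.
    n_proj = 2 if share_qk else 3
    return n_proj * per_proj


# Hypothetical sizes chosen only for illustration.
separate = projection_params(d_model=256, d_head=64, n_heads=4, share_qk=False)
shared = projection_params(d_model=256, d_head=64, n_heads=4, share_qk=True)
saved = separate - shared

print(f"separate q/k/v: {separate} weights")
print(f"shared qk + v:  {shared} weights")
print(f"saved:          {saved} weights")
```

With these sizes, sharing removes one of the three input projections, along with the matrix multiplication that would have computed it.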
Thanks!