Open chenying0722 opened 3 months ago
In the linear_focus_attention part, why isn't an operation like phi_qs = (F.relu(qs) + 1e-6) / (self.norm_scale.abs() + 1e-6) applied to v as well? I ask because formula (15) in the paper applies the Phi function to Q_s, K_s, and V_s.
Thank you for your comments. This is a typo; our results were obtained without this operation. Our motivation was to make attention more focused, particularly concerning $q$ and $k$, so we did not do this operation for $v$. Nevertheless, we added this operation for $v$ as well after you opened this issue, and we found that it has minimal impact on the final results. Thank you again.
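A minimal sketch of the difference discussed above, written with plain Python lists so it runs without PyTorch. The helper `focus_inputs`, the flag `apply_phi_to_v`, and the scalar `norm_scale` are assumptions for illustration and are not taken from the repository; only the expression inside `phi` mirrors the line quoted in the question.

```python
def phi(xs, norm_scale, eps=1e-6):
    """Elementwise (relu(x) + eps) / (|norm_scale| + eps), as in the quoted line."""
    denom = abs(norm_scale) + eps
    return [(max(x, 0.0) + eps) / denom for x in xs]


def focus_inputs(qs, ks, vs, norm_scale, apply_phi_to_v=False):
    """Hypothetical helper: always transform q and k; transform v only on request."""
    phi_qs = phi(qs, norm_scale)
    phi_ks = phi(ks, norm_scale)
    # Default path leaves v untouched, matching how the reported results were obtained.
    phi_vs = phi(vs, norm_scale) if apply_phi_to_v else list(vs)
    return phi_qs, phi_ks, phi_vs


qs, ks, vs = [-1.0, 2.0], [0.5, -3.0], [-4.0, 1.0]
phi_qs, phi_ks, plain_vs = focus_inputs(qs, ks, vs, norm_scale=2.0)
_, _, phi_vs = focus_inputs(qs, ks, vs, norm_scale=2.0, apply_phi_to_v=True)
```

With the default flag, v keeps its negative entries, while q and k become strictly positive; setting `apply_phi_to_v=True` corresponds to the variant described in formula (15).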
I understand. Thank you for your answer!
Sorry, I still have a question. In the paper, formula (18) handles Z_t by subtracting the reciprocal of the curvature k, while the code adds the curvature k. Looking forward to your answer!
Thanks for your question!
If we fix the curvature c as a negative value (i.e., c < 0), then -c == abs(c); for example, -(-1) == abs(-1) == 1. Therefore, we directly define a positive value k and apply +1 in the equations, which is equivalent to -(-1).
Using either k or 1/k is acceptable; the choice depends on how you define the curvature, but it is important to be consistent. In the code, +k means that the curvature is −1/k.
Why do we use +k in the code? Because it is easier to implement and makes the parameter easier to optimize.
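The equivalence of the two conventions can be checked numerically. The sketch below assumes that the quantity in question is the time-like coordinate of a Lorentz-model point, computed as the square root of the squared norm of the space-like part plus an offset; the function names and this exact form are assumptions for illustration, not the repository's code or the paper's exact formula (18).

```python
import math


def time_component_paper(space, kappa):
    """Assumed paper-style convention: kappa < 0, subtract its reciprocal."""
    if kappa >= 0:
        raise ValueError("kappa must be negative in this convention")
    sq_norm = sum(s * s for s in space)
    return math.sqrt(sq_norm - 1.0 / kappa)


def time_component_code(space, k):
    """Assumed code-style convention: k > 0, add it directly (curvature is -1/k)."""
    if k <= 0:
        raise ValueError("k must be positive in this convention")
    sq_norm = sum(s * s for s in space)
    return math.sqrt(sq_norm + k)


space = [0.3, -1.2, 2.0]
k = 2.0
kappa = -1.0 / k  # the curvature that +k represents

t_paper = time_component_paper(space, kappa)
t_code = time_component_code(space, k)

# Lorentz inner product of the resulting point with itself; it should equal -k.
lorentz_norm = -t_code ** 2 + sum(s * s for s in space)
```

Both functions return the same value because -1/kappa equals k when kappa = -1/k, so the only difference is which quantity is stored as the parameter.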
Thank you for your answer! I saw that the curvature was described as k < 0, which led me to think that k is also negative in the code. Now I understand. Thanks again!