Graph-and-Geometric-Learning / hyperbolic-transformer

MIT License

Question about linear_focus_attention #1

Open chenying0722 opened 3 months ago

chenying0722 commented 3 months ago

In the linear_focus_attention part, why is an operation like `phi_qs = (F.relu(qs) + 1e-6) / (self.norm_scale.abs() + 1e-6)` not applied to v as well? I ask because Equation (15) in the paper applies the Phi function to Q_s, K_s, and V_s.

marlin-codes commented 3 months ago

Thank you for your comments. This is a typo in the paper; our results were obtained without applying this operation to $v$. Our motivation was to make attention more focused, which concerns $q$ and $k$ in particular, so we did not apply the operation to $v$. Nevertheless, after you opened this issue we also tried applying it to $v$ and found that it has minimal impact on the final results. Thank you again.
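To make the distinction concrete, here is a minimal, dependency-free sketch of generic (Euclidean) linear attention in which the feature map is applied to the queries and keys only. It is not the repository's implementation: the function names `phi` and `linear_attention`, the `apply_phi_to_v` flag, the fixed scalar `norm_scale`, and the sum-based normalizer are all assumptions made for illustration. Only the expression `(relu(x) + 1e-6) / (abs(norm_scale) + 1e-6)` is taken from the question above.

```python
def phi(x, norm_scale=1.0, eps=1e-6):
    """Element-wise feature map quoted in the issue:
    (relu(x) + eps) / (|norm_scale| + eps). Always strictly positive."""
    denom = abs(norm_scale) + eps
    return [[(max(v, 0.0) + eps) / denom for v in row] for row in x]


def linear_attention(qs, ks, vs, apply_phi_to_v=False):
    """Hypothetical linear attention over lists of row vectors.

    out_i = phi(q_i) @ (sum_j phi(k_j)^T v_j) / (phi(q_i) . sum_j phi(k_j))
    """
    phi_qs, phi_ks = phi(qs), phi(ks)
    if apply_phi_to_v:
        # The variant asked about in the issue: also map the values.
        vs = phi(vs)
    d, d_v = len(phi_ks[0]), len(vs[0])
    # kv[a][b] = sum_j phi_k[j][a] * v[j][b]; computed once, so the cost
    # grows linearly with the number of keys.
    kv = [[sum(k[a] * v[b] for k, v in zip(phi_ks, vs)) for b in range(d_v)]
          for a in range(d)]
    k_sum = [sum(k[a] for k in phi_ks) for a in range(d)]
    out = []
    for q in phi_qs:
        z = sum(q[a] * k_sum[a] for a in range(d))  # normalizer, > 0 since phi > 0
        out.append([sum(q[a] * kv[a][b] for a in range(d)) / z for b in range(d_v)])
    return out


qs = [[0.5, -1.0], [2.0, 0.3]]
ks = [[1.0, 0.2], [-0.4, 0.7]]
vs = [[1.0, -2.0], [3.0, 4.0]]
out_qk_only = linear_attention(qs, ks, vs)
out_with_v = linear_attention(qs, ks, vs, apply_phi_to_v=True)
```

In this sketch the attention weights depend only on `phi(q)` and `phi(k)`, so the focusing effect is unchanged by the flag. Mapping `v` as well only alters what is being averaged: negative entries of `v` are clipped to a small positive value.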

chenying0722 commented 3 months ago

I understand. Thank you for your answer!

chenying0722 commented 3 months ago

Sorry, I still have a question. In the paper, formula (18) computes Z_t by subtracting the reciprocal of the curvature k, while the code adds k. Looking forward to your answer!

marlin-codes commented 3 months ago

Thanks for your question!

On the sign (subtracting versus adding):

If the curvature c is fixed to a negative value (c < 0), then -c == abs(c); for example, -(-1) == abs(-1) == 1.

Therefore, we directly use a positive value k and add it in the equations. With c = -1, for example, +1 is equivalent to -(-1).

On k versus 1/k:

Using either k or 1/k is acceptable. The choice depends on how you define the curvature, but it is important to be consistent. In the code, +k means that the curvature is −1/k.

Why do we use +k in the code? Because it is easier to implement and makes the parameter easier to optimize.
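The equivalence of the two conventions can be checked numerically. The sketch below assumes the standard Lorentz-model form of the time component, the square root of the squared space norm plus a positive constant. Formula (18) itself is not reproduced in this thread, so that form and the helper names `time_paper_style`, `time_code_style`, and `lorentz_norm_sq` are assumptions for illustration only.

```python
import math


def time_paper_style(space, curvature):
    """Paper-style convention as described in the issue: the curvature is
    negative and its reciprocal is subtracted."""
    if curvature >= 0:
        raise ValueError("curvature must be negative")
    return math.sqrt(sum(s * s for s in space) - 1.0 / curvature)


def time_code_style(space, k):
    """Code-style convention: k is positive, the curvature is -1/k,
    and k is added."""
    if k <= 0:
        raise ValueError("k must be positive")
    return math.sqrt(sum(s * s for s in space) + k)


def lorentz_norm_sq(time, space):
    """Lorentzian squared norm: -t^2 + ||space||^2."""
    return -time * time + sum(s * s for s in space)


space = [0.3, -1.2, 2.0]
k = 2.0                 # positive parameter, as in the code
curvature = -1.0 / k    # the curvature that +k corresponds to

t_code = time_code_style(space, k)
t_paper = time_paper_style(space, curvature)
```

Both functions return the same time component, because subtracting 1/curvature with curvature = -1/k is the same as adding k. The resulting point satisfies `lorentz_norm_sq(t_code, space) == -k` up to floating-point error, which is the constraint for a hyperboloid of curvature -1/k.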

chenying0722 commented 3 months ago

Thank you for your answer! I saw that the curvature was described as k < 0, which led me to think that k is also negative in the code. Now I understand. Thanks again!