Closed: sijia1999 closed this issue 2 years ago
The K in our paper is the negative curvature; we simply use K = -1 in our code. When calculating the attention weights, a smaller distance should indicate higher similarity, so we feed the negative Lorentzian distance into the softmax function so that similar tokens obtain higher weights.
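As a minimal sketch (plain Python with hypothetical helper names, not the authors' implementation), the idea of negating squared Lorentzian distances before the softmax can be illustrated like this, assuming the signature (-, +, ..., +) and points on the hyperboloid with <x, x>_L = -1:

```python
import math

def lorentz_inner(a, b):
    # Lorentzian inner product: <a, b>_L = -a0*b0 + sum_i ai*bi
    return -a[0] * b[0] + sum(x * y for x, y in zip(a[1:], b[1:]))

def sq_lorentz_dist(a, b):
    # Squared Lorentzian distance computed directly as <a - b, a - b>_L,
    # which is non-negative for points on the hyperboloid.
    diff = [x - y for x, y in zip(a, b)]
    return lorentz_inner(diff, diff)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def point(t):
    # (cosh t, sinh t) lies on the 1+1-dimensional hyperboloid <x, x>_L = -1.
    return (math.cosh(t), math.sinh(t))

q = point(0.1)
keys = [point(0.2), point(2.0)]  # the first key is much closer to q

# Smaller distance should mean higher similarity, so negate before softmax.
dists = [sq_lorentz_dist(q, k) for k in keys]
weights = softmax([-d for d in dists])
```

With this convention, the nearby key receives most of the attention mass, which is exactly the behavior the negation is meant to produce.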
Thank you for your answer!
Hello, I encountered another question while reading the code. May I ask how the initialization value of scale is determined, and what factors it depends on? If the initialization value is not set properly, could a low embedding dimension such as 8 and a high embedding dimension such as 64 end up producing similar results?
`self.scale.exp()`
corresponds to the \lambda inside \phi(Wx, v) in Equation (3) of the paper. We set it to approximately 10 empirically (exp(2.3) ≈ 9.97). We actually did not search over it much. If you encounter any stability problems, you can try lowering the initialization value.
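A toy stand-in (plain Python, not the repository's code) for this kind of learnable temperature: the parameter is stored in log-space, so the effective multiplier is exp(scale), mirroring `self.scale.exp()`:

```python
import math

class ScaledSimilarity:
    """Toy log-space temperature: effective multiplier is exp(scale)."""

    def __init__(self, init_scale=2.3):
        # exp(2.3) ≈ 9.97, i.e. an effective lambda of roughly 10,
        # matching the empirical initialization described above.
        self.scale = init_scale

    def effective_lambda(self):
        return math.exp(self.scale)

    def __call__(self, neg_sq_dist):
        # Attention logit: lambda * (negative squared distance).
        return self.effective_lambda() * neg_sq_dist
```

Storing the parameter in log-space keeps the effective lambda positive throughout training; lowering `init_scale` lowers the initial temperature, which is the stability knob mentioned above.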
Generally, a larger dimension brings better performance, but there is no guarantee that this will always be the case, and it is also largely task-dependent. I can only say that the initialization of this value could be one of the reasons.
Ok, thank you again for your answer, I will study it further!
May I ask why the squared Lorentzian distance is -2/k - 2<a, b> in the code, instead of 2/K - 2<a, b> as in the paper?
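The thread ends without a reply, but a numerical check (my own derivation under assumed conventions, not the authors' answer) suggests the two forms can come from opposite sign conventions for k. Assuming the signature (-, +, ..., +) and points constrained to <x, x>_L = 1/K with K = -1, expanding <a - b, a - b>_L gives the paper's 2/K form; the code's -2/k form gives the same number if the stored k is the curvature magnitude |K|:

```python
import math

def lorentz_inner(a, b):
    # Lorentzian inner product with signature (-, +, ..., +).
    return -a[0] * b[0] + sum(x * y for x, y in zip(a[1:], b[1:]))

def point(t):
    # (cosh t, sinh t) satisfies <x, x>_L = -1, i.e. 1/K with K = -1.
    return (math.cosh(t), math.sinh(t))

a, b = point(0.3), point(1.1)
K = -1.0  # curvature as written in the paper (assumption)

# Direct expansion: ||a - b||_L^2 = <a,a> + <b,b> - 2<a,b> = 2/K - 2<a,b>.
diff = [x - y for x, y in zip(a, b)]
direct = lorentz_inner(diff, diff)
paper_form = 2.0 / K - 2.0 * lorentz_inner(a, b)

# If the code's k is the curvature magnitude |K|, then -2/|K| = 2/K,
# so -2/k - 2<a,b> in the code would equal the paper's expression.
code_form_with_abs_k = -2.0 / abs(K) - 2.0 * lorentz_inner(a, b)
```

Under these assumptions `direct`, `paper_form`, and `code_form_with_abs_k` all agree, but only the authors can confirm which sign convention their k actually follows.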