Closed: sijia1999 closed this issue 2 years ago
The K in our paper is the negative curvature; we simply use K = -1 in our code. When calculating the attention weights, a smaller distance should indicate higher similarity, so we feed the negative Lorentzian distance into the softmax function so that similar tokens obtain higher weights.
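As a minimal sketch (plain Python with hypothetical helper names, not the authors' implementation), the idea of negating squared Lorentzian distances before the softmax can be illustrated like this, assuming the signature (-, +, ..., +) and points on the hyperboloid with <x, x>_L = -1:

```python
import math

def lorentz_inner(a, b):
    # Lorentzian inner product: <a, b>_L = -a0*b0 + sum_i ai*bi
    return -a[0] * b[0] + sum(x * y for x, y in zip(a[1:], b[1:]))

def sq_lorentz_dist(a, b):
    # Squared Lorentzian distance computed directly as <a - b, a - b>_L,
    # which is non-negative for points on the hyperboloid.
    diff = [x - y for x, y in zip(a, b)]
    return lorentz_inner(diff, diff)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def point(t):
    # (cosh t, sinh t) lies on the 1+1-dimensional hyperboloid <x, x>_L = -1.
    return (math.cosh(t), math.sinh(t))

q = point(0.1)
keys = [point(0.2), point(2.0)]  # the first key is much closer to q

# Smaller distance should mean higher similarity, so negate before softmax.
dists = [sq_lorentz_dist(q, k) for k in keys]
weights = softmax([-d for d in dists])
```

With this convention, the nearby key receives most of the attention mass, which is exactly the behavior the negation is meant to produce.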
Thank you for your answer!
Hello, I encountered another question while reading the code. May I ask how the initialization value of scale is determined, and what factors it depends on? If the initialization value is not set properly, could a low embedding dimension such as 8 and a high embedding dimension such as 64 end up producing similar results?
`self.scale.exp()`
corresponds to the \lambda inside \phi(Wx, v) in Equation (3) of the paper. We set it to approximately 10 empirically (exp(2.3) ≈ 9.97). We actually did not search over it much. If you encounter any stability problems, you can try lowering the initialization value.
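A toy stand-in (plain Python, not the repository's code) for this kind of learnable temperature: the parameter is stored in log-space, so the effective multiplier is exp(scale), mirroring `self.scale.exp()`:

```python
import math

class ScaledSimilarity:
    """Toy log-space temperature: effective multiplier is exp(scale)."""

    def __init__(self, init_scale=2.3):
        # exp(2.3) ≈ 9.97, i.e. an effective lambda of roughly 10,
        # matching the empirical initialization described above.
        self.scale = init_scale

    def effective_lambda(self):
        return math.exp(self.scale)

    def __call__(self, neg_sq_dist):
        # Attention logit: lambda * (negative squared distance).
        return self.effective_lambda() * neg_sq_dist
```

Storing the parameter in log-space keeps the effective lambda positive throughout training; lowering `init_scale` lowers the initial temperature, which is the stability knob mentioned above.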
Generally, a larger dimension brings better performance, but there is no guarantee that this will always be the case, and it is also largely task-dependent. I can only say that the initialization of this value could be one of the reasons.
Ok, thank you again for your answer, I will study it further!
May I ask why the squared Lorentzian distance is -2/k - 2<a, b> in the code, instead of 2/K - 2<a, b> as in the paper?
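The thread ends without a reply, but a numerical check (my own derivation under assumed conventions, not the authors' answer) suggests the two forms can come from opposite sign conventions for k. Assuming the signature (-, +, ..., +) and points constrained to <x, x>_L = 1/K with K = -1, expanding <a - b, a - b>_L gives the paper's 2/K form; the code's -2/k form gives the same number if the stored k is the curvature magnitude |K|:

```python
import math

def lorentz_inner(a, b):
    # Lorentzian inner product with signature (-, +, ..., +).
    return -a[0] * b[0] + sum(x * y for x, y in zip(a[1:], b[1:]))

def point(t):
    # (cosh t, sinh t) satisfies <x, x>_L = -1, i.e. 1/K with K = -1.
    return (math.cosh(t), math.sinh(t))

a, b = point(0.3), point(1.1)
K = -1.0  # curvature as written in the paper (assumption)

# Direct expansion: ||a - b||_L^2 = <a,a> + <b,b> - 2<a,b> = 2/K - 2<a,b>.
diff = [x - y for x, y in zip(a, b)]
direct = lorentz_inner(diff, diff)
paper_form = 2.0 / K - 2.0 * lorentz_inner(a, b)

# If the code's k is the curvature magnitude |K|, then -2/|K| = 2/K,
# so -2/k - 2<a,b> in the code would equal the paper's expression.
code_form_with_abs_k = -2.0 / abs(K) - 2.0 * lorentz_inner(a, b)
```

Under these assumptions `direct`, `paper_form`, and `code_form_with_abs_k` all agree, but only the authors can confirm which sign convention their k actually follows.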