chijames / KERPLE

Apache License 2.0

About value of c in position embedding and enforcing the range of r1 r2 #1

Closed. yunzqq closed this issue 1 year ago.

yunzqq commented 1 year ago

The value of c is not used during the experiments, right?

chijames commented 1 year ago

Yes, it's not used. We don't need to explicitly derive the value of c as it is implicitly absorbed in the softmax computation.
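This works because softmax is invariant to adding the same constant to every logit. A minimal PyTorch check of that property (illustrative only, not code from this repository):

```python
import torch

scores = torch.randn(4, 8)   # toy attention logits
c = 3.7                      # an arbitrary constant offset

p1 = torch.softmax(scores, dim=-1)
p2 = torch.softmax(scores + c, dim=-1)
print(torch.allclose(p1, p2))  # True: c is absorbed by the normalization
```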

yunzqq commented 1 year ago

> Yes, it's not used. We don't need to explicitly derive the value of c as it is implicitly absorbed in the softmax computation.

Thank you very much. In the implementation, what strategy is used to ensure r1, r2 > 0? Do you just apply a square operation?

chijames commented 1 year ago

So there are actually two ways to do this:

  1. r1.data = r1.data.clamp(min=1e-2): This is equivalent to doing projection, and we still get meaningful gradients for subsequent computations even if r1 is out of range at some point. We went for this method.
  2. r1 = r1.clamp(min=1e-2): All the gradients of subsequent computations will be zeroed out if r1 is out of range, which is not so great.
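A minimal PyTorch sketch contrasting the two options (illustrative only; the variable names are placeholders, not the repository's exact code):

```python
import torch

r1 = torch.nn.Parameter(torch.tensor(0.5))

# Option 1 (projection, the approach chosen above): clamp the parameter's
# data in place, e.g. after each optimizer step. The forward pass then sees
# an in-range r1 and still produces meaningful gradients for it.
r1.data = r1.data.clamp(min=1e-2)

# Option 2: clamp inside the forward computation. Whenever r1 falls below
# the threshold, clamp's gradient w.r.t. r1 is zero, so r1 stops receiving
# updates through this path.
def forward_with_clamp(r1):
    return r1.clamp(min=1e-2)
```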

yunzqq commented 1 year ago

> So there are actually two ways to do this:
>
>   1. r1.data = r1.data.clamp(min=1e-2): This is equivalent to doing projection, and we still get meaningful gradients for subsequent computations even if r1 is out of range at some point. We went for this method.
>   2. r1 = r1.clamp(min=1e-2): All the gradients of subsequent computations will be zeroed out if r1 is out of range, which is not so great.

Thank you very much! Is there a TensorFlow implementation?

chijames commented 1 year ago

Sorry, nothing comes to mind off the top of my head at the moment, as I haven't used TensorFlow for a while. That being said, could you please let me know which mode (graph vs. eager execution) is more prevalent these days? Perhaps we can come up with a solution together here.

yunzqq commented 1 year ago

> Sorry, nothing comes to mind off the top of my head at the moment, as I haven't used TensorFlow for a while. That being said, could you please let me know which mode (graph vs. eager execution) is more prevalent these days? Perhaps we can come up with a solution together here.

Thank you very much for your response! I will give it a try and consult you if I have any questions. Thank you again!

yunzqq commented 1 year ago

> So there are actually two ways to do this:
>
>   1. r1.data = r1.data.clamp(min=1e-2): This is equivalent to doing projection, and we still get meaningful gradients for subsequent computations even if r1 is out of range at some point. We went for this method.
>   2. r1 = r1.clamp(min=1e-2): All the gradients of subsequent computations will be zeroed out if r1 is out of range, which is not so great.

Thank you for your response. I want to ask about another detail: $\sqrt{d_k}$ is not used to scale $q^\top k$, right?

chijames commented 1 year ago

Hi,

I think you meant $\sqrt{d}$ used in eq. (1) and (2)? If that's the case, then $\sqrt{d}$ is used to scale $q^\top k$. I just found that we missed a left parenthesis in the fraction expression right after Lemma 1. Sorry if you are confused by this typo.

yunzqq commented 1 year ago

> Hi,
>
> I think you meant $\sqrt{d}$ used in eq. (1) and (2)? If that's the case, then $\sqrt{d}$ is used to scale $q^\top k$. I just found that we missed a left parenthesis in the fraction expression right after Lemma 1. Sorry if you are confused by this typo.

Thank you very much for your response!

yunzqq commented 1 year ago

> Hi,
>
> I think you meant $\sqrt{d}$ used in eq. (1) and (2)? If that's the case, then $\sqrt{d}$ is used to scale $q^\top k$. I just found that we missed a left parenthesis in the fraction expression right after Lemma 1. Sorry if you are confused by this typo.

Sorry to disturb you again. So the attention is calculated as $\mathrm{softmax}\left(\frac{q^\top k + \text{Bias} + \text{MASK}}{\sqrt{d}}\right)$, right?

chijames commented 1 year ago

Ok, so this is our fault. In the implementation, we apply the $\sqrt{d}$ scaling here. Then we add the KERPLE bias here. Finally, the attention mask is added here. Simply put, it should be $\mathrm{softmax}\left(\frac{q^\top k}{\sqrt{d}} + \text{Bias} + \text{MASK}\right)$. This is inconsistent with equations 1 and 2 in the paper, though it is just a $\sqrt{d}$ scaling difference, so the kernel is still cpd (the factor can be absorbed into $r_1$ as long as $\sqrt{d} > 0$). Thanks for catching this, and we will update the paper once we get a chance. In the meantime, please feel free to ask more questions if there are any. Thanks!
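A minimal PyTorch sketch of that ordering (illustrative only; tensor shapes and names are assumptions, not the repository's actual code):

```python
import math
import torch

def kerple_attention_weights(q, k, kerple_bias, attn_mask):
    # q, k: (batch, heads, seq_len, d); kerple_bias and attn_mask are
    # additive terms broadcastable to (batch, heads, seq_len, seq_len).
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # only q^T k is scaled by sqrt(d)
    scores = scores + kerple_bias                    # KERPLE positional bias, added after scaling
    scores = scores + attn_mask                      # e.g. -inf above the diagonal for causal masking
    return torch.softmax(scores, dim=-1)
```

This follows the order described above: scale, add the bias, add the mask, then softmax.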

yunzqq commented 1 year ago

I really appreciate your detailed response!