Closed: yunzqq closed this issue 1 year ago.
Yes, it's not used. We don't need to explicitly derive the value of c as it is implicitly absorbed in the softmax computation.
Thank you very much. In the implementation, how do you ensure $r_1, r_2 > 0$? Do you just apply a square operation?
So there are actually two ways to do this:
- `r1.data = r1.data.clamp(min=1e-2)`: This is equivalent to doing a projection, and we still get meaningful gradients for subsequent computations even if `r1` goes out of range at some point. We went with this method.
- `r1 = r1.clamp(min=1e-2)`: All gradients of subsequent computations are zeroed out whenever `r1` is out of range, which is not so great.
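A minimal sketch of the difference between the two approaches (the variable names and the toy loss are illustrative, not the repo's actual code):

```python
import torch

# Approach 1: in-place projection on .data. The clamp is invisible to
# autograd, so gradients flow through r1 as if it had never been clamped,
# which keeps them meaningful even when r1 drifts out of range.
r1 = torch.tensor(-0.5, requires_grad=True)
r1.data = r1.data.clamp(min=1e-2)  # r1 now holds 1e-2
loss = r1 * 3.0
loss.backward()
print(r1.grad)  # tensor(3.), unaffected by the projection

# Approach 2: out-of-place clamp participates in autograd. When the input
# is below the clamp boundary, the local gradient of clamp is zero, so all
# upstream gradients through r2 are zeroed out.
r2 = torch.tensor(-0.5, requires_grad=True)
loss2 = r2.clamp(min=1e-2) * 3.0
loss2.backward()
print(r2.grad)  # tensor(0.), killed by the clamp
```

This is why the projection-style update preserves training signal: the parameter is nudged back into the feasible region without cutting the gradient path.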
Thank you very much! Is there an implementation for TensorFlow?
Sorry, nothing comes to mind at the moment as I haven't used TensorFlow in a while. That being said, could you let me know which mode (graph vs. eager execution) is more prevalent these days? Perhaps we can come up with a solution together here.
Thank you very much for your response! I will give it a try and consult you if I have any questions. Thank you again!
Thank you for your response. I want to ask about another detail: $\sqrt{d_k}$ is not used to scale $q^\top k$, right?
Hi,
I think you meant $\sqrt{d}$ used in eq. (1) and (2)? If that's the case, then $\sqrt{d}$ is used to scale $q^\top k$. I just found that we missed a left parenthesis in the fraction expression right after Lemma 1. Sorry if you are confused by this typo.
Thank you very much for your response!
Sorry to disturb you again. So the attention is calculated as $\mathrm{softmax}\big((q^\top k + \text{Bias} + \text{MASK})/\sqrt{d}\big)$, right?
Ok, so this is our fault. In the implementation, we apply the $\sqrt{d}$ scaling here. Then we add the KERPLE bias here. Finally, the attention mask is added here. Simply put, it should be $\mathrm{softmax}\big(\frac{q^\top k}{\sqrt{d}} + \text{Bias} + \text{MASK}\big)$. This is inconsistent with equations 1 and 2 in the paper, though the difference is just the $\sqrt{d}$ scaling, so the kernel is still cpd (the factor can be absorbed into $r_1$ as long as $\sqrt{d} > 0$). Thanks for catching this; we will update the paper once we get a chance. In the meantime, please feel free to ask more questions if you have any. Thanks!
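The order of operations described above (scale first, then add the bias, then the mask, then softmax) can be sketched as follows. This is a hypothetical illustration, not the repo's actual code; the function name and shapes are assumptions:

```python
import math
import torch

def attention_probs(q, k, kerple_bias, mask):
    """Illustrative score computation: scale q k^T by 1/sqrt(d) first,
    then add the positional bias, then the attention mask, then softmax.
    q, k: (seq_len, d); kerple_bias, mask: (seq_len, seq_len)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (q^T k) / sqrt(d)
    scores = scores + kerple_bias                    # relative-position bias
    scores = scores + mask                           # e.g. -inf above the diagonal for causal masking
    return torch.softmax(scores, dim=-1)

# usage with a causal mask
L, d = 4, 8
q, k = torch.randn(L, d), torch.randn(L, d)
bias = torch.zeros(L, L)  # a real KERPLE bias would go here
causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
probs = attention_probs(q, k, bias, causal)
```

Because the bias is added after the $1/\sqrt{d}$ scaling, it differs from the paper's equations only by that constant factor, which is what allows it to be absorbed into $r_1$.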
Really appreciate your detailed response!
So the value of c is not used during the experiments, right?