e-bug / pascal

[ACL 2020] Code and data for our paper "Enhancing Machine Translation with Dependency-Aware Self-Attention"
https://www.aclweb.org/anthology/2020.acl-main.147/
MIT License

question about the dist function #6

Closed lovodkin93 closed 3 years ago

lovodkin93 commented 3 years ago

Hello, I was going through your paper, and something didn't make sense to me about the calculation of the D^p matrix. My confusion stems from the fact that the indices in the vector p start from 1 (in your example in Figure 1, the word "eats", which is the parent of "monkey", "banana" and "eats", is located at index 3, rather than at index 2 as it would be if we counted from 0), whereas in the equation for calculating dist(p_t, j), j is counted from 0 rather than 1. Doesn't this create a discrepancy? For example, in Figure 1, the five entries of the first row have the following exponents:

(0,0): -((0-p_0)^2)/2 = -((0-2)^2)/2 = -4/2 = -2
(0,1): -((1-p_0)^2)/2 = -((1-2)^2)/2 = -1/2
(0,2): -((2-p_0)^2)/2 = -((2-2)^2)/2 = 0
(0,3): -((3-p_0)^2)/2 = -((3-2)^2)/2 = -1/2
(0,4): -((4-p_0)^2)/2 = -((4-2)^2)/2 = -4/2 = -2

So in fact the highest value is at j=2 (corresponding to "eats"), even though the first word's parent is "monkey" (index 1 when counting from 0). I would really appreciate it if you could help me understand what I am missing. Thanks!

e-bug commented 3 years ago

Hi, nowhere in the paper did we say that j is counted from 0. In fact, it is counted from 1 in Eq. (2). Regardless, as long as you count both j and p in the same way, you obtain the right values. That is, for your example:

(1,1): -((1-p_1)^2)/2 = -((1-2)^2)/2 = -1/2
(1,2): -((2-p_1)^2)/2 = -((2-2)^2)/2 = 0
(1,3): -((3-p_1)^2)/2 = -((3-2)^2)/2 = -1/2
(1,4): -((4-p_1)^2)/2 = -((4-2)^2)/2 = -4/2 = -2
(1,5): -((5-p_1)^2)/2 = -((5-2)^2)/2 = -9/2
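
For anyone who wants to check the two conventions numerically, here is a minimal sketch. It is not code from this repository: the helper name `exponent_row` is made up for illustration, and sigma = 1 is an assumption taken from the arithmetic in this thread. It computes only the exponent -(j - p_t)^2 / (2 sigma^2) for the first word, whose parent position is 2 when counted from 1.

```python
def exponent_row(parent_pos, length, start, sigma=1.0):
    """Exponents -(j - parent_pos)^2 / (2 * sigma^2) for j = start, ..., start + length - 1.

    Hypothetical helper for illustration only; sigma = 1 is assumed
    from the numbers used in this thread.
    """
    return [
        -((j - parent_pos) ** 2) / (2 * sigma ** 2)
        for j in range(start, start + length)
    ]


# Parent of the first word is at position 2 when positions are counted from 1.
# Consistent convention: j is also counted from 1.
consistent = exponent_row(parent_pos=2, length=5, start=1)

# Mixed convention from the question: p counted from 1, j counted from 0.
mixed = exponent_row(parent_pos=2, length=5, start=0)

# Column (0-based list offset) at which each row peaks.
peak_consistent = consistent.index(max(consistent))
peak_mixed = mixed.index(max(mixed))

print("consistent:", consistent, "peak at column offset", peak_consistent)
print("mixed:     ", mixed, "peak at column offset", peak_mixed)
```

With the consistent convention the row peaks at the second column ("monkey"). With the mixed convention the peak shifts one column to the right ("eats"), which is the off-by-one effect described in the question.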