In the first attention block, you compute hidden features by convolving the state outputs with a filter AttnW — I believe that corresponds to the term W h_k.
But you also pass the state outputs through a linear layer to get y, then add y to the hidden features, apply a tanh, and multiply the result by a matrix.
In the paper, I only see tanh(W h_k).
Also, your code contains AttnV, for which I can't find a corresponding description in the paper.
The paper only has gate V and gate W.
Could you kindly explain this?
I am really confused.
Thank you!
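For reference, the computation I described looks to me like standard additive (Bahdanau-style) attention, score_k = v^T tanh(W h_k + U s). Below is a minimal NumPy sketch of that formula; the names W, U, and v are my own, and the mapping of AttnW to W, the linear layer to U, and AttnV to v is only my guess at what your code is doing:

```python
import numpy as np

def additive_attention(h, s, W, U, v):
    """Bahdanau-style additive attention: score_k = v^T tanh(W h_k + U s)."""
    proj_h = h @ W.T                      # W h_k: "hidden features" per step (T, d_a)
    proj_s = s @ U.T                      # U s: linear layer on the state (d_a,)
    scores = np.tanh(proj_h + proj_s) @ v  # add, tanh, multiply by v (AttnV?)
    e = np.exp(scores - scores.max())     # softmax over time steps
    return e / e.sum()

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 8, 8, 16
h = rng.standard_normal((T, d_h))        # encoder/state outputs
s = rng.standard_normal(d_s)             # query state
W = rng.standard_normal((d_a, d_h))
U = rng.standard_normal((d_a, d_s))
v = rng.standard_normal(d_a)
alpha = additive_attention(h, s, W, U, v)
print(alpha.shape)                       # (5,) — one weight per time step
```

If that is indeed the intent, then AttnV would play the role of the projection vector v, which the paper's tanh(W h_k) notation omits.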