Hello,
I am trying to reproduce your results for the Pascal layer, but the results I am getting are worse than those of the vanilla version.
I am trying to understand whether I missed something. These are the steps I took:
I parsed the WMT and the newstest sentences using udpipe.
I created the parent-scaled masks from the UD parses (according to the equation described in your paper).
I added the masks to the model input.
For each sentence, I multiplied the attention logits element-wise with the corresponding mask.
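For concreteness, here is a minimal sketch of how I build and apply the masks. The function names, the 0-based parent indexing, and the Gaussian form (a normal pdf centered at each token's head position, sigma = 1) are my reading of the paper's equation, not necessarily your exact implementation:

```python
import math

def parent_scaled_mask(parents, sigma=1.0):
    """Build an n x n parent-scaled mask for one sentence.

    parents[i] is the 0-based index of token i's dependency head
    (here the root points at itself). Entry (i, j) is a normal pdf
    centered at parents[i], evaluated at position j.
    """
    n = len(parents)
    norm = sigma * math.sqrt(2 * math.pi)
    mask = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(parents):
        for j in range(n):
            mask[i][j] = math.exp(-((j - p) ** 2) / (2 * sigma ** 2)) / norm
    return mask

def apply_mask(logits, mask):
    """Element-wise product of pre-softmax attention logits and the mask."""
    return [
        [l * m for l, m in zip(row_l, row_m)]
        for row_l, row_m in zip(logits, mask)
    ]
```

Each row of the mask peaks at the token's head position and decays with distance from it; I then multiply it into the attention logits before the softmax, as described above.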
I was wondering whether you followed the same procedure, whether you also saw a drop in performance at first, and if so, how you got past it.
Thanks!