Hello,
I am trying to reproduce your results for the Pascal layer, but the results I am getting are worse than those of the vanilla version.
I am trying to understand whether I missed something. These are the steps I took:
I parsed the WMT and the newstest sentences using udpipe.
I created the parent-scaled masks from the UD parses (according to the equation described in your paper).
I added the masks to the model input.
For each sentence, I multiplied the attention logits element-wise with the corresponding mask.
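For concreteness, here is a minimal sketch of how I build and apply the masks. The function names, the 0-based parent indexing, and the Gaussian form (a normal pdf centered at each token's head position, sigma = 1) are my reading of the paper's equation, not necessarily your exact implementation:

```python
import math

def parent_scaled_mask(parents, sigma=1.0):
    """Build an n x n parent-scaled mask for one sentence.

    parents[i] is the 0-based index of token i's dependency head
    (here the root points at itself). Entry (i, j) is a normal pdf
    centered at parents[i], evaluated at position j.
    """
    n = len(parents)
    norm = sigma * math.sqrt(2 * math.pi)
    mask = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(parents):
        for j in range(n):
            mask[i][j] = math.exp(-((j - p) ** 2) / (2 * sigma ** 2)) / norm
    return mask

def apply_mask(logits, mask):
    """Element-wise product of pre-softmax attention logits and the mask."""
    return [
        [l * m for l, m in zip(row_l, row_m)]
        for row_l, row_m in zip(logits, mask)
    ]
```

Each row of the mask peaks at the token's head position and decays with distance from it; I then multiply it into the attention logits before the softmax, as described above.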
I was wondering whether you followed the same procedure, whether you also saw a drop in performance at first, and if so, how you got past it.
Thanks!