Satan012 opened 5 years ago
I am also confused. The code uses `tf.reduce_sum(keys, axis=-1)` to compute the mask, where `keys` = positional encoding + embedding.
For padding positions the embedding is 0, but the positional encoding is not, so `keys != 0`.
So how can it recover the mask?
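A minimal numpy sketch of the problem (the toy embedding table and sizes are made up, and `np.sum` stands in for `tf.reduce_sum`): once a nonzero positional encoding is added, the feature sum at a padding position is no longer 0, so the zero-sum trick cannot detect padding.

```python
import numpy as np

# Hypothetical toy setup: token id 0 is PAD with an all-zero embedding row.
emb_table = np.array([[0., 0., 0., 0.],   # PAD -> all zeros
                      [1., 2., 3., 4.]], dtype=np.float32)
ids = np.array([[1, 0]])                  # second position is padding
emb = emb_table[ids]                      # shape (1, 2, 4)

# Any nonzero positional encoding breaks the zero-sum masking trick.
pos_enc = np.ones_like(emb)
keys = emb + pos_enc

mask_from_keys = np.sum(keys, axis=-1)    # [[14., 4.]] -- padding is NOT 0
mask_from_emb = np.sum(emb, axis=-1)      # [[10., 0.]] -- padding is 0
```

Computing the mask from the raw embeddings (before the positional encoding is added) still works, as the last line shows.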
@audier You may pass the non-position-encoded `self.enc` and `self.dec` as two extra arguments to `multihead_attention`, to be used as `_keymasks` and `_querymasks`, respectively.
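A hedged sketch of that suggestion in plain numpy (the signature and the names `multihead_attention`, `key_emb`, `query_emb` are hypothetical; the actual repo code differs): the masks are derived from the pre-positional-encoding embeddings, whose padding rows are all zeros.

```python
import numpy as np

def multihead_attention(queries, keys, query_emb, key_emb):
    """Single-head attention sketch with padding masks taken from the
    raw embeddings (before positional encoding), not from keys/queries."""
    # Padding rows of the raw embeddings are all zeros, so their
    # absolute sum is 0; sign() turns that into a 0/1 mask.
    key_masks = np.sign(np.abs(key_emb).sum(axis=-1))      # (N, T_k)
    query_masks = np.sign(np.abs(query_emb).sum(axis=-1))  # (N, T_q)

    scores = queries @ keys.transpose(0, 2, 1)             # (N, T_q, T_k)
    # Push masked key positions toward -inf before the softmax.
    scores = np.where(key_masks[:, None, :] == 0, -1e9, scores)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = weights @ keys                                   # (N, T_q, d)
    # Zero out the outputs at padded query positions.
    return out * query_masks[:, :, None]
```

The point is only where the masks come from; the rest (projections, multiple heads, scaling) is omitted.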
@gitfourteen Thank you for your reply. I realized this is a defect in the project; it performs better after I added the arguments.
The padding part is redundant and should not be included when calculating the loss.
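A minimal sketch of excluding padding from the loss (the per-token loss values and mask here are made up): multiply the per-token losses by the padding mask, then average over real tokens only.

```python
import numpy as np

per_token_loss = np.array([[0.5, 1.2, 9.9]])  # last position is padding
pad_mask = np.array([[1., 1., 0.]])           # 0 marks padding

masked = per_token_loss * pad_mask
# Average over real tokens only: (0.5 + 1.2) / 2
mean_loss = masked.sum() / pad_mask.sum()
```

Without the mask, the (meaningless) loss at padded positions would pull the average and the gradients in an arbitrary direction.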