Kyubyong / transformer

A TensorFlow Implementation of the Transformer: Attention Is All You Need
Apache License 2.0

When you calculate the loss, you may not have considered the padding part #63

Open Satan012 opened 5 years ago

Satan012 commented 5 years ago

The padding part is redundant and should not be included when calculating the loss.
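
For illustration, here is a minimal TF 1.x-style sketch of excluding padded positions from the loss; the tensor names, pad id 0, and vocabulary size are assumptions for the example, not this project's exact code:

```python
import tensorflow as tf

# Illustrative shapes: (batch, seq_len, vocab) logits and (batch, seq_len) labels,
# with id 0 assumed to be the <pad> token.
logits = tf.placeholder(tf.float32, [None, None, 32000])
labels = tf.placeholder(tf.int32, [None, None])

# Cross-entropy per token: shape (batch, seq_len).
per_token_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)

# 1.0 for real tokens, 0.0 for padding.
nonpad = tf.to_float(tf.not_equal(labels, 0))

# Average only over non-padded tokens.
mean_loss = tf.reduce_sum(per_token_loss * nonpad) / (tf.reduce_sum(nonpad) + 1e-8)
```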

hongwen-sun commented 5 years ago

I am also confused. The code uses tf.reduce_sum(keys, axis=-1) to calculate the mask, where keys = positional encoding + embedding. For a padded position, the positional encoding != 0 while the embedding = 0, so keys != 0. How can it get the mask?
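
A small sketch of that behaviour with toy tensors (shapes and values are purely illustrative): once the positional encoding is added, the padded position is no longer all-zero, so a mask derived from the summed keys never detects it.

```python
import tensorflow as tf

emb = tf.constant([[[0.5, 0.5], [0.0, 0.0]]])   # (1, 2, 2); second position is <pad>
pos = tf.constant([[[0.1, 0.2], [0.3, 0.4]]])   # positional encoding is non-zero everywhere
keys = emb + pos

# Mask computed from the position-encoded keys: -> [[1., 1.]], padding is missed.
mask_from_keys = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))

# Mask computed from the raw embeddings: -> [[1., 0.]], padding is found.
mask_from_emb = tf.sign(tf.abs(tf.reduce_sum(emb, axis=-1)))
```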

gitfourteen commented 5 years ago

@audier You may pass the non-position-encoded self.enc and self.dec as two more arguments to multihead_attention, so that _keymasks and _querymasks are computed from them separately.
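
A rough sketch of this suggestion with a hypothetical signature; the extra query_emb/key_emb arguments and the mask computation shown are assumptions about how the fix could be wired in, not the project's actual code:

```python
import tensorflow as tf

def multihead_attention(queries, keys, query_emb, key_emb, num_heads=8):
    """query_emb / key_emb are the embeddings *before* positional encoding;
    here they are used only to recover the padding masks."""
    key_masks = tf.sign(tf.abs(tf.reduce_sum(key_emb, axis=-1)))      # (batch, T_k)
    query_masks = tf.sign(tf.abs(tf.reduce_sum(query_emb, axis=-1)))  # (batch, T_q)
    # ... the rest of the attention computation would use key_masks / query_masks
    #     to block attention to and from padded positions.
    return key_masks, query_masks
```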

hongwen-sun commented 5 years ago

@gitfourteen Thank you for your reply. I realized this is a defect in this project; it performs better after I added the arguments.