Kyubyong / transformer

A TensorFlow Implementation of the Transformer: Attention Is All You Need
Apache License 2.0

Why 6 blocks of multi-head attention in the encoder and decoder? #129

Open xus-stack opened 5 years ago

xus-stack commented 5 years ago

Has the choice of 6 blocks been proven to enhance performance?
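For context, the "6" comes from the N = 6 setting in "Attention Is All You Need": both the encoder and the decoder are a stack of N identical blocks, and N is simply a hyperparameter you can tune. The sketch below (not this repo's actual code; the names NUM_BLOCKS, EncoderBlock, etc. are illustrative, and it uses the TF 2 Keras API rather than the TF 1.x code in this repo) shows how the block count is just the bound of a stacking loop:

```python
import tensorflow as tf

# Hypothetical hyperparameters mirroring the paper's base model.
NUM_BLOCKS = 6              # the "N = 6" from the paper
D_MODEL = 512
NUM_HEADS = 8
D_FF = 2048

class EncoderBlock(tf.keras.layers.Layer):
    """One encoder block: multi-head self-attention + position-wise
    feed-forward, each with a residual connection and layer norm."""
    def __init__(self):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=NUM_HEADS, key_dim=D_MODEL // NUM_HEADS)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(D_FF, activation="relu"),
            tf.keras.layers.Dense(D_MODEL),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        x = self.norm1(x + self.mha(x, x))   # residual + layer norm
        x = self.norm2(x + self.ffn(x))      # residual + layer norm
        return x

# The number of blocks is just a loop bound: 6 identical blocks are
# stacked, each refining the previous block's representation.
encoder = tf.keras.Sequential([EncoderBlock() for _ in range(NUM_BLOCKS)])

x = tf.random.normal([2, 10, D_MODEL])       # (batch, seq_len, d_model)
print(encoder(x).shape)                      # (2, 10, 512)
```

Changing NUM_BLOCKS trades model capacity against compute and memory, so the answer to "why 6" is ultimately empirical rather than theoretical.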