THUNLP-MT / THUMT

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
BSD 3-Clause "New" or "Revised" License

Implementation Question: why scale the embeddings by the square root of the hidden size? #42

Closed: Epsilon-Lee closed this issue 6 years ago

Epsilon-Lee commented 6 years ago

Does this influence the final performance? I haven't seen it in Google's implementation. Many thanks!

if params.multiply_embedding_mode == "sqrt_depth":
    inputs = inputs * (hidden_size ** 0.5)
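For context, here is a minimal, self-contained sketch of where this scaling sits in a standard Transformer input pipeline (illustrative only, not THUMT's actual code; embed_table, sinusoidal_pe, and the toy sizes are assumptions):

import numpy as np

d_model = 512        # hidden size (assumed for illustration)
vocab_size = 1000    # toy vocabulary size (assumed)

# Embedding table with small initial values (std = 1/sqrt(d_model)),
# as is typical when the table is shared with the output softmax.
rng = np.random.default_rng(0)
embed_table = rng.normal(0.0, d_model ** -0.5, (vocab_size, d_model))

def sinusoidal_pe(length, depth):
    """Standard sinusoidal positional encoding; entries lie in [-1, 1]."""
    pos = np.arange(length)[:, None]
    i = np.arange(depth)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / depth)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

tokens = np.array([3, 17, 42])
inputs = embed_table[tokens]

# The scaling in question: without it, the small embedding values
# would be drowned out by the positional encodings added right after.
inputs = inputs * (d_model ** 0.5)
inputs = inputs + sinusoidal_pe(len(tokens), d_model)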
Glaceon31 commented 6 years ago

Hi, Google's implementation does include this; see modalities.py.

Epsilon-Lee commented 6 years ago

Thanks for pointing that out; I missed that detail! BTW, do you have any insight/intuition about why the embeddings are scaled? Many thanks.

Glaceon31 commented 6 years ago

My guess is that scaling is simply empirically better than no scaling. It would be best to ask the Tensor2Tensor developers for an authoritative explanation.
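FWIW, one common intuition (just a guess here too; the Tensor2Tensor developers would know for sure): the embedding table is typically initialized with standard deviation 1/sqrt(d_model), so each entry is small, and multiplying by sqrt(d_model) brings the embeddings back to roughly unit scale, comparable to the sinusoidal positional encodings (entries in [-1, 1]) that are added immediately afterwards. A quick toy check of that magnitude argument (names and sizes are illustrative assumptions):

import numpy as np

d_model = 512
rng = np.random.default_rng(0)
emb = rng.normal(0.0, d_model ** -0.5, (1000, d_model))  # toy embedding table

print(np.abs(emb).mean())                   # ~0.035: tiny next to PE values in [-1, 1]
print(np.abs(emb * d_model ** 0.5).mean())  # ~0.8: comparable to PE magnitude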

Epsilon-Lee commented 6 years ago

Thanks a lot, I will do that.