egg-west closed this issue 5 years ago
@egg-west Well, the reason I used this layer norm is that the "Attention Is All You Need" implementation, The Annotated Transformer, used this code, and I just copied it from there. So if anyone can answer this question, that would be seriously awesome.
I believe they should do similar things; however, there is a difference in implementation.
For a given input:
x = torch.tensor([1.,0.,0.,0.])
The Annotated Transformer version gives the output:
tensor([ 1.5000, -0.5000, -0.5000, -0.5000], grad_fn=<ThAddBackward>)
While torch.nn.LayerNorm gives:
tensor([ 1.7320, -0.5773, -0.5773, -0.5773], grad_fn=<AddcmulBackward>)
The layer_norm implementation in PyTorch is here: https://github.com/pytorch/pytorch/blob/cca247635c6edb323176eeac7a18d3e9ab71c558/caffe2/python/helpers/normalization.py
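For what it's worth, a small self-contained sketch (assuming a recent PyTorch; the Annotated Transformer module is paraphrased down to its normalization step, without the learnable affine parameters) reproduces both outputs:

```python
import torch
import torch.nn as nn

x = torch.tensor([1., 0., 0., 0.])
eps = 1e-6

# Annotated-Transformer-style normalization: (x - mean) / (std + eps).
# Note: Tensor.std() defaults to the unbiased estimator (divides by N - 1),
# which is the main reason these numbers differ from nn.LayerNorm below.
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
annotated = (x - mean) / (std + eps)
print(annotated)  # tensor([ 1.5000, -0.5000, -0.5000, -0.5000])

# nn.LayerNorm: (x - mean) / sqrt(biased_var + eps), with learnable affine
# parameters initialized to weight=1, bias=0 (so they are no-ops here).
layer_norm = nn.LayerNorm(4, eps=eps)
print(layer_norm(x))  # tensor([ 1.7320, -0.5773, -0.5773, -0.5773], grad_fn=...)
```

So on top of the epsilon placement, the biased vs. unbiased variance estimate accounts for the 1.5 vs. 1.732 gap in the printed tensors.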
@egg-west Is your question solved? 👍
Thank you for the clarification. I guess pulling the epsilon out of the sqrt may speed up the computation. But yes, they did the same thing.
I am wondering why you don't use the standard nn version of LayerNorm. I notice the difference is in the denominator: nn.LayerNorm uses {sqrt of (variance + epsilon)} rather than {standard deviation + epsilon}.
Could you clarify the difference between these two approaches?
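To illustrate the two denominators: they agree almost everywhere, but diverge when the variance is tiny. A minimal sketch in plain Python (the variance values are chosen just for illustration):

```python
import math

eps = 1e-6

# Typical case: variance is much larger than eps, so both denominators
# are essentially equal and the two normalizations match.
var = 0.25
print(math.sqrt(var + eps))  # ~0.500001
print(math.sqrt(var) + eps)  # ~0.500001

# Near-constant input: variance is tiny, and the denominators differ by
# roughly an order of magnitude, so the normalized outputs differ too.
var = 1e-8
print(math.sqrt(var + eps))  # ~1.005e-3 (eps dominates inside the sqrt)
print(math.sqrt(var) + eps)  # ~1.01e-4  (eps barely matters outside)
```

Putting epsilon inside the sqrt, as nn.LayerNorm does, keeps the denominator bounded away from zero more aggressively, which matters for numerical stability on near-constant inputs.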