codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation

The LayerNorm implementation #30

Closed · egg-west closed this issue 5 years ago

egg-west commented 5 years ago

I am wondering why you don't use the standard nn.LayerNorm. I notice the difference is in the denominator: nn.LayerNorm uses sqrt(variance + epsilon), whereas this implementation uses (standard deviation + epsilon).

Could you clarify the difference between these two approaches?
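
To make the comparison concrete, here is a minimal sketch of the two denominators; the epsilon values are just assumed defaults, not taken from this repo:

```python
import torch

x = torch.randn(4)
mean = x.mean(-1, keepdim=True)

# Annotated Transformer style: epsilon is added to the standard deviation
out_std = (x - mean) / (x.std(-1, keepdim=True) + 1e-6)

# nn.LayerNorm style: epsilon is added to the variance, inside the sqrt
out_var = (x - mean) / torch.sqrt(x.var(-1, unbiased=False, keepdim=True) + 1e-5)
```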

codertimo commented 5 years ago

@egg-west Well, the reason I used this layer norm is that the Annotated Transformer implementation of "Attention Is All You Need" used this code, and I just copied it from there. So if anyone can answer this question, that would be seriously awesome.

briandw commented 5 years ago

I believe they should do similar things; however, there is a difference in implementation.

For a given input x = torch.tensor([1., 0., 0., 0.]), the Annotated Transformer version gives the output tensor([ 1.5000, -0.5000, -0.5000, -0.5000], grad_fn=<ThAddBackward>).

torch.nn.LayerNorm, on the other hand, gives tensor([ 1.7320, -0.5773, -0.5773, -0.5773], grad_fn=<AddcmulBackward>).
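
Here is a minimal sketch that reproduces both outputs; the class below follows the Annotated Transformer's LayerNorm definition, and the name AnnotatedLayerNorm is just a label for this example:

```python
import torch
import torch.nn as nn

# Sketch of the Annotated Transformer's LayerNorm: epsilon is added to the
# standard deviation, and Tensor.std() is unbiased (divides by n - 1).
class AnnotatedLayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

x = torch.tensor([1., 0., 0., 0.])
print(AnnotatedLayerNorm(4)(x))  # ~ [ 1.5000, -0.5000, -0.5000, -0.5000]
print(nn.LayerNorm(4)(x))        # ~ [ 1.7320, -0.5773, -0.5773, -0.5773]
```

Most of the numerical gap comes from .std() dividing by n - 1 while nn.LayerNorm uses the biased variance (dividing by n); where the epsilon is added contributes very little for inputs of this scale.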

The layer_norm implementation in PyTorch is here: https://github.com/pytorch/pytorch/blob/cca247635c6edb323176eeac7a18d3e9ab71c558/caffe2/python/helpers/normalization.py

codertimo commented 5 years ago

@egg-west Is your question solved? 👍

egg-west commented 5 years ago

Thank you for the clarification. I guess pulling the epsilon out of the sqrt may speed up the computation. But yes, they do essentially the same thing.