It appears that to pass input to the LayerNorm, the tensor is reshaped into a 2D matrix (feature_size x (sequence_length x batch_size)), then reshaped back after all the norm layers are done operating? I think this happens in multiple places (e.g. the @toNd macro).
That's right, but the LayerNorm itself doesn't need the input reshaped. However, Flux.Dense doesn't accept inputs with more than 2 dimensions, so those reshapes are there for Dense, not for LayerNorm.
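For illustration, here is a minimal sketch of that reshape-around-Dense pattern (with made-up sizes, assuming a current Flux where `Dense(4 => 8)` is the constructor; this is not the actual @toNd code):

```julia
using Flux

d = Dense(4 => 8)                      # feature_size 4 -> 8
x = randn(Float32, 4, 10, 2)           # (features, sequence_length, batch)

x2 = reshape(x, size(x, 1), :)         # flatten trailing dims: (4, 20)
y2 = d(x2)                             # Dense sees a plain 2D matrix: (8, 20)
y  = reshape(y2, :, size(x, 2), size(x, 3))  # restore shape: (8, 10, 2)
```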
> It appears that to pass input to the LayerNorm, the tensor is reshaped into a 2D matrix (feature_size x (sequence_length x batch_size)), then reshaped back after all the norm layers are done operating? I think this happens in multiple places (e.g. the @toNd macro).

https://github.com/chengchingwen/Transformers.jl/blob/fbc8bb3582189d770778377a96994685d2c0b41c/src/basic/transformer.jl#L61
Based on a recent Zulip topic, I think this isn't required due to Julia's broadcasting machinery.
As seen above, the LayerNorm body applied to a 3D tensor and to the reshaped 2D tensor results in the same output.
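To make that concrete, here is a small sketch using a hand-rolled LayerNorm-style normalization over `dims = 1` (not Flux's actual layer, which also carries a learnable scale and bias), showing that the two paths agree:

```julia
using Statistics

# LayerNorm-style normalization over the feature dimension (dims = 1);
# broadcasting treats any trailing dimensions the same way.
lnorm(x; ϵ = 1f-5) = (x .- mean(x; dims = 1)) ./ (std(x; dims = 1) .+ ϵ)

x  = randn(Float32, 4, 10, 2)                    # (features, seq_len, batch)
y3 = lnorm(x)                                    # applied directly to the 3D tensor
y2 = reshape(lnorm(reshape(x, 4, :)), 4, 10, 2)  # reshape to 2D, normalize, reshape back

@assert y3 ≈ y2    # same output either way
```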