JesusPinedaC closed this pull request 6 months ago
@JesusPinedaC Is there an activation in there too?
Not explicitly.
This block is used to define the two transformer encoder layer sub-modules.
The first comprises multihead attention (the layer, in this case) followed by dropout, a skip connection, and normalization. There is no activation in any of these operations.
The second module is the feedforward one. Here (as defined in the original paper), the layer is an MLP consisting of two dense layers with a ReLU activation in between. Dropout, skip, and normalization are then applied in the same way as in the attention module.
In short, "layer" is where the core processing, including the non-linearities, is carried out.
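To make the structure above concrete, here is a minimal sketch of one encoder sub-module: layer, then dropout, then skip, then normalization. All names and the toy list-based "layers" are hypothetical stand-ins; in the real block, the first "layer" is multihead attention and the second is the two-dense-layer MLP with a ReLU in between.

```python
import random

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (toy LayerNorm, no affine params).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def dropout(x, p=0.0):
    # p = 0 at evaluation time; included only to mark its place in the order.
    return [0.0 if random.random() < p else v / (1 - p) for v in x]

def sublayer(x, layer, p=0.0):
    # layer -> dropout -> skip connection -> normalization
    y = dropout(layer(x), p)
    return layer_norm([a + b for a, b in zip(x, y)])

def relu(x):
    return [max(0.0, v) for v in x]

def mlp(x):
    # Toy stand-in for the feedforward "layer": dense -> ReLU -> dense.
    hidden = relu([2.0 * v for v in x])
    return [0.5 * v for v in hidden]

x = [1.0, -2.0, 3.0]
out = sublayer(x, mlp)  # feedforward sub-module, dropout disabled
```

Note that the only non-linearity lives inside `mlp`; the dropout/skip/normalization wrapper is activation-free, matching the description above.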
Should we name "layer" differently to make this clearer?
Ok! Makes sense. We might want to differentiate the naming, but I'm not sure what we would call it instead. @giovannivolpe any input? If we call the slots for learnable modules like Conv2d "layer", then should we have another name for general modules that can be "layer" + "activation"?
Actually, classically, a layer would be the learnable part plus the non-linearity, not just the learnable part. So maybe we should instead rename single learnable modules?
@JesusPinedaC @BenjaminMidtvedt I think it's reasonable to call a structure with learnable + activation a "layer". I think we can also keep calling single learnable modules "layer" too.
Ok, then it's good to merge for me.
This pull request introduces the LayerDropoutSkipNormalization block for Transformer models. The block is flexible and allows the order of its components to be changed easily. It also supports both Tensor and dictionary inputs.
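A rough sketch of the flexible-ordering idea, assuming nothing about the real implementation: the class name matches the PR, but this constructor, the `order` parameter, and the dict convention (a processed `"x"` entry with other keys passed through) are all hypothetical illustrations, not the merged API.

```python
class LayerDropoutSkipNormalization:
    """Toy block: applies layer/dropout/skip/normalization in a configurable order."""

    def __init__(self, layer, dropout, normalization,
                 order=("layer", "dropout", "skip", "normalization")):
        self.ops = {"layer": layer, "dropout": dropout, "normalization": normalization}
        self.order = order

    def __call__(self, x):
        # Dictionary inputs: process the "x" entry, pass the other keys through.
        if isinstance(x, dict):
            out = dict(x)
            out["x"] = self(x["x"])
            return out
        y = x
        for name in self.order:
            if name == "skip":
                y = [a + b for a, b in zip(x, y)]  # residual connection to the input
            else:
                y = self.ops[name](y)
        return y

# Usage with identity dropout/normalization (evaluation-time behavior):
block = LayerDropoutSkipNormalization(
    layer=lambda v: [2 * u for u in v],
    dropout=lambda v: v,
    normalization=lambda v: v,
)
vec_out = block([1.0, 2.0])                    # layer, dropout, skip, norm
dict_out = block({"x": [1.0], "mask": 1})      # dict input, "mask" passed through
```

Reordering is then just a matter of passing a different `order` tuple, which is the flexibility the description refers to.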