This PR allows setting the first BERT layer independently of the rest of the model's layers, e.g., a PreNorm layer followed by Parallel Attention layers. `flex-bert-rope-parallel-firstprenorm.yaml` is an example config for this feature.
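A minimal sketch of what such a config could look like. The key names `initial_bert_layer` and `bert_layer` are assumptions for illustration only; the checked-in `flex-bert-rope-parallel-firstprenorm.yaml` is the authoritative example:

```yaml
model:
  model_config:
    # Hypothetical keys: override only the first layer's type,
    # leaving the remaining layers on the default block.
    initial_bert_layer: prenorm    # layer 0: PreNorm block
    bert_layer: parallel_prenorm   # layers 1..N: parallel attention blocks
```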
This PR also adds the ability to turn off the duplicate PreNorm norm when `embed_norm=True` via `skip_first_prenorm`. The embedding norm already normalizes the residual stream, so we would want `embed_norm` turned on and the first layer's prenorm normalization turned off to avoid normalizing the same activations twice.
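Roughly, the two flags would be combined like this (a sketch; the flag names come from this PR, but their placement under the model config is an assumption):

```yaml
model:
  model_config:
    embed_norm: true          # normalize embeddings, which also normalizes the residual stream
    skip_first_prenorm: true  # drop layer 0's PreNorm norm, since embed_norm already covers it
```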