This PR allows setting the first BERT layer independently of the rest of the model's layers, e.g., a PreNorm layer followed by Parallel Attention layers. `flex-bert-rope-parallel-firstprenorm.yaml` is an example config for this feature.
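A minimal sketch of what such a config could look like. The key names `initial_bert_layer` and `bert_layer` are assumptions for illustration only; the checked-in `flex-bert-rope-parallel-firstprenorm.yaml` is the authoritative example:

```yaml
model:
  model_config:
    # Hypothetical keys: override only the first layer's type,
    # leaving the remaining layers on the default block.
    initial_bert_layer: prenorm    # layer 0: PreNorm block
    bert_layer: parallel_prenorm   # layers 1..N: parallel attention blocks
```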
This PR also adds the ability to turn off the duplicate PreNorm norm when `embed_norm=True` via `skip_first_prenorm`. The embedding norm already normalizes the residual stream, so we would want `embed_norm` turned on and the first layer's prenorm normalization turned off to avoid normalizing the same activations twice.
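Roughly, the two flags would be combined like this (a sketch; the flag names come from this PR, but their placement under the model config is an assumption):

```yaml
model:
  model_config:
    embed_norm: true          # normalize embeddings, which also normalizes the residual stream
    skip_first_prenorm: true  # drop layer 0's PreNorm norm, since embed_norm already covers it
```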