AnswerDotAI / bert24

Apache License 2.0

Set initial layers independently from the rest of the model #64

Closed warner-benjamin closed 2 weeks ago

warner-benjamin commented 2 weeks ago

This PR allows setting the first BERT layer independently of the rest of the model layers, e.g. a PreNorm layer followed by Parallel Attention layers. flex-bert-rope-parallel-firstprenorm.yaml is an example config using this feature.
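A minimal sketch of what such a config might look like, for the PreNorm-first / Parallel-Attention-rest case. The key names below are illustrative assumptions, not the actual FlexBERT schema; see flex-bert-rope-parallel-firstprenorm.yaml in the PR for the real keys.

```yaml
# Illustrative sketch only -- key names are assumptions about the config layout,
# not the actual schema; consult flex-bert-rope-parallel-firstprenorm.yaml.
model:
  name: flex_bert
  model_config:
    bert_layer: parallel_prenorm   # layer type used from the second layer onward
    initial_bert_layer: prenorm    # first layer uses a plain PreNorm block instead
```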

This PR also adds the ability to turn off the duplicate PreNorm normalization when embed_norm=True via skip_first_prenorm. The embedding norm also normalizes the residuals, which is why we would want to enable it and disable the first PreNorm layer's normalization.
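A sketch of how these two flags might be combined. embed_norm and skip_first_prenorm are named in the PR description; the surrounding structure is an assumption about the config layout.

```yaml
# embed_norm and skip_first_prenorm come from the PR description; the nesting
# around them is assumed, not taken from the repo.
model:
  model_config:
    embed_norm: true          # normalize embeddings (and thus the residual stream)
    skip_first_prenorm: true  # drop the now-redundant PreNorm in the first layer
```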

warner-benjamin commented 2 weeks ago

Now any number of initial layers can be set to a different type.
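A hypothetical sketch of what configuring several initial layers could look like; num_initial_layers and initial_bert_layer are illustrative names only, not confirmed config keys.

```yaml
# Hypothetical sketch: key names are illustrative, not confirmed by the PR.
model:
  model_config:
    bert_layer: parallel_prenorm   # type used for the remaining layers
    initial_bert_layer: prenorm    # type used for the leading layers
    num_initial_layers: 2          # how many leading layers use the initial type
```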