I should note that I haven't actually built any graphs with this change. Before merging we should have a look at the graphs when the node layers include ["ln1", "ln1_out",
We should also probably include a ln
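For concreteness, here's a hypothetical sketch of such a node-layers setting; the exact layer names and config shape are illustrative assumptions, not taken from the repo:

```python
# Hypothetical node layers for building a graph that treats the two halves
# of the split layer norm ("ln1" and "ln1_out") as separate nodes. The
# surrounding layer names are illustrative assumptions.
node_layers = ["ln1", "ln1_out", "attn", "ln2", "ln2_out", "mlp_in", "mlp_out"]
```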
Comparison of rib build graphs on the last layer of Pythia. These are in the (1-0) basis and non-centered, so not the most reliable graphs.
Without splitting layer norm: [graph image]
With split layer norm: [graph image]
I think we mostly see what we expect here, with a few exceptions. But again, this is a somewhat mistaken basis.
See this Slack thread for some further experiments: https://apolloresearchhq.slack.com/archives/C06484S5UF9/p1706176556896229
Split layernorm into two sequential modules
Description
Related Issue
Closes #299
How Has This Been Tested?
The same tests pass. Notably, the sequential model with the new layer norm layers produces the same per-module outputs as TransformerLens for various models. Also, folding in the bias does not affect the output.
Does this PR introduce a breaking change?
No, since we kept the ln1, ln2, and ln_final names around, even though they no longer really describe what the layer does (it calculates the variance).
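For readers unfamiliar with the change, here is a minimal sketch of what splitting a layer norm into two sequential modules can look like. It assumes the first module centers the input and computes the variance-based scale, while the second applies that scale together with the (foldable) affine parameters; the actual module boundaries and names in this repo may differ.

```python
import torch
from torch import nn


class LayerNormVar(nn.Module):
    """First half: centers the input and computes the normalization scale
    from the variance, the only nonlinear part of layer norm."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        x_centered = x - x.mean(dim=-1, keepdim=True)
        scale = (x_centered.pow(2).mean(dim=-1, keepdim=True) + self.eps).rsqrt()
        return x_centered, scale


class LayerNormOut(nn.Module):
    """Second half: applies the scale and the affine parameters, which can
    later be folded into adjacent weights."""

    def __init__(self, d_model: int):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x_centered: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return x_centered * scale * self.weight + self.bias


# Check that the two halves compose to a standard LayerNorm, mirroring the
# per-module output equality described under "How Has This Been Tested?".
d_model = 8
x = torch.randn(2, 4, d_model)
ln_var, ln_out = LayerNormVar(), LayerNormOut(d_model)
reference = nn.LayerNorm(d_model)
with torch.no_grad():
    ln_out.weight.copy_(reference.weight)
    ln_out.bias.copy_(reference.bias)
assert torch.allclose(ln_out(*ln_var(x)), reference(x), atol=1e-6)
```

Isolating the variance computation in the first module leaves the second module a simple elementwise scale-and-shift, which is what makes folding the bias into adjacent weights possible without changing the output.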