danbraunai-apollo commented 6 months ago

Split layernorm into two sequential modules

Description

Split layer norm into two sequential modules: a variance calculation, followed by a layer norm that takes the variance as given.
We keep ln1, ln2, and ln_final module_ids, but these now correspond to the variance calculation (computed in Variance and DualVariance).
We add ln1_out, ln2_out, and ln_final_out module_ids, which now correspond to LayerNormPre and DualLayerNormPre.
This removes the Folded layernorm modules and adds an _exclude_final_dim flag to the non-folded Variance and LayerNormPre classes to account for the folded calculation. We could use less code with a parent class FoldedModule that just implements this exclude_final_dim method, plus maybe a function for getting the residual with or without the final dim. That seems like a weird type of class to me, because all it does is toggle a flag, but maybe it's better.
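A minimal sketch of the two-stage split, written in dependency-free Python rather than the repo's actual modules. The function names, the `EPS` value, and the way `exclude_final_dim` is handled are illustrative assumptions, not the real Variance / LayerNormPre implementations; the sketch assumes that under bias folding the final position of the activation is a constant that must be left out of the statistics and passed through unchanged.

```python
import math

EPS = 1e-5  # assumed epsilon; the real modules may use a different value


def variance(x, exclude_final_dim=False):
    """Stage 1 (hypothetical stand-in for Variance): variance of the activations."""
    vals = x[:-1] if exclude_final_dim else x
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def layer_norm_pre(x, var, exclude_final_dim=False):
    """Stage 2 (hypothetical stand-in for LayerNormPre): centre and scale,
    using a variance computed by the previous stage."""
    vals = x[:-1] if exclude_final_dim else x
    mean = sum(vals) / len(vals)
    scale = math.sqrt(var + EPS)
    out = [(v - mean) / scale for v in vals]
    if exclude_final_dim:
        out.append(x[-1])  # pass the folded constant through untouched
    return out


def layer_norm_single_step(x):
    """Reference: the same computation done in one module."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + EPS) for v in x]


x = [1.0, 2.0, 4.0]
split_out = layer_norm_pre(x, variance(x))

# Folded case: a constant 1.0 is appended and excluded from the statistics.
x_folded = x + [1.0]
folded_out = layer_norm_pre(
    x_folded, variance(x_folded, exclude_final_dim=True), exclude_final_dim=True
)
```

Under these assumptions, `split_out` matches `layer_norm_single_step(x)`, and `folded_out` is the same vector with the constant `1.0` still in its final position.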

Related Issue

Closes #299

How Has This Been Tested?

The existing tests still pass. Notably, the sequential model with the new layer norm layers gets the same per-module outputs as TransformerLens for various models. Also, folding the bias does not affect the output.
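For readers unfamiliar with bias folding, here is a rough sketch of why it should leave outputs unchanged. It assumes the usual construction (append a constant 1 to the input and absorb the bias into an extra weight column, with an extra row that carries the constant forward); the helper names are hypothetical and this is not the repo's code.

```python
def linear(weight, bias, x):
    """Ordinary linear layer: y = W x + b (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fold_bias(weight, bias):
    """Absorb the bias into the weight matrix (assumed folding scheme).

    Each row gains the bias as a final column, and an extra row
    [0, ..., 0, 1] propagates the constant 1 to the next layer.
    """
    n_in = len(weight[0])
    folded = [row + [b] for row, b in zip(weight, bias)]
    folded.append([0.0] * n_in + [1.0])
    return folded


def linear_folded(folded_weight, x_with_one):
    """Bias-free linear layer acting on an input with a trailing constant 1."""
    return [sum(w * v for w, v in zip(row, x_with_one)) for row in folded_weight]


weight = [[1.0, 2.0], [3.0, 4.0]]
bias = [0.5, -1.0]
x = [2.0, -1.0]

y = linear(weight, bias, x)
y_folded = linear_folded(fold_bias(weight, bias), x + [1.0])
```

Here `y_folded[:-1]` agrees with `y`, and the trailing entry of `y_folded` is the constant 1, which is the dimension a flag like _exclude_final_dim would need to skip in the layer norm statistics.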

Does this PR introduce a breaking change?

No, since we kept the ln1, ln2, and ln_final names, even though they no longer describe what the layer does (it now calculates the variance).

danbraunai-apollo commented 6 months ago

I should note that I haven't actually built any graphs with this change. Before merging, we should look at the graphs when the node layers include ["ln1", "ln1_out", "ln2", "ln2_out"] to see if this change does what we're hoping it does.

We should also probably include an ln_out node layer in one of our tests to show that we can actually build a graph at this layer.

nix-apollo commented 6 months ago

Comparison of rib build graphs on the last layer of pythia. These use the (1-0) basis and are non-centered, so they are not the most reliable graphs.

Without splitting layer norm: [image: tinystories-whole-ln_rib_graph]

With split layer norm: [image: tinystories-split-ln_rib_graph]

I think we mostly see what we expect here, although:

- The variance seems to be spread across a couple of rib directions.
- The non-variance directions are somewhat messy. You'd hope they could be pretty much exactly 1-1.

But again, this is a somewhat mistaken basis.

nix-apollo commented 6 months ago

See this slack thread for some further experiments: https://apolloresearchhq.slack.com/archives/C06484S5UF9/p1706176556896229
