callummcdougall / ARENA_2.0

Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.

The claim about splitting a circuit across two heads #25

Open Mihonarium opened 2 weeks ago

Mihonarium commented 2 weeks ago

I first pointed this out on Slack back in 2023, but it still doesn't seem to have been fixed.

There's this claim in the intro to mech interp material: [screenshot of the claim attached]

The OV circuits being rank-64 is not the full reason the circuit is split across two heads. You can easily train a rank-64 matrix (the product of two matrices, 768x64 @ 64x768) to get 98.9% top-1 accuracy and 99.4% top-5 accuracy on the full OV circuit, which is better than the top-5 accuracy of the model's combined rank-128 matrix.

Since a single rank-64 matrix can approximate the desired 50K x 50K matrix much better than the rank-128 matrix the model actually uses, the two heads probably aren't just blindly copying tokens, and rank is possibly not a good explanation for the split; the heads are probably doing something else as well.

(I was using the MLAB2 w2d4 notebook, so the code might be somewhat incompatible, but it's pretty straightforward and should be easy to reproduce: https://gist.github.com/Mihonarium/7b4b9a4a17c8f1b1c67dc143b9225d53.)
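For reference, here is a rough sketch of the kind of experiment described above, not the linked gist itself. It uses GPT-2 small's `W_E` / `W_U` from TransformerLens purely for concreteness (the model in the course material may differ; swap in the relevant checkpoint), and the names `A` and `B` are just illustrative stand-ins for a single head's rank-64 `W_V @ W_O` factor:

```python
# Sketch: train a rank-64 factorisation so that the full OV circuit
# W_E @ (A @ B) @ W_U approximately copies tokens (maps each token to itself).
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2", device=device)

W_E = model.W_E.detach()   # (d_vocab, d_model), e.g. (50257, 768)
W_U = model.W_U.detach()   # (d_model, d_vocab)
d_model, d_head = W_E.shape[1], 64

# Learnable rank-64 factor standing in for one head's W_V @ W_O
A = torch.nn.Parameter(torch.randn(d_model, d_head, device=device) * 0.02)
B = torch.nn.Parameter(torch.randn(d_head, d_model, device=device) * 0.02)
opt = torch.optim.Adam([A, B], lr=1e-3)   # note: no weight decay

batch_size, n_steps = 1024, 5000
for step in range(n_steps):
    toks = torch.randint(0, W_E.shape[0], (batch_size,), device=device)
    # Full OV circuit restricted to a batch of tokens
    logits = (W_E[toks] @ A @ B) @ W_U
    # "Copying" objective: the circuit should map each token back to itself
    loss = F.cross_entropy(logits, toks)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluate top-1 / top-5 copying accuracy on a random sample of tokens
with torch.no_grad():
    toks = torch.randint(0, W_E.shape[0], (4096,), device=device)
    logits = (W_E[toks] @ A @ B) @ W_U
    top5 = logits.topk(5, dim=-1).indices
    top1_acc = (top5[:, 0] == toks).float().mean().item()
    top5_acc = (top5 == toks[:, None]).any(dim=-1).float().mean().item()
    print(f"top-1: {top1_acc:.3f}, top-5: {top5_acc:.3f}")
```

The loss here is cross-entropy against the input token itself, i.e. explicitly training the factorised circuit to act like the identity on the vocabulary, which matches the "tailored loss function and no weight decay" setup discussed in the reply below.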

callummcdougall commented 1 week ago

Thanks for adding this and supplying the link! I'm sure it is possible to train this specifically, although models' attention heads aren't trained directly to be faithful copying circuits; they're trained to predict next tokens, and having high-fidelity copying heads to implement things like induction circuits is just one of many ways to do that. I notice that your attached training code specifically tries to train this head to be the identity matrix using a tailored loss function and no weight decay, which is an easier setting in which to learn this specific pattern.

Additionally, it's often the case that two different components will both start to learn some task X before either learns to do it across the full dataset, leading to the capability being split across heads. For example, one head might learn copying as part of an induction circuit, and another might learn copying so it can copy repeated names or proper nouns in a sentence; if these cases don't overlap, both capabilities could plausibly be learned independently. However, I do agree that these heads are very plausibly doing something other than copying!