Steering vanilla GPT2 with SAE vectors based on transformerlens version of GPT2

ianand commented 2 weeks ago

Question

@jbloomAus suggested I ask here in case @neelnanda-io or someone knows off hand.

Context: I'm trying to use SAE steering vectors from neuronpedia (e.g. https://www.neuronpedia.org/gpt2-small/5-res-jb/23) to steer a GPT2-small model using the technique in https://github.com/jbloomAus/SAELens/blob/8c67c2355211910bc5054ba9bc140e98424fa026/tutorials/using_an_sae_as_a_steering_vector.ipynb.

However, I'm interested in applying these SAE steering vectors to a vanilla GPT2 transformer that doesn't have all the modifications in https://github.com/TransformerLensOrg/TransformerLens/blob/main/further_comments.md (i.e. stock GPT2 small, with regular positional embeddings, without LayerNorm folding, etc).

I suspect the differences between the models (i.e. modified GPT2 in TransformerLens vs vanilla GPT2) means I can't use the already derived SAE W_dec vectors from gpt2-small-res-jb. Is that correct? In my experiments, I have tried it and it does not appear to work but I may be doing something wrong.

neelnanda-io commented 2 weeks ago

Hmm, the only way those modifications change the residual stream is by making it mean zero (over the d_model dimension). All other changes preserve layer input-output behaviour. This may break the SAE's ability to reconstruct the residual stream, but it shouldn't change the effect of steering with the decoder vector - it's just a vector. I don't know why it wouldn't be working

On Tue, 18 Jun 2024 at 15:37, Ishan Anand @.***> wrote:

Question

@jbloomAus https://github.com/jbloomAus suggested I ask here in case @neelnanda-io https://github.com/neelnanda-io or someone knows off hand.

Context: I'm trying to use SAE steering vectors from neuronpedia (e.g. https://www.neuronpedia.org/gpt2-small/5-res-jb/23) to steer a GPT2-small model using the technique in https://github.com/jbloomAus/SAELens/blob/8c67c2355211910bc5054ba9bc140e98424fa026/tutorials/using_an_sae_as_a_steering_vector.ipynb .

However, I'm interested in applying these SAE steering vectors to a vanilla GPT2 transformer that doesn't have all the modifications in https://github.com/TransformerLensOrg/TransformerLens/blob/main/further_comments.md (i.e. stock GPT2 small, with regular positional embeddings, without LayerNorm folding, etc).

I suspect the differences between the models (i.e. modified GPT2 in TransformerLens vs vanilla GPT2) means I can't use the already derived SAE W_dec vectors from gpt2-small-res-jb. Is that correct? In my experiments, I have tried it and it does not appear to work but I may be doing something wrong.

— Reply to this email directly, view it on GitHub https://github.com/TransformerLensOrg/TransformerLens/issues/642, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASRPNKIQILBPAX2TSW2JZPDZIBA37AVCNFSM6AAAAABJQF3P6GVHI2DSMVQWIX3LMV43ASLTON2WKOZSGM3DAMBSHEZDKNA . You are receiving this because you were mentioned.Message ID: @.***>

ianand commented 2 weeks ago

Thanks for the insight. Will have to dig in more when I have more time. Closing until I come back with an update.

ianand commented 2 weeks ago

Just a quick update.

Using the SAELens/TransformerLens steering vectors does indeed work on Vanilla GPT2.
Comparing temperature 0 outputs from the TransformersLens GPT2 model the logits/tokens are slightly different than vanilla GPT2 but semantically similar (measured by Vibes).
To get the desired effect for my test case, I have to use different values on the coefficient applied to the steering vector between vanilla GPT2 vs SAELens/TransformerLens GPT2. Which probably isn't surprising.
I was indeed making an error that prevented it from working and was also debugging by comparing logits directly which misled me given that the residual values are slightly different between the models (i.e. point number 2 above).

Thanks for your guidance @neelnanda-io

TransformerLensOrg / TransformerLens

Steering vanilla GPT2 with SAE vectors based on transformerlens version of GPT2 #642

Question