TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/

Steering vanilla GPT2 with SAE vectors based on transformerlens version of GPT2 #642

Closed. ianand closed this issue 2 weeks ago

ianand commented 2 weeks ago

Question

@jbloomAus suggested I ask here in case @neelnanda-io or someone else knows offhand.

Context: I'm trying to use SAE steering vectors from neuronpedia (e.g. https://www.neuronpedia.org/gpt2-small/5-res-jb/23) to steer a GPT2-small model using the technique in https://github.com/jbloomAus/SAELens/blob/8c67c2355211910bc5054ba9bc140e98424fa026/tutorials/using_an_sae_as_a_steering_vector.ipynb.
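
(For anyone else reading this, here is roughly the shape of the technique from that tutorial as I understand it, in TransformerLens terms. This is a hedged sketch rather than the tutorial's exact code; `COEFF` and the random stand-in for the decoder row are placeholders.)

```python
import torch
from transformer_lens import HookedTransformer

# Sketch of residual-stream steering in TransformerLens.
# `steering_vector` stands in for one row of the SAE decoder (sae.W_dec[FEATURE_IDX]);
# COEFF and the prompt are illustrative placeholders, not the tutorial's exact values.
model = HookedTransformer.from_pretrained("gpt2")
steering_vector = torch.randn(model.cfg.d_model)  # placeholder for the real decoder row
COEFF = 10.0
HOOK_NAME = "blocks.5.hook_resid_pre"  # hook point the layer-5 res-jb SAE was trained on

def steering_hook(resid_pre, hook):
    # Add the scaled decoder direction to the residual stream at every position.
    return resid_pre + COEFF * steering_vector.to(resid_pre.device, resid_pre.dtype)

with model.hooks(fwd_hooks=[(HOOK_NAME, steering_hook)]):
    print(model.generate("I think that", max_new_tokens=20))
```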

However, I'm interested in applying these SAE steering vectors to a vanilla GPT2 transformer that doesn't have all the modifications in https://github.com/TransformerLensOrg/TransformerLens/blob/main/further_comments.md (i.e. stock GPT2-small, with regular positional embeddings, without LayerNorm folding, etc.).

I suspect the differences between the models (i.e. the modified GPT2 in TransformerLens vs vanilla GPT2) mean I can't use the already-derived SAE W_dec vectors from gpt2-small-res-jb. Is that correct? In my experiments it does not appear to work, but I may be doing something wrong.
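
For concreteness, here is the shape of what I'm trying on the vanilla HuggingFace model, steering the input of block 5 (which should correspond to `blocks.5.hook_resid_pre`). Again a sketch with placeholder values rather than my exact notebook code:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch of the same steering applied to a vanilla (HuggingFace) GPT2.
# The input to transformer.h[5] corresponds to blocks.5.hook_resid_pre in
# TransformerLens; `steering_vector` again stands in for the SAE decoder row.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

steering_vector = torch.randn(model.config.n_embd)  # placeholder for the real W_dec row
COEFF = 10.0

def steering_pre_hook(module, args):
    # args[0] is the hidden state entering block 5; add the scaled direction and
    # return the modified positional args so the block sees the steered stream.
    hidden_states = args[0]
    steered = hidden_states + COEFF * steering_vector.to(hidden_states.device, hidden_states.dtype)
    return (steered,) + args[1:]

handle = model.transformer.h[5].register_forward_pre_hook(steering_pre_hook)
try:
    input_ids = tokenizer("I think that", return_tensors="pt").input_ids
    out = model.generate(input_ids, max_new_tokens=20, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()
```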

neelnanda-io commented 2 weeks ago

Hmm, the only way those modifications change the residual stream is by making it mean zero (over the d_model dimension). All other changes preserve layer input-output behaviour. This may break the SAE's ability to reconstruct the residual stream, but it shouldn't change the effect of steering with the decoder vector - it's just a vector. I don't know why it wouldn't be working.
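
For what it's worth, a quick sanity-check sketch (assuming default `from_pretrained` processing on the TransformerLens side) that the two residual streams only differ by the per-position mean:

```python
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

# Sketch: with default processing (fold_ln, center_writing_weights), the
# TransformerLens residual stream should equal the vanilla GPT2 hidden state
# minus its per-position mean over d_model, up to floating-point noise.
tl_model = HookedTransformer.from_pretrained("gpt2", device="cpu")
hf_model = GPT2LMHeadModel.from_pretrained("gpt2")

tokens = tl_model.to_tokens("The cat sat on the mat")
with torch.no_grad():
    _, cache = tl_model.run_with_cache(tokens)
    hf_hidden = hf_model(tokens, output_hidden_states=True).hidden_states

tl_resid = cache["blocks.5.hook_resid_pre"]   # [batch, pos, d_model]
hf_resid = hf_hidden[5]                       # input to block 5 in the vanilla model
hf_centered = hf_resid - hf_resid.mean(dim=-1, keepdim=True)
print((tl_resid - hf_centered).abs().max())   # should be close to zero
```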

ianand commented 2 weeks ago

Thanks for the insight. I'll have to dig in more when I have more time. Closing this until I come back with an update.

ianand commented 2 weeks ago

Just a quick update.

Thanks for your guidance, @neelnanda-io.