TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/
MIT License

[Proposal] Demo and Tutorial on Patchscopes and "Patching + Generation" #680

Closed HenryCai11 closed 3 weeks ago

HenryCai11 commented 1 month ago

Proposal

UPDATE: Demo and Tutorial on Patchscopes and "Patching + Generation"

DEPRECATED: Replication of the original causal tracing from the ROME paper.

Motivation

I found that the original causal tracing method isn't supported here, and I think it has some advantages over the current activation patching method. For example, corrupting the input with Gaussian noise might preserve more semantic information from the original sentence than corrupting it by swapping words.

Pitch

To replicate the original causal tracing method from the ROME paper (https://arxiv.org/abs/2202.05262)

Alternatives

I am also considering replicating Patchscopes here, which is also mentioned in issue #500. Since Patchscopes can be seen as a more general framework for this kind of patching/intervention-based method, implementing it here would also make causal tracing available. I'd like to open another issue for the replication of Patchscopes later.

Additional context

I've implemented a version locally, and would like to put some examples here, comparing results from my implementation and from the original implementation.

neelnanda-io commented 1 month ago

Thanks for the proposal! I think it's an interesting technique, but in follow-up work https://arxiv.org/abs/2309.16042 we found the Gaussian noise seemed to break the model in some ways, so I wouldn't want to feature it too prominently.

I think the technique is easy enough to implement with TransformerLens as is (you just need to use hook_embed to replace with Gaussian noise, collect new activations, and then patch), so I don't see the need for any changes to the core codebase. A demo could work, but I'd probably suggest you just do that yourself and, e.g., write a blog post about it.
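
As a rough illustration of the recipe in the paragraph above (noise at the embedding, cached activations, then patching), here is a library-free toy in plain Python. It is only a sketch under stated assumptions: `embed`, `block`, `run`, and the `resid_N` hook names are made up for this example, and the "model" is a few tanh layers rather than a transformer. Only the name `hook_embed` comes from the comment. The toy adds noise to the embeddings of chosen positions and then restores clean activations into the noised run, which is the direction used in the ROME paper. It does not attempt to reproduce TransformerLens's actual API.

```python
import math
import random

N_LAYERS = 3  # toy depth; hypothetical, not tied to any real model


def embed(token_ids):
    """Toy embedding: one float per token."""
    return [t / 10.0 for t in token_ids]


def block(x, layer):
    """Toy layer: every position sees the sequence mean, then a tanh."""
    mean = sum(x) / len(x)
    return [math.tanh(v + 0.5 * mean + 0.1 * layer) for v in x]


def run(token_ids, hooks=None, cache=None):
    """Forward pass with optional hooks (name -> fn) and an activation cache."""
    hooks = hooks or {}

    def point(name, act):
        if name in hooks:
            act = hooks[name](act)      # a hook may replace the activation
        if cache is not None:
            cache[name] = list(act)     # store a copy for later patching
        return act

    x = point("hook_embed", embed(token_ids))
    for layer in range(N_LAYERS):
        x = point(f"resid_{layer}", block(x, layer))
    return sum(x)  # scalar stand-in for a logit


tokens = [3, 1, 4, 1, 5]
subject_positions = [0, 1]  # positions whose embeddings get corrupted

# Clean run: cache every activation.
clean_cache = {}
clean_out = run(tokens, cache=clean_cache)

# Corrupted run: add fixed Gaussian noise at the embedding hook.
rng = random.Random(0)
noise = {p: rng.gauss(0.0, 1.0) for p in subject_positions}


def add_noise(act):
    return [v + noise.get(i, 0.0) for i, v in enumerate(act)]


corrupted_out = run(tokens, hooks={"hook_embed": add_noise})


def restore(name, positions):
    """Hook that overwrites the given positions with their clean values."""
    def hook(act):
        return [clean_cache[name][i] if i in positions else v
                for i, v in enumerate(act)]
    return hook


def patched_run(layer, positions):
    """Noised run with clean activations restored at one layer."""
    name = f"resid_{layer}"
    hooks = {"hook_embed": add_noise, name: restore(name, positions)}
    return run(tokens, hooks=hooks)


# Restoring every position at a layer undoes the corruption entirely.
full_restore = patched_run(0, range(len(tokens)))

# Restoring one (layer, position) at a time gives the causal-tracing grid.
partial = {(layer, p): patched_run(layer, [p])
           for layer in range(N_LAYERS) for p in range(len(tokens))}

print(clean_out, corrupted_out, full_restore)
```

Comparing each entry of `partial` against `clean_out` and `corrupted_out` shows how much of the clean behaviour a single restored activation recovers. The noise is drawn once and reused so that every patched run is corrupted in the same way.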

I'd be happy to get a demo of Patchscopes contributed though, since I think that's a more exciting technique that I'd be happy to see signal boosted, and I don't think our existing tutorials show how to combine patching and generation well.

HenryCai11 commented 1 month ago

@neelnanda-io Thanks for the clarification! Yeah, I've made a rough attempt at combining patching and generation, and found that there are several subtleties that need care. I'd like to contribute a tutorial on it and, beyond that, an implementation of Patchscopes.

Maybe I'll just use this thread for a "tutorial and demo on patching + generation and Patchscopes", and make PR(s) later.

bryce13950 commented 1 month ago

@HenryCai11 Do you need any help getting started with this? If you want to put the demo together, I would be happy to walk you through a couple steps in the process. Any new demos added to the project need to be covered by the CI, so there are a couple nuances to how they should be put together. A lot of how to do this can be seen within other demos that are currently covered, but I am happy to walk through any specific questions you may have on how to get started.

HenryCai11 commented 1 month ago

Thanks @bryce13950 ! I have now started to write up the demos. I'll see if it'll be more convenient to put them together, otherwise I think separating them would be nice.

HenryCai11 commented 1 month ago

Hi @bryce13950, I've made a pull request, #692. In the end I decided to put both of them in the same notebook, and I added clear comments on how to implement them. The notebook also passes the notebook test locally. Please let me know if there's anything else that I need to do. Thanks!