TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/
MIT License

[Proposal] Add MVP Support For 1-2 Models Per-Modality #710

Open 4gatepylon opened 2 months ago

4gatepylon commented 2 months ago

Is this out of scope? I hope not, would be nice to have a one-stop shop for interpretability tooling.

Proposal

It should be easy to get the most bare-bones interpretability research off the ground for models that are not just transformer language models. Obviously, TransformerLens should not have to support every model out there, but I think it would be cool to support just 1-2 very popular models per modality.

Not sure what I think about diffusion.

The scope of this is simple:

  1. Be able to hook into a prototypical model for each modality
  2. Have easy functionality to get step-wise computation when you are running an iterative model

With these two features it's easy to train SAEs on top, etc. (even if it isn't optimally efficient).
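The two scope items above can be sketched with a minimal, dependency-free mock. The names `HookPoint` and `run_with_hooks` follow TransformerLens conventions, but the toy iterative model and everything else here are hypothetical, just to illustrate the pattern:

```python
from typing import Callable, List

class HookPoint:
    """An identity layer that lets callers observe or edit the value passing through."""
    def __init__(self, name: str):
        self.name = name
        self._hooks: List[Callable] = []

    def add_hook(self, fn: Callable) -> None:
        self._hooks.append(fn)

    def clear(self) -> None:
        self._hooks.clear()

    def __call__(self, value):
        for fn in self._hooks:
            out = fn(value, self)
            if out is not None:  # a hook may edit the value or just observe it
                value = out
        return value

class ToyIterativeModel:
    """Stand-in for any iterative model: each step halves the input
    and exposes the intermediate value through a hook point."""
    def __init__(self, n_steps: int = 4):
        self.n_steps = n_steps
        self.hook_step = HookPoint("hook_step")

    def run(self, x: float) -> float:
        for _ in range(self.n_steps):
            x = self.hook_step(x * 0.5)  # step-wise computation is observable here
        return x

    def run_with_hooks(self, x: float, fwd_hooks) -> float:
        for name, fn in fwd_hooks:
            assert name == self.hook_step.name
            self.hook_step.add_hook(fn)
        try:
            return self.run(x)
        finally:
            self.hook_step.clear()

# Collect the activation at every step, as one would to train an SAE on them.
cache: List[float] = []
model = ToyIterativeModel(n_steps=4)
out = model.run_with_hooks(16.0, fwd_hooks=[("hook_step", lambda v, hp: cache.append(v))])
print(cache)  # [8.0, 4.0, 2.0, 1.0]
print(out)    # 1.0
```

Once every step's activation lands in a cache like this, downstream tooling (SAE training, probing, steering) only needs the cached tensors, not the model internals.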

Motivation

There are resources out there for this stuff, but they are a little scattered, and often there isn't a nice tutorial for, say, "get SAEs for my music-gen model".

Pitch

This is not that hard. All it entails is:

  1. Models properly extend HookedRootModule and have HookPoints
  2. Tests that make sure this works
  3. Some sort of tutorial ipynb to do basic steering or train an SAE on a layer
  4. (BONUS) A single not-bad trained SAE and steering example per-modality (this may leak into SAE Lens or be out of scope)
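The "basic steering" in item 3 amounts to editing an intermediate activation during the forward pass. A minimal sketch, using a hypothetical two-layer toy model rather than any real TransformerLens API:

```python
class Hooked:
    """Toy 'model': two scalar layers with an editable hook between them (hypothetical)."""
    def __init__(self):
        self.hook_fn = None  # optional editing function applied mid-forward

    def forward(self, x: float) -> float:
        h = x + 1.0              # layer 1
        if self.hook_fn is not None:
            h = self.hook_fn(h)  # steering happens here
        return h * 2.0           # layer 2

model = Hooked()
baseline = model.forward(3.0)    # (3 + 1) * 2 = 8.0

# "Steering": add a constant (a stand-in for a steering vector) to the activation.
steering_strength = 10.0
model.hook_fn = lambda h: h + steering_strength
steered = model.forward(3.0)     # (3 + 1 + 10) * 2 = 28.0

print(baseline, steered)  # 8.0 28.0
```

A tutorial notebook per modality would follow exactly this shape: run the model once for a baseline, then run it again with an edit installed at one hook point.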

It might be possible to get some code-gen or AI to generate basic versions of this?

It has benefits:

  1. Helps us work through the generalization steps that will eventually be necessary if we ever want a one-stop shop for interpretability tooling, while NOT requiring us to tackle genuinely unclear questions like "how do I do attention table visualization for sound?"
  2. Speeds people up.

A lot of little questions come up all the time. These should be as easy as run_with_hooks on a model that works out of the box, plus calling some function on your activations. These things could take a couple of hours and not be bug-prone, instead of twice that and somewhat finicky.

Alternatives

The current paradigm is that you spend the first few hours of a project getting hook points and SAEs integrated. It's not the end of the world. You can also use PyTorch hooks, but we use TransformerLens for a reason.


bryce13950 commented 1 month ago

This is a lot more complicated than it seems. TransformerLens currently supports 185 models. If you look through the source code of transformers, you will see that every model architecture has its own implementation. TransformerLens is already trying to condense that into unified implementations, which has caused a ton of problems. Models that used to work fine seem to break over time without much notice, because someone tweaks something for a new model without realizing that one of the other 185 models has become inaccurate. I am currently in the process of auditing every single model and fixing each implementation without breaking the others. This turns into a lot of additional components to handle model-specific nuances, and multiple layers of complexity that grow substantially with every model that is supported.

Now to take what we have and add whole new types of models would exponentially grow the existing issues. What I think is a better solution would be to turn TransformerLens into a platform by adding programmatic hook points, and allow someone to build plugins that can, for instance, add support for vision models. This would open up a whole myriad of new possibilities for larger interpretability tooling built on top of TransformerLens, and allow people to build and extend the code in ways that we cannot yet imagine. It would also allow us to focus on making TransformerLens as good at transformer interpretability as it possibly can be, instead of trying to make it a one-size-fits-all solution.
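The plugin idea above could start as something like a registry that maps a modality name to an adapter a child project contributes. Everything below is a hypothetical sketch of that shape, not an existing TransformerLens API:

```python
from typing import Callable, Dict

# Hypothetical plugin registry: maps a modality name to an adapter class.
_PLUGINS: Dict[str, Callable] = {}

def register_plugin(modality: str):
    """Decorator a child project would use to add, e.g., vision support."""
    def decorator(cls):
        _PLUGINS[modality] = cls
        return cls
    return decorator

def load_adapter(modality: str):
    """Instantiate the adapter registered for a modality, if any."""
    if modality not in _PLUGINS:
        raise KeyError(f"no plugin registered for modality {modality!r}")
    return _PLUGINS[modality]()

@register_plugin("vision")
class VisionAdapter:
    """A third-party plugin: would wire hook points into a ViT-style model."""
    def hook_names(self):
        return ["hook_patch_embed", "hook_attn_out"]

adapter = load_adapter("vision")
print(adapter.hook_names())  # ['hook_patch_embed', 'hook_attn_out']
```

The core library would only own the registry and the hook-point contract; each modality's quirks would live in its own plugin package.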

For extending TransformerLens, there are quite a few pieces of the code that need to be cleaned up to allow for this sort of capability, but there are places where it can feasibly start. If you are interested in doing something like this, then I can add some experimental points that you could hook into with a child project. It is a pretty low priority at the moment and something for further out in the future, but I would be happy to start playing around with it now if there is a reason to do so.

ashwath98 commented 1 month ago

Hey, I'm interested in adding support for vision models (I have to do it for one of the baselines for my project), such as one of the ViTs. Could you point me to where to look so I can help do this?

bryce13950 commented 1 month ago

Which models do you want to use? I have helped a couple people get them up and running, and I am sure they would be happy to share their code for the models they have used. If the models do not intersect, then I would still be happy to help you troubleshoot it.