Open 4gatepylon opened 2 months ago
This is a lot more complicated than it seems. TransformerLens currently supports 185 models. If you look through the source code of transformers, you will see that every model architecture has its own implementation. TransformerLens is already trying to condense that into unified implementations, which has caused a ton of problems. Models that used to work fine seem to break over time without much notice, because someone tweaks something for a new model without realizing that one of the other 185 models has become inaccurate. I am currently in the process of auditing every single model, and fixing all implementations without breaking other implementations. This turns into a lot of additional components to handle specific nuances for some models, and multiple layers of complexity that grow substantially with every supported model.
Now taking what we have and adding whole new types of models on top would compound the existing issues. What I think is a better solution would be to turn TransformerLens into a platform by adding programmatic hook points, allowing someone to essentially build plugins that can, for instance, add support for vision models. This would open up a myriad of new possibilities for larger interpretability tooling built on top of TransformerLens, and allow people to build and extend the code in ways that we cannot yet imagine. It would also allow us to focus on making TransformerLens as good at Transformer interpretability as it possibly can be, instead of trying to make it a one-size-fits-all solution.
For extending TransformerLens, there are quite a few pieces of the code that need to be cleaned up to allow for this sort of capability, but there are places where it can feasibly start. If you are interested in doing something like this, then I can add some experimental points that you could hook into with a child project. It is a pretty low priority at the moment, but it is something worth beginning to work on well ahead of when it is needed. I would be happy to start playing around with it now if there is a reason to do so.
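To make the plugin idea concrete, a programmatic hook-point API could be sketched in plain Python roughly like this. All class, method, and hook-point names here are hypothetical illustrations, not an existing TransformerLens API:

```python
from collections import defaultdict
from typing import Any, Callable

class HookRegistry:
    """Minimal registry letting plugin code attach callbacks to named hook points.

    Hypothetical sketch: a host library would fire these points at fixed
    places in its forward pass, and plugins (e.g. a vision-model adapter)
    would register transformations without touching the core code.
    """

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[[Any], Any]]] = defaultdict(list)

    def register(self, point: str, fn: Callable[[Any], Any]) -> None:
        self._hooks[point].append(fn)

    def fire(self, point: str, value: Any) -> Any:
        # Callbacks compose in registration order; a callback may
        # transform the value or return None to leave it unchanged.
        for fn in self._hooks[point]:
            out = fn(value)
            if out is not None:
                value = out
        return value

# A hypothetical vision plugin registering its own preprocessing step:
registry = HookRegistry()
registry.register("pre_embed", lambda patches: [p * 2 for p in patches])
print(registry.fire("pre_embed", [1, 2, 3]))  # → [2, 4, 6]
```

A real version would need typed hook signatures and removal handles, but the core mechanism (named points plus a list of composable callbacks) is this small.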
Hey, I'm interested in creating support for vision models (I have to do it for one of the baselines for my project), such as one of the ViTs. Could you point me to where I should look to help do this?
Which models do you want to use? I have helped a couple people get them up and running, and I am sure they would be happy to share their code for the models they have used. If the models do not intersect, then I would still be happy to help you troubleshoot it.
Is this out of scope? I hope not; it would be nice to have a one-stop shop for interpretability tooling.
Proposal
It should be easy to get the most bare-bones interpretability research off the ground for models that are not just transformer language models. Obviously, TransformerLens should not have to support every model out there, but I think it would be cool to support just 1-2 very popular models per modality.
Not sure what I think about diffusion.
The scope of this is simple
With these two features it's easy to train SAEs on top etc... (even if it isn't optimally efficient).
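As a rough illustration of how little machinery "train SAEs on top" actually needs once activations are accessible, here is a minimal sparse-autoencoder sketch in PyTorch. The architecture, dimensions, and penalty coefficient are all illustrative choices, not anything prescribed by TransformerLens:

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Bare-bones sparse autoencoder over cached activations (illustrative)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

torch.manual_seed(0)
sae = TinySAE(d_model=16, d_hidden=64)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(256, 16)  # stand-in for cached model activations
for _ in range(50):
    recon, feats = sae(acts)
    # Reconstruction loss plus an L1 sparsity penalty on the features.
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The only model-specific work is producing `acts`; everything downstream is modality-agnostic, which is the point of the proposal.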
Motivation
There are resources out there for this stuff, but they are a little scattered, and often there isn't a nice tutorial to just "get SAEs for my music gen model", for example.
Pitch
This is not that hard. All it entails is:
It might be possible to get some code-gen or AI to generate basic versions of this?
It has benefits:
There's a lot of little questions that come up all the time like:
These should be as easy as calling run_with_hooks on a model that works out of the box, plus calling some function on your activations. These things could take a couple of hours and not be bug-prone, instead of twice that and somewhat finicky.
Alternatives
The current paradigm is that you spend the first few hours of a project on getting hook points and SAEs integrated. It's not the end of the world. You can also use PyTorch hooks, but we use TransformerLens for a reason.
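For reference, the raw-PyTorch-hooks alternative looks roughly like the sketch below: a toy model stands in for whatever architecture lacks TransformerLens support, and forward hooks cache the intermediate activations. The model and hook names are made up for illustration:

```python
import torch
import torch.nn as nn

# Toy stand-in for an unsupported model; forward hooks capture
# each submodule's output into a cache dict.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
cache = {}

def save_hook(name):
    def hook(module, inputs, output):
        cache[name] = output.detach()
    return hook

handles = [
    m.register_forward_hook(save_hook(f"layer_{i}"))
    for i, m in enumerate(model)
]
_ = model(torch.randn(5, 8))
for h in handles:
    h.remove()  # always remove hooks when done to avoid leaks
print({k: tuple(v.shape) for k, v in cache.items()})
```

This works, but it is exactly the kind of per-project boilerplate (naming, caching, cleanup) that a built-in run_with_hooks saves you from rewriting.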
Checklist