TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/
MIT License

[Proposal] Add support for TracrBench #704

Open HannesThurnherr opened 2 months ago

HannesThurnherr commented 2 months ago

Proposal

Add support for TracrBench transformers

Motivation

@JeremyAlain and I recently wrote a paper in which we introduced a dataset of 121 tracr transformers. Tracr transformers are meant to be used as test beds or "sanity checks" in the development of novel interpretability methods. To make them as accessible as possible, we converted them from DeepMind's "haiku" framework to Hooked Transformers (following this template made by Neel). We would like to make these toy models available from within TransformerLens, and multiple people have asked us to do so.

Pitch

We have all the models uploaded to Hugging Face, and I have code to load them. It's a little different from the code used to load typical LLMs: since the models require input and output encoders, we wrap the HookedTransformer class in another simple class called "TracrModel".
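To make the shape of the wrapper concrete, here is a minimal, self-contained sketch. The names and signatures are illustrative assumptions, not the actual TracrBench API: in practice the wrapped model is a TransformerLens HookedTransformer and the encoders come from tracr itself; here both are replaced by toy stand-ins.

```python
# Hypothetical sketch of the "TracrModel" wrapper described above.
# In the real code, `model` would be a HookedTransformer and the
# encoders would be tracr's input/output encoders.

class InputEncoder:
    """Stand-in encoder: maps raw tokens to integer ids."""
    def __init__(self, vocab):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, tokens):
        return [self.token_to_id[t] for t in tokens]


class OutputEncoder:
    """Stand-in decoder: maps model output ids back to tokens."""
    def __init__(self, vocab):
        self.id_to_token = dict(enumerate(vocab))

    def decode(self, ids):
        return [self.id_to_token[i] for i in ids]


class TracrModel:
    """Wraps a model together with the encoders it needs, so callers
    can pass raw tokens instead of integer ids."""
    def __init__(self, model, input_encoder, output_encoder):
        self.model = model
        self.input_encoder = input_encoder
        self.output_encoder = output_encoder

    def __call__(self, tokens):
        ids = self.input_encoder.encode(tokens)
        out_ids = self.model(ids)  # HookedTransformer forward pass in practice
        return self.output_encoder.decode(out_ids)


# Toy usage: a "model" that simply reverses its input sequence.
vocab = ["BOS", "a", "b", "c"]
wrapped = TracrModel(lambda ids: ids[::-1],
                     InputEncoder(vocab), OutputEncoder(vocab))
print(wrapped(["BOS", "a", "b"]))  # ['b', 'a', 'BOS']
```

The design point is simply that encoding/decoding lives in the wrapper, so the underlying HookedTransformer stays unchanged and fully hookable.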

My question is whether this is possible and, if so, where this code / the tracr_models.py file should live.

Alternatives

An alternative would be to host the code for downloading the tracr models, for use with TransformerLens, in a separate repo.

bryce13950 commented 2 months ago

Can you add a link to the models on HuggingFace, and a link to the source code? Most likely, you will be able to utilize the majority of existing components to add this, but some new components will need to be created.

neelnanda-io commented 2 months ago

My personal inclination would be to just make this into another repo that builds on TransformerLens. What's the case for making this part of the core repo?


HannesThurnherr commented 2 months ago

> Can you add a link to the models on HuggingFace, and a link to the source code? Most likely, you will be able to utilize the majority of existing components to add this, but some new components will need to be created.

I've added the links to the issue. The code is behind the "another repo" link, which points to our TracrBench repo.

> My personal inclination would be to just make this into another repo that builds on TransformerLens. What's the case for making this part of the core repo?

We are happy to make it its own repo. The case for making it part of TransformerLens is that the whole point of the dataset and the paper was to make using tracr to evaluate interpretability methods as easy as possible, and integrating directly into TransformerLens would really help with that. If we do make it a separate repo, could the project be mentioned in the docs somewhere?