callummcdougall / sae_vis

Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
MIT License

Note - I'm still open to accepting PRs on this library, and am very happy for other people to build on it, but I won't be actively maintaining it going forwards since I'll be focusing on my job. The SAELens library will continue to have more development and iteration, and it uses a fork of this repo as well as containing a much larger suite of tools for working with SAEs, so depending on your use case you might find that library preferable!


This codebase was designed to replicate Anthropic's sparse autoencoder visualisations, which you can see here. The codebase provides two different views: a feature-centric view (like the one in the link, where we look at one particular feature and see things like which tokens that feature fires most strongly on) and a prompt-centric view (where we look at one particular prompt and see which features fire most strongly on that prompt, according to a variety of different metrics).

Install with pip install sae-vis. Link to PyPI page here.
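After installing, one way to confirm which version ended up in your environment is to query the package metadata from Python. This is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical (it is not part of sae_vis), and the sketch assumes the distribution name matches the `sae-vis` name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "sae-vis") -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent.

    `dist_name` defaults to "sae-vis", which is assumed to be the name the
    package is registered under (taken from the pip command in this README).
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("sae-vis is not installed in this environment")
    else:
        print(f"sae-vis version: {found}")
```

The helper returns `None` instead of raising, so it can be called safely in a setup script or notebook cell before importing the library.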

Features & Links

Important note - this repo was significantly restructured in March 2024 (we'll remove this message at the end of April). The recent changes include:

Here is a link to a Google Drive folder containing 3 files:

In the demo Colab, we show the two different types of vis which are supported by this library:

  1. Feature-centric vis, where you look at a single feature and see e.g. which sequences in a large dataset this feature fires strongest on.
  2. Prompt-centric vis, where you input a custom prompt and see which features score highest on that prompt, according to a variety of possible metrics.
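To make the difference between the two views concrete, here is a toy sketch of the ranking idea behind each one. It does not use the sae_vis API: the function names, the token list, and the activation values are all invented for illustration, and "max activation over the prompt" is just one assumed example of a scoring metric.

```python
from typing import List, Tuple

# Toy data, invented for illustration: ACTS[token_index][feature_index]
# holds the activation of each (hypothetical) SAE feature on each token.
TOKENS: List[str] = ["The", " cat", " sat", " on", " the", " mat"]
ACTS: List[List[float]] = [
    [0.0, 0.1, 0.0],
    [2.5, 0.0, 0.3],
    [0.2, 1.7, 0.0],
    [0.0, 0.0, 0.9],
    [0.0, 0.2, 0.0],
    [1.9, 0.0, 0.4],
]


def top_tokens_for_feature(
    tokens: List[str], acts: List[List[float]], feature: int, k: int
) -> List[Tuple[str, float]]:
    """Feature-centric idea: fix one feature, rank tokens by its activation."""
    scored = [(tok, row[feature]) for tok, row in zip(tokens, acts)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def top_features_for_prompt(
    acts: List[List[float]], k: int
) -> List[Tuple[int, float]]:
    """Prompt-centric idea: fix one prompt, rank features by a score.

    The score used here is the maximum activation over the prompt's tokens,
    which is an assumed stand-in for whichever metric you care about.
    """
    n_features = len(acts[0])
    scored = [(f, max(row[f] for row in acts)) for f in range(n_features)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Which tokens does feature 0 fire most strongly on?
    print(top_tokens_for_feature(TOKENS, ACTS, feature=0, k=2))
    # Which features score highest on this prompt?
    print(top_features_for_prompt(ACTS, k=2))
```

In the real library the first ranking runs over sequences from a large dataset, not a single sentence, and the second can use several metrics. The sketch only shows which axis is held fixed in each view.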

Citing this work

To cite this work, you can use this bibtex citation:

@misc{sae_vis,
    title  = {{SAE Visualizer}},
    author = {Callum McDougall},
    howpublished    = {\url{https://github.com/callummcdougall/sae_vis}},
    year   = {2024}
}

Contributing

This project uses Poetry for dependency management. After cloning the repo, install dependencies with poetry install.

This project uses Ruff for formatting and linting, Pyright for type-checking, and Pytest for tests. If you submit a PR, make sure that your code passes all checks. You can run all checks with make check-all.

Version history (recording started at 0.2.9)