callummcdougall / sae_vis

Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
MIT License

Note: I'm planning a major refactor of this library soon; the branch can be found here (it contains demo examples too). The current implementation is an unholy patchwork of Python / HTML / JavaScript. The new version is much simpler: the vis starts from a minimal pre-existing HTML framework and is populated using JavaScript, and the only way Python interfaces with JavaScript is by dumping a single DATA dictionary into the JavaScript page. I've also created an Othello SAE vis, pictured below (also see it on my personal website homepage). I plan to get around to pushing updates to this library in late September / early October, so watch this space!
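To make the "single DATA dictionary" idea concrete, here is a minimal sketch of how a Python dictionary could be dumped into a static HTML page for JavaScript to read. It is not the code from the refactor branch: the template, the `__DATA_JSON__` placeholder, and the `render_page` helper are all hypothetical names chosen for illustration.

```python
import json

# Hypothetical minimal HTML framework. The placeholder __DATA_JSON__ is an
# assumption made for this sketch, not something taken from sae_vis.
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<body>
<div id="vis"></div>
<script>
const DATA = __DATA_JSON__;
// A real page would build the visualization from DATA here.
document.getElementById("vis").textContent = JSON.stringify(DATA);
</script>
</body>
</html>
"""


def render_page(data: dict) -> str:
    """Serialize `data` once and splice it into the page as a JS constant."""
    # Escape "</" so a string inside the data cannot close the <script> tag early.
    payload = json.dumps(data).replace("</", "<\\/")
    return HTML_TEMPLATE.replace("__DATA_JSON__", payload)


page = render_page({"feature": 1, "tokens": ["a", "b"]})
```

With this shape, Python never generates HTML or JavaScript fragments; it only produces one JSON payload, and everything else lives in the static page.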


Summary

This codebase was designed to replicate Anthropic's sparse autoencoder visualisations, which you can see here. The codebase provides two different views: a feature-centric view (like the one in the link, i.e. we look at one particular feature and see things like which tokens fire strongest on that feature) and a prompt-centric view (where we look at one particular prompt and see which features fire strongest on that prompt according to a variety of different metrics).

Install with pip install sae-vis. Link to PyPI page here.
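After installing, one way to confirm that the package is visible to your interpreter is to query its installed version through the standard library. This is a small sketch; the `installed_version` helper is a hypothetical name, and only the distribution name `sae-vis` comes from the install command above.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if sae-vis is installed in this environment, else None.
print(installed_version("sae-vis"))
```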

Features & Links

Important note - this repo was significantly restructured in March 2024 (we'll remove this message at the end of April). The recent changes include:

Here is a link to a Google Drive folder containing 3 files:

In the demo Colab, we show the two different types of vis which are supported by this library:

  1. Feature-centric vis, where you look at a single feature and see e.g. which sequences in a large dataset this feature fires strongest on.
  2. Prompt-centric vis, where you input a custom prompt and see which features score highest on that prompt, according to a variety of possible metrics.
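The difference between the two views can be sketched with a toy activation table. This is purely illustrative and does not use the sae_vis API: the `acts` values are made up, the function names are hypothetical, and "highest activation" stands in for the several metrics the library supports.

```python
# Toy data: acts[seq_idx][feature_idx] is the peak activation of a feature
# on a sequence. All values are invented for illustration.
acts = [
    [0.0, 2.5, 0.1],
    [1.2, 0.0, 0.0],
    [0.3, 4.0, 0.9],
    [0.0, 0.7, 3.1],
]


def top_sequences_for_feature(acts, feature_idx, k):
    """Feature-centric: the k sequences on which one feature fires strongest."""
    order = sorted(range(len(acts)), key=lambda s: acts[s][feature_idx], reverse=True)
    return order[:k]


def top_features_for_prompt(prompt_acts, k):
    """Prompt-centric: the k features that score highest on one prompt."""
    order = sorted(range(len(prompt_acts)), key=lambda f: prompt_acts[f], reverse=True)
    return order[:k]


# Feature-centric: fix feature 1, rank sequences.
best_sequences = top_sequences_for_feature(acts, feature_idx=1, k=2)
# Prompt-centric: fix one prompt (here, row 3), rank features.
best_features = top_features_for_prompt(acts[3], k=2)
```

In other words, the feature-centric view ranks down a column of the table, while the prompt-centric view ranks along a row.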

Citing this work

To cite this work, you can use this BibTeX citation:

@misc{sae_vis,
    title  = {{SAE Visualizer}},
    author = {Callum McDougall},
    howpublished    = {\url{https://github.com/callummcdougall/sae_vis}},
    year   = {2024}
}

Contributing

This project uses Poetry for dependency management. After cloning the repo, install dependencies with poetry install.

This project uses Ruff for formatting and linting, Pyright for type-checking, and Pytest for tests. If you submit a PR, make sure that your code passes all checks. You can run all checks with make check-all.

Version history (recording started at 0.2.9)