[]()

Automatic Circuit DisCovery

This is the accompanying code to the paper "Towards Automated Circuit Discovery for Mechanistic Interpretability" (NeurIPS 2023 Spotlight).

:zap: To run ACDC, see acdc/main.py, or this Colab notebook
:wrench: To see how edit edges in computational graphs in models, see notebooks/editing_edges.py or this Colab notebook
:sparkle: To understand the low-level implementation of completely editable computational graphs, see this Colab notebook or notebooks/implementation_demo.py

This library builds upon the abstractions (HookPoints and standardised HookedTransformers) from TransformerLens :mag_right:

Installation:

First, install the system dependencies for either Mac or Linux.

Then, you need Python 3.8+ and Poetry to install ACDC, like so

git clone git+https://github.com/ArthurConmy/Automatic-Circuit-Discovery.git
cd Automatic-Circuit-Discovery
poetry env use 3.10      # Or be inside a conda or venv environment
                         # Python 3.10 is recommended but use any Python version >= 3.8
poetry install

System Dependencies

:penguin: Ubuntu Linux

sudo apt-get update && sudo apt-get install libgl1-mesa-glx graphviz build-essential graphviz-dev

You may also need apt-get install python3.x-dev where x is your Python version (also see the issue and pygraphviz installation troubleshooting)

:apple: Mac OS X

On Mac, you need to let pip (inside poetry) know about the path to the Graphviz libraries.

brew install graphviz
export CFLAGS="-I$(brew --prefix graphviz)/include"
export LDFLAGS="-L$(brew --prefix graphviz)/lib"

Reproducing results

To reproduce the Pareto Frontier of KL divergences against number of edges for ACDC runs, run python experiments/launch_induction.py. Similarly, python experiments/launch_sixteen_heads.py and python subnetwork_probing/train.py were used to generate individual data points for the other methods, using the CLI help. All these three commands can produce wandb runs. We use notebooks/roc_plot_generator.py to process data from wandb runs into JSON files (see experiments/results/plots_data/Makefile for the commands) and notebooks/make_plotly_plots.py to produce plots from these JSON files.

Tests

From the root directory, run

pytest -vvv -m "not slow"

This will only select tests not marked as slow. These tests take a long time, and are good to run occasionally, but not every time.

You can run the slow tests with

pytest -s -m slow

Contributing

We welcome issues where the code is unclear!

If your PR affects the main demo, rerun

chmod +x experiments/make_notebooks.sh
./experiments/make_notebooks.sh

to automatically turn the main.py into a working demo and check that no errors arise. It is essential that the notebooks converted here consist only of #%% [markdown] markdown-only cells, and #%% cells with code.

Citing ACDC

If you use ACDC, please reach out! You can reference the work as follows:

@inproceedings{conmy2023automated,
      title={Towards Automated Circuit Discovery for Mechanistic Interpretability}, 
      author={Arthur Conmy and Augustine N. Mavor-Parker and Aengus Lynch and Stefan Heimersheim and Adri{\`a} Garriga-Alonso},
      booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
      year={2023},
      eprint={2304.14997},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

TODO

Mostly finished TODO list

[ x ] Make `TransformerLens` install be Neel's code not my PR [ x ] Add `hook_mlp_in` to `TransformerLens` and delete `hook_resid_mid` (and test to ensure no bad things?) [ x ] Delete `arthur-try-merge-tl` references from the repo [ x ] Make notebook on abstractions [ ? ] Fix huge edge sizes in Induction Main example and change that occurred [ x ] Find a better way to deal with the versioning on the Colabs installs... [ ] Neuron-level experiments [ ] Position-level experiments [ ] Edge gradient descent experiments [ ] Implement the circuit breaking paper [ x ] `tracr` and other dependencies better managed [ ? ] Make SP tests work (lots outdated so skipped) - and check SubnetworkProbing installs properly (no __init__.pys !!!) [ ? ] Make the 9 tests also failing on TransformerLens-main pass [ x ] Remove Codebase under construction

ArthurConmy / Automatic-Circuit-Discovery

readme