DavidUdell / sparse_circuit_discovery

Circuit discovery in GPT-2 small, using sparse autoencoding
MIT License

Support activation collection from The Pile #1

Closed DavidUdell closed 9 months ago

DavidUdell commented 9 months ago

I currently collect all LLM activations from truthful_qa. This is a legacy choice, from when I was hunting for truthfulness directions in activation space.

I would like to test how my technique generalizes, and for that I should also support activation collection on a representative subset of The Pile. That way, I can train and interpret autoencoders and circuits using one of those two datasets, and see how those features and circuits hold up under causal intervention on the other dataset. There is already a little support for a holdout validation subset, but this is a more interesting distributional shift to evaluate.

Basically, have an acts_collect_pile.py and an acts_collect_qa.py.

DavidUdell commented 9 months ago

Using openwebtext instead of The Pile, but the idea is the same: collect activations on a representative sample of the internet, alongside truthful_qa.
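
A minimal sketch of how the two collection scripts could share one routine and differ only in where the text comes from. This is not the repository's actual code. The function names, the layer index, and the 128-token cap are hypothetical, and the column names (`question` for truthful_qa, `text` for openwebtext) are assumed from the Hugging Face versions of those datasets. Reading activations through `output_hidden_states` is also an assumption about what "activations" means here.

```python
from itertools import islice
from typing import Dict, Iterable, List

# Assumed text column per dataset (based on the Hugging Face dataset cards).
TEXT_FIELDS: Dict[str, str] = {
    "truthful_qa": "question",
    "openwebtext": "text",
}


def extract_texts(dataset_name: str, rows: Iterable[dict], limit: int) -> List[str]:
    """Pull up to `limit` strings out of raw dataset rows."""
    field = TEXT_FIELDS[dataset_name]
    return [row[field] for row in islice(rows, limit)]


def collect_activations(texts: List[str], layer_idx: int = 6, model_name: str = "gpt2"):
    """Run GPT-2 small on each text; return one hidden-state tensor per text.

    Heavy imports live inside the function so the helper above can be used
    without torch or transformers installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    acts = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(
                text, return_tensors="pt", truncation=True, max_length=128
            )
            outputs = model(**inputs, output_hidden_states=True)
            # hidden_states[0] is the embedding output; index i is the
            # output of transformer block i. Drop the batch dimension.
            acts.append(outputs.hidden_states[layer_idx].squeeze(0))
    return acts
```

Under this layout, acts_collect_qa.py would call `extract_texts("truthful_qa", ...)` and the openwebtext script would call `extract_texts("openwebtext", ...)`, with both passing the result to the same `collect_activations`. Keeping the model-side code identical means any difference in downstream results comes from the data distribution, not from the collection path.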