I currently collect all LLM activations from truthful_qa. This is a legacy choice, from when I was hunting for truthfulness directions in activation space.
I would like to test how well my technique generalizes, so I should also support activation collection on a representative subset of The Pile. That way, I can train and interpret autoencoders and circuits on one of the two datasets, then see how those features and circuits hold up under causal intervention on the other. There is already some support for a holdout validation subset, but this is a more interesting distributional shift to evaluate.
Concretely, there should be two collection scripts: acts_collect_pile.py and acts_collect_qa.py.
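A minimal sketch of how the two scripts might share one collection routine and differ only in where their texts come from. Everything here is an assumption for illustration: `sample_subset`, `collect_activations`, and the `get_acts` callback are hypothetical names, uniform reservoir sampling is just one possible reading of "representative subset", and the stand-in `len` takes the place of the real model forward pass.

```python
import random
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def sample_subset(texts: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k texts from a stream of unknown length.

    Reservoir sampling, so a large corpus never has to be held in memory.
    """
    rng = random.Random(seed)  # seeded so the subset is reproducible
    reservoir: List[str] = []
    for i, text in enumerate(texts):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(text)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = text
    return reservoir


def collect_activations(texts: Iterable[str], get_acts: Callable[[str], T]) -> List[T]:
    """Apply a (hypothetical) activation-extraction callback to each text."""
    return [get_acts(text) for text in texts]


if __name__ == "__main__":
    # Stand-in for a streamed corpus; the real script would read The Pile here.
    fake_pile = (f"document {i}" for i in range(1000))
    subset = sample_subset(fake_pile, k=8, seed=0)
    # `len` is a placeholder for the real forward pass that returns activations.
    acts = collect_activations(subset, get_acts=len)
    print(len(acts))
```

Under this layout, acts_collect_pile.py would pass a sampled Pile stream and acts_collect_qa.py would pass the truthful_qa prompts, with both calling the same `collect_activations`, so the stored activations are directly comparable across the two datasets.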