DavidUdell / sparse_circuit_discovery

Circuit discovery in GPT-2 small, using sparse autoencoding
MIT License

Optimize for (almost) linear approximations of transformer layers directly #53

Closed DavidUdell closed 8 months ago

DavidUdell commented 8 months ago

That is, train linear maps, with ReLUs, between activations collected at two layers, and penalize the learned maps for dense connections, thus directly optimizing for a sparse circuit.
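A minimal sketch of the idea, with several loud assumptions: the activations here are synthetic stand-ins (not real cached GPT-2 small activations), the dimensions are shrunk from 768 for speed, and the sparsity pressure is implemented as an L1 penalty on the map's weights, which the issue doesn't specify.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64      # toy stand-in for GPT-2 small's 768-dim residual stream
N_SAMPLES = 256   # hypothetical number of cached activation vectors
L1_COEFF = 1e-4   # sparsity penalty on connection weights (assumed mechanism)
LR = 1.0
STEPS = 500

# Stand-ins for activations cached at an earlier and a later layer.
# The "later" activations are generated by a sparse ground-truth map so
# that a sparse solution exists; real transformer activations would not
# come with this guarantee.
acts_early = rng.standard_normal((N_SAMPLES, D_MODEL))
true_map = rng.standard_normal((D_MODEL, D_MODEL)) * (
    rng.random((D_MODEL, D_MODEL)) < 0.05
)
acts_late = np.maximum(acts_early @ true_map, 0.0)

# The learned approximation: a linear map followed by a ReLU.
W = 0.01 * rng.standard_normal((D_MODEL, D_MODEL))
b = np.zeros(D_MODEL)

losses = []
for step in range(STEPS):
    z = acts_early @ W + b
    pred = np.maximum(z, 0.0)           # ReLU between the two layers
    err = pred - acts_late
    recon = (err ** 2).mean()
    losses.append(recon + L1_COEFF * np.abs(W).sum())
    # Gradients: MSE backpropagated through the ReLU mask, plus the
    # L1 subgradient that pushes connections toward exact zero.
    dz = (2.0 / err.size) * err * (z > 0)
    dW = acts_early.T @ dz + L1_COEFF * np.sign(W)
    db = dz.sum(axis=0)
    W -= LR * dW
    b -= LR * db

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The L1 term is one natural way to make the map "try to be sparse in connections"; alternatives (L0 relaxations, hard top-k pruning) would slot into the same loop.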

Note, though, that this would still be open-ended exploration without a clear ground truth in naturalistic transformer models, and so maybe not very rewarding science for that reason.