That is, train linear maps with ReLUs between activations collected at two layers, and encourage the learned map to be sparse in its connections, thereby directly optimizing for a sparse circuit.
Note, though, that this would still be open-ended exploration without a clear ground truth in naturalistic transformer models, and so perhaps not very rewarding science.
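A minimal sketch of what this could look like, assuming the activations from the two layers have already been collected into tensors; the function name, hidden width, and L1 coefficient below are illustrative choices, not part of the original proposal:

```python
import torch
import torch.nn as nn

def train_sparse_map(acts_a, acts_b, hidden=512, l1=1e-3, lr=1e-3, epochs=100):
    """Fit a linear->ReLU->linear map from layer-A activations to layer-B
    activations, with an L1 penalty on the weights so the learned map is
    sparse in its connections.

    acts_a: [n_samples, d_a] activations collected at the earlier layer
    acts_b: [n_samples, d_b] activations collected at the later layer
    """
    d_a, d_b = acts_a.shape[1], acts_b.shape[1]
    mapper = nn.Sequential(nn.Linear(d_a, hidden), nn.ReLU(), nn.Linear(hidden, d_b))
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        pred = mapper(acts_a)
        recon = torch.mean((pred - acts_b) ** 2)  # match downstream activations
        # L1 on connection weights pushes the map toward a sparse circuit
        sparsity = sum(p.abs().sum() for n, p in mapper.named_parameters() if "weight" in n)
        loss = recon + l1 * sparsity
        loss.backward()
        opt.step()
    return mapper
```

One could then threshold the small weights of the trained map and read off the surviving connections as the candidate circuit.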