DavidUdell / sparse_circuit_discovery

Circuit discovery in GPT-2 small, using sparse autoencoding
MIT License

Add a t.abs() call in the threshold computation, to not exclude negative-valued effects. #94

Closed by DavidUdell 4 months ago

DavidUdell commented 4 months ago

The effect.item() values there are the actual signed activation diffs, not the absolute diffs used earlier at the branchings stage. Comparing them directly against the threshold excludes negative-valued effects, so I need a t.abs() call to measure unsigned effect sizes.
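A minimal sketch of the difference, not the repository's actual code. It uses plain Python floats in place of the values `effect.item()` would return, so the built-in `abs()` stands in for `t.abs()`. The edge names, the `effects` dict, the `THRESHOLD` value, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical signed activation diffs, keyed by an illustrative edge name.
# Each value plays the role of a float returned by `effect.item()`.
effects = {"a->b": 0.8, "a->c": -1.2, "b->d": 0.05, "c->d": -0.3}

THRESHOLD = 0.5  # assumed cutoff, not taken from the repository


def keep_signed(effects, threshold):
    """Compare the signed value: negative effects can never pass."""
    return {k: v for k, v in effects.items() if v >= threshold}


def keep_unsigned(effects, threshold):
    """Compare the magnitude, but keep the signed value for later use."""
    return {k: v for k, v in effects.items() if abs(v) >= threshold}


signed_kept = keep_signed(effects, THRESHOLD)
unsigned_kept = keep_unsigned(effects, THRESHOLD)

print(signed_kept)
print(unsigned_kept)
```

In this toy data the signed check drops `a->c` even though it has the largest magnitude, which is the failure mode described above. With tensors, the equivalent comparison would presumably be on `t.abs(effect).item()`.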

It may also turn out that positive-valued diffs have larger magnitudes than negative-valued diffs, while the negative-valued diffs are the objects of interest in the graphs. In that case, I'll have to add code that ignores positive-valued diffs and flags that it is doing so.
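One way that fallback could look, again as a sketch under assumptions rather than the planned implementation. The function name `keep_negative_only`, its return shape, and the choice of `warnings.warn` as the flagging mechanism are all hypothetical; the inputs are plain floats standing in for `effect.item()` values.

```python
import warnings


def keep_negative_only(effects, threshold):
    """Keep negative diffs whose magnitude clears the threshold.

    Positive diffs that would also have cleared the threshold are dropped,
    and a warning names them so the omission is visible to the user.
    """
    kept = {
        k: v for k, v in effects.items() if v < 0 and abs(v) >= threshold
    }
    dropped = sorted(
        k for k, v in effects.items() if v > 0 and abs(v) >= threshold
    )
    if dropped:
        warnings.warn(
            f"Ignoring {len(dropped)} positive-valued diff(s) "
            f"above threshold: {dropped}"
        )
    return kept, dropped


# Hypothetical example data and cutoff.
kept, dropped = keep_negative_only(
    {"a->b": 0.8, "a->c": -1.2, "c->d": -0.3}, 0.5
)
```

Returning the dropped names alongside the kept effects lets the caller annotate the graph as well as relying on the warning.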