dtch1997 / sae-eap

Edge attribution patching with SAEs

[Bug] Attribution scores don't match original. #9

Open dtch1997 opened 4 days ago

dtch1997 commented 4 days ago

In 'notebooks/test_our_attrib_matches_original.ipynb', we check our attribution scores against those computed by the original implementation.

Annoyingly, the scores don't match. Have to figure out why this is the case...

dtch1997 commented 4 days ago

Some quick sanity checks.

dtch1997 commented 4 days ago

The gold standard would be to compare to Michael Hanna's implementation and ensure that we get the same scores per edge.
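A rough sketch of what a per-edge comparison could look like, assuming both implementations can dump their scores as a plain dict from edge name to float. The function name `compare_edge_scores`, the edge names, the tolerances, and the numbers are all made up for illustration; none of this is taken from either codebase.

```python
import math

def compare_edge_scores(ours, reference, rel_tol=1e-4, abs_tol=1e-6):
    """Hypothetical helper: list the edges whose scores disagree.

    Returns (missing, mismatched), where `missing` holds edges present in
    only one of the two dicts and `mismatched` holds
    (edge, our_score, reference_score, abs_diff) rows, largest diff first.
    """
    missing = sorted(set(ours) ^ set(reference))
    mismatched = []
    for edge in sorted(set(ours) & set(reference)):
        a, b = ours[edge], reference[edge]
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatched.append((edge, a, b, abs(a - b)))
    mismatched.sort(key=lambda row: row[3], reverse=True)
    return missing, mismatched

# Toy scores, invented for this example (not real attribution values).
ours = {"a0.h1->m0": 0.50, "m0->logits": -0.20, "a1.h0->m1": 0.031}
reference = {"a0.h1->m0": 0.50, "m0->logits": 0.20, "a1.h0->m1": 0.030}

missing, mismatched = compare_edge_scores(ours, reference)
for edge, a, b, diff in mismatched:
    print(f"{edge}: ours={a:+.4f} reference={b:+.4f} diff={diff:.4f}")
```

Sorting by absolute difference puts the worst offenders first, which might help tell a sign flip (as in the toy `m0->logits` edge) apart from small numerical drift.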

dtch1997 commented 4 days ago

It's probably faster to first implement the pruning and evaluation code (which we'll need eventually anyway) and check if our circuit metrics match his.
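As a minimal sketch of the pruning side, assuming we keep the k edges with the largest absolute attribution score (one common choice; whether it matches the reference implementation's pruning rule is an assumption). `top_k_edges` and `jaccard` are hypothetical helpers, and the scores are toy values.

```python
def top_k_edges(scores, k):
    """Hypothetical pruning rule: keep the k edges with largest |score|.

    Ties are broken by edge name so the result is deterministic.
    """
    ranked = sorted(scores, key=lambda edge: (-abs(scores[edge]), edge))
    return set(ranked[:k])

def jaccard(a, b):
    """Overlap between two edge sets: |a & b| / |a | b| (1.0 if both empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy scores, invented for this example.
ours = {"e1": 0.9, "e2": -0.8, "e3": 0.1, "e4": 0.05}
reference = {"e1": 0.9, "e2": -0.1, "e3": 0.7, "e4": 0.05}

our_circuit = top_k_edges(ours, k=2)
ref_circuit = top_k_edges(reference, k=2)
overlap = jaccard(our_circuit, ref_circuit)
print(sorted(our_circuit), sorted(ref_circuit), overlap)
```

If the pruned circuits overlap heavily but the downstream metrics still differ, that would point at the evaluation code rather than the attribution scores.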