[Bug] Attribution scores don't match original.

dtch1997 / sae-eap

Edge attribution patching with SAEs


[Bug] Attribution scores don't match original. #9

Open dtch1997 opened 4 days ago

dtch1997 commented 4 days ago

In `notebooks/test_our_attrib_matches_original.ipynb`, we check our attribution scores against those computed by the original implementation.

Annoyingly, the scores don't match. We need to figure out why.

dtch1997 commented 4 days ago

Some quick sanity checks:

- Rerun with a single batch
- Scatterplot orig vs. our edge scores
  - Make sure edge order is correct
- Check that the circuit we get is reasonable. Depends on new features
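
A minimal sketch of the edge-score comparison, assuming both implementations can dump their scores as a plain `{edge_name: score}` mapping (that format, the edge names, and the helper `compare_edge_scores` are hypothetical, not part of either codebase). Aligning by edge name rather than by position takes edge order out of the picture, so any remaining mismatch is a real difference in scores.

```python
import math


def compare_edge_scores(ours, orig, atol=1e-5):
    """Align two {edge_name: score} mappings by name and summarize the differences."""
    # Edges present in only one of the two implementations.
    missing = sorted(set(ours) ^ set(orig))
    # Shared edges in a fixed (sorted) order, so position no longer matters.
    shared = sorted(set(ours) & set(orig))
    xs = [ours[e] for e in shared]
    ys = [orig[e] for e in shared]

    max_abs_diff = max((abs(x - y) for x, y in zip(xs, ys)), default=0.0)

    # Pearson correlation: close to 1 with a large max_abs_diff hints at a
    # scaling issue; close to -1 hints at a sign flip.
    pearson = float("nan")
    n = len(shared)
    if n >= 2:
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        if vx > 0 and vy > 0:
            pearson = cov / math.sqrt(vx * vy)

    return {
        "missing": missing,
        "max_abs_diff": max_abs_diff,
        "pearson": pearson,
        "match": not missing and max_abs_diff <= atol,
    }
```

The `(xs, ys)` pairs built over the shared edges are also what a scatterplot would show, so the same alignment step can feed the plot.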

dtch1997 commented 4 days ago

The gold standard would be to compare to Michael Hanna's implementation and ensure that we get the same scores per edge.

- This requires us to refactor his code to compute an `AttributionScores` object.
- We also need to standardize the order in which edges are computed.

dtch1997 commented 4 days ago

It's probably faster to first implement the pruning and evaluation code (which we'll need eventually anyway) and check if our circuit metrics match his.
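
As a cheap proxy before the full evaluation code exists, one could prune both sets of scores the same way and check how much the resulting circuits overlap. The sketch below assumes top-k pruning by absolute score and a `{edge_name: score}` mapping; both are assumptions for illustration, as are the helper names, and the actual pruning rule in either implementation may differ.

```python
def top_k_edges(scores, k):
    """Return the k edges with the largest absolute score (ties broken by name)."""
    ranked = sorted(scores, key=lambda e: (-abs(scores[e]), e))
    return set(ranked[:k])


def circuit_overlap(ours, orig, k):
    """Jaccard overlap between the two top-k circuits (1.0 means identical edge sets)."""
    a = top_k_edges(ours, k)
    b = top_k_edges(orig, k)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

A high overlap across several values of `k` would suggest the two implementations rank edges similarly even if the raw scores differ; it does not replace comparing the actual circuit metrics.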