ApolloResearch / rib

Library for methods related to the Local Interaction Basis (LIB)
MIT License

Centered rib #257

Closed nix-apollo closed 7 months ago

nix-apollo commented 9 months ago

Centered rib

Description

Related Issue

Closes #248

Motivation and Context

This is a first version of centered rib. We will likely want to handle the lambda and/or edge calculation differently in the future to make it more principled. For instance, the baseline for IG is still the 0 point rather than the mean activation.
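To make the baseline point concrete, here is a minimal sketch assuming "IG" refers to integrated gradients. It is not the rib implementation: the toy function, the `integrated_gradients` helper and the `mean_baseline` values are all hypothetical and only show how the choice of baseline changes the attributions.

```python
def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Midpoint-rule approximation of integrated gradients on the straight path
    from `baseline` to `x`. `grad_fn` returns the gradient at a given point."""
    avg_grads = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grads[i] += g / n_steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]


def toy_grad(point):
    # Gradient of the toy function f(x) = sum(x_i ** 2).
    return [2.0 * p for p in point]


x = [3.0, 1.0]
zero_baseline = [0.0, 0.0]
mean_baseline = [2.0, 2.0]  # hypothetical mean activation, for illustration only

ig_zero = integrated_gradients(toy_grad, x, zero_baseline)
ig_mean = integrated_gradients(toy_grad, x, mean_baseline)
```

For this toy function the attributions sum to f(x) minus f(baseline) in both cases, so moving the baseline from 0 to the mean changes what each attribution is measured relative to.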

How Has This Been Tested?

I have a test that checks invariants for the output of the centered rib build, including that there is a single constant direction pointing in the direction we expect, and that activations in all other rib directions are centered.
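The two invariants can be sketched roughly as follows. This is a simplified illustration, not the test in this PR: the function name, the `const_idx` argument and the row/column layout of `acts` are assumptions made for the example.

```python
def check_centered_invariants(acts, const_idx, atol=1e-6):
    """Check the two invariants on activations expressed in the rotated basis.

    acts: list of samples, each a list with one value per basis direction
          (hypothetical layout).
    const_idx: index of the direction expected to be constant (assumed known).
    """
    n_samples = len(acts)
    n_dirs = len(acts[0])

    # Invariant 1: the constant direction takes the same value on every sample.
    const_vals = [row[const_idx] for row in acts]
    is_constant = all(abs(v - const_vals[0]) <= atol for v in const_vals)

    # Invariant 2: every other direction has mean zero over the samples.
    means = [sum(row[j] for row in acts) / n_samples for j in range(n_dirs)]
    others_centered = all(
        abs(m) <= atol for j, m in enumerate(means) if j != const_idx
    )
    return is_constant and others_centered


centered_acts = [[1.0, 0.5, -2.0], [1.0, -0.5, 2.0]]
uncentered_acts = [[1.0, 0.5, 1.0], [1.0, 0.5, 3.0]]

centered_ok = check_centered_invariants(centered_acts, const_idx=0)
uncentered_ok = check_centered_invariants(uncentered_acts, const_idx=0)
```

Here `centered_ok` is `True` and `uncentered_ok` is `False`, because the non-constant directions of the second dataset have non-zero means.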

This code was also used for various analyses in the OP report, where it seemed to do reasonable things.

Does this PR introduce a breaking change?

Residual stream reorder may break some analysis code. No interface changes.

nix-apollo commented 8 months ago

Todos:

nix-apollo commented 7 months ago

We only ever need a bias position for both:

Possibly I'll postpone this and not improve the current situation in this PR.

nix-apollo commented 7 months ago

Re: test tolerance. This was because I had accidentally made a test stricter by going from rtol=1e-5 (PyTorch's default) to rtol=0 (pytest's default), not because the computation got less precise. I've reverted the change for consistency.
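As a rough illustration of why dropping the relative tolerance makes a check stricter, the sketch below uses the closeness rule documented for `torch.allclose`, `|actual - expected| <= atol + rtol * |expected|`, written in plain Python. The `is_close` helper and the numbers are hypothetical and are not taken from the test suite.

```python
def is_close(actual, expected, rtol, atol=0.0):
    # Same inequality as the one documented for torch.allclose / numpy.isclose.
    return abs(actual - expected) <= atol + rtol * abs(expected)


expected = 1.0
actual = 1.0 + 1e-7  # small floating-point-sized discrepancy

passes_with_rtol = is_close(actual, expected, rtol=1e-5)
passes_without_rtol = is_close(actual, expected, rtol=0.0)
```

With `rtol=1e-5` the values compare as close. With `rtol=0` and no absolute tolerance the same values fail, even though nothing about the computation has changed.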