Stochastic sources jacobian basis

stefan-apollo commented 6 months ago

Implement stochastic sources (for LMs) for the jacobian basis. Over out_pos, out_hidden, or both.

Changes

Add arguments n_stochastic_sources_basis_pos and n_stochastic_sources_basis_hidden to set the number of stochastic sources and where to apply them (out pos dim, out hidden dim, or both). Leaving both as None (the default) gives the non-stochastic result.
- Note that if both are used, the individual n_ values do not matter; only their product does.
Rename n_stochastic_sources to n_stochastic_sources_edges
Rename stochastic edges to squared edges, and infer whether to use stochastic sources from whether n_stochastic_sources_edges is not None
The jacobian basis calculation now always uses "sources" which are either stochastic, or an identity matrix (recovering the non-stochastic behaviour). This is done to avoid duplication.
We use a two dimensional stochastic source to easily support the 2 different dimensions of stochastic sources.
Note that phi takes a lot of GPU memory, but in_grads takes even more, so don't worry about it for now. There was no easy way to avoid it.
Better TQDM descriptions
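
To illustrate the "sources" idea from the list above, here is a minimal pure-Python sketch, not the rib implementation. It assumes the basis calculation needs a Gram matrix of the form J^T J, with the output dimensions (out_pos, out_hidden) flattened into one axis. The names gram_from_sources, identity_sources and rademacher_sources are hypothetical, and random ±1 sources are one possible choice of distribution. With identity sources the sum over sources recovers J^T J exactly; with random sources the average over sources is an unbiased estimate of it.

```python
import random

def gram_from_sources(jac, sources, average):
    """Accumulate sum_r (s_r^T J)^T (s_r^T J) over the given source vectors."""
    n_out, n_in = len(jac), len(jac[0])
    gram = [[0.0] * n_in for _ in range(n_in)]
    for s in sources:
        # Project the Jacobian onto one source: v = s^T J (length n_in).
        v = [sum(s[i] * jac[i][j] for i in range(n_out)) for j in range(n_in)]
        for a in range(n_in):
            for b in range(n_in):
                gram[a][b] += v[a] * v[b]
    if average:
        # Stochastic sources: divide by the number of sources.
        gram = [[x / len(sources) for x in row] for row in gram]
    return gram

def identity_sources(n_out):
    """One-hot sources, i.e. the rows of an identity matrix (non-stochastic case)."""
    return [[1.0 if i == r else 0.0 for i in range(n_out)] for r in range(n_out)]

def rademacher_sources(n_sources, n_out, rng):
    """Random +1/-1 sources (assumed distribution, for illustration only)."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(n_out)] for _ in range(n_sources)]

# Toy Jacobian with 3 outputs and 2 inputs.
jac = [[1.0, 0.5], [-0.5, 2.0], [1.5, -1.0]]

exact = gram_from_sources(jac, identity_sources(3), average=False)
rng = random.Random(0)
approx = gram_from_sources(jac, rademacher_sources(4000, 3, rng), average=True)
```

Note the different normalisation in the two cases: identity sources are summed, while stochastic sources are averaged. Getting this factor wrong would rescale the estimate without changing its structure.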

Other changes

Make ablate_edges_and_eval also return a dictionary of "n_edges_required" if BisectSchedule is used (otherwise empty)
Make load_bases_and_ablate return the results rather than just results["results"]. Adapted tests / places where it is called.
Allow plot_ablations.py main() to take Path or str, and return out_file.
~~Fix a random plotting.py typo~~ (fixed in main)
Fix RootPath serialization
Handle 0 in SplitLN ablations: Make the node Variance rather than Variance + Eps

Notes

Monotonic scaling is not given. More edges are sometimes better and sometimes not.

Before merge

Undo .github changes, remove some tests

stefan-apollo commented 6 months ago

Pre-stochastic sources:

Post-stochastic sources with legacy settings (should be identical up to numerics): This corresponds to 110 or 212 "trivial sources".

In both cases, time to calculate Cs: 0.36 minutes.

n_stochastic_sources = 1 (Time to calculate Cs: 0.01 minutes):

n_stochastic_sources = 5 (Time to calculate Cs: 0.03 minutes):

n_stochastic_sources = 100:

Edit: On closer inspection of the pngs you do see small changes when stochastic sources are enabled, and even 100 sources are not enough.

This seems insanely good when judging by eye, will plot imshow comparisons of edges now.

stefan-apollo commented 6 months ago

Direct comparison of edge values: Normalisation seems off:

Though even with separate normalization it still looks this weird

stefan-apollo commented 6 months ago

Edge ablation comparison.

Full run

            "122": 1.0,
            "111": 0.9992167101827676,
            "101": 0.9981723237597911,
            "92": 0.995822454308094,
            "83": 0.948041775456919,
            "75": 0.927154046997389,
            "68": 0.8684073107049608,
            "61": 0.8052219321148825,

100 stochastic sources:

            "134": 1.0,
            "122": 0.9997389033942559,
            "111": 0.9994778067885117,
            "101": 0.9981723237597911,
            "92": 0.993733681462141,
            "83": 0.974934725848564,
            "75": 0.9506527415143603,
            "68": 0.9336814621409921,
            "61": 0.8986945169712793,
            "55": 0.7997389033942559,
            "49": 0.8203655352480418,

5 stochastic sources:

            "148": 1.0,
            "134": 0.9997389033942559,
            "122": 0.9997389033942559,
            "111": 0.9992167101827676,
            "101": 0.9950391644908616,
            "92": 0.9798955613577024,
            "83": 0.9644908616187989,
            "75": 0.9446475195822455,
            "68": 0.9052219321148826,
            "61": 0.8454308093994778,
            "55": 0.8140992167101828,
            "49": 0.8096605744125326,

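With 5 stochastic sources, ~25 more edges are needed for 100% accuracy (148 vs 122) and ~10 more for 99% accuracy (101 vs 92). With 100 stochastic sources, ~12 more edges are needed for 100% accuracy (134 vs 122).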
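
The edge counts behind this comparison can be read off the three result listings above. A small sketch that does so is given here; the helper name edges_required is hypothetical, and it simply takes the smallest listed edge count whose accuracy reaches the threshold. Only the edge counts that appear in the listings are considered, so the true minimum may lie between two listed values.

```python
def edges_required(results, threshold):
    """Smallest listed edge count with accuracy >= threshold (None if never reached)."""
    hits = [int(n) for n, acc in results.items() if acc >= threshold]
    return min(hits) if hits else None

# Accuracy by number of edges kept, copied from the listings above.
full_run = {
    "122": 1.0, "111": 0.9992167101827676, "101": 0.9981723237597911,
    "92": 0.995822454308094, "83": 0.948041775456919, "75": 0.927154046997389,
    "68": 0.8684073107049608, "61": 0.8052219321148825,
}
sources_100 = {
    "134": 1.0, "122": 0.9997389033942559, "111": 0.9994778067885117,
    "101": 0.9981723237597911, "92": 0.993733681462141, "83": 0.974934725848564,
    "75": 0.9506527415143603, "68": 0.9336814621409921, "61": 0.8986945169712793,
    "55": 0.7997389033942559, "49": 0.8203655352480418,
}
sources_5 = {
    "148": 1.0, "134": 0.9997389033942559, "122": 0.9997389033942559,
    "111": 0.9992167101827676, "101": 0.9950391644908616, "92": 0.9798955613577024,
    "83": 0.9644908616187989, "75": 0.9446475195822455, "68": 0.9052219321148826,
    "61": 0.8454308093994778, "55": 0.8140992167101828, "49": 0.8096605744125326,
}

runs = {"full": full_run, "100 sources": sources_100, "5 sources": sources_5}
summary = {
    name: {t: edges_required(res, t) for t in (1.0, 0.99)}
    for name, res in runs.items()
}
for name, row in summary.items():
    print(name, row)
```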

stefan-apollo commented 6 months ago

Todo:

Implement choice of stochastic sources on pos-only or hidden-dim-only

Question:

Should we rename n_stochastic_sources to n_stochastic_sources_edges? Done.

stefan-apollo commented 6 months ago

Implement choice of stochastic sources on pos-only or hidden-dim-only

Done in #302

stefan-apollo commented 6 months ago

Tested on TinyStories. Stochastic sources seem to not make much of a difference:

stefan-apollo commented 6 months ago

Sources on out_hidden rather than out_pos (above).

stefan-apollo commented 6 months ago

Edit: Nix and Lucius found the bug!

I'm currently confused about the performance of the "both" case. I remember edge ablations performing well (checking again now), but the resulting edge value comparisons seem to fail badly (see failing test).

stefan-apollo commented 6 months ago

Yeah ablation tests seem fine. Maybe normalization is broken? (compare to baseline above)

stefan-apollo commented 6 months ago

Huh, looks like I've added too many / too slow tests :(

stefan-apollo commented 6 months ago

Better interface: Have two n_ arguments rather than one n_ and one dim_ argument

DONE

stefan-apollo commented 6 months ago

Yeah ablation tests seem fine. Maybe normalization is broken?

The normalization for the "both" case seems all over the place

FIXED
