Closed stefan-apollo closed 6 months ago
Pre-stochastic sources:
Post-stochastic sources with legacy settings (should be identical up to numerics): This corresponds to 110 or 212 "trivial sources".
In both cases, time to calculate Cs: 0.36 minutes.
n_stochastic_sources = 1
(Time to calculate Cs: 0.01 minutes):
n_stochastic_sources = 5
(Time to calculate Cs: 0.03 minutes):
n_stochastic_sources = 100:
Edit: On closer inspection of the PNGs you do see small changes when stochastic sources are enabled, and even 100 is not enough.
This seems insanely good when judging by eye, will plot imshow comparisons of edges now.
Direct comparison of edge values: Normalisation seems off:
Though even with separate normalization it still looks this weird:
Edge ablation comparison.
Full run
"122": 1.0,
"111": 0.9992167101827676,
"101": 0.9981723237597911,
"92": 0.995822454308094,
"83": 0.948041775456919,
"75": 0.927154046997389,
"68": 0.8684073107049608,
"61": 0.8052219321148825,
100 stochastic sources:
"134": 1.0,
"122": 0.9997389033942559,
"111": 0.9994778067885117,
"101": 0.9981723237597911,
"92": 0.993733681462141,
"83": 0.974934725848564,
"75": 0.9506527415143603,
"68": 0.9336814621409921,
"61": 0.8986945169712793,
"55": 0.7997389033942559,
"49": 0.8203655352480418,
5 stochastic sources:
"148": 1.0,
"134": 0.9997389033942559,
"122": 0.9997389033942559,
"111": 0.9992167101827676,
"101": 0.9950391644908616,
"92": 0.9798955613577024,
"83": 0.9644908616187989,
"75": 0.9446475195822455,
"68": 0.9052219321148826,
"61": 0.8454308093994778,
"55": 0.8140992167101828,
"49": 0.8096605744125326,
Stochastic sources need ~25 more edges for 100% accuracy, and ~10 more edges for 99% accuracy
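The edge counts behind this comparison can be read off the lists above with a few lines of Python. This is only a sketch: the accuracies are copied from the runs above (rounded to four decimals, top of each list only), the helper name `min_edges_for` is made up, and it only considers edge counts that were actually evaluated.

```python
# Accuracy per number of edges kept, copied from the three runs above
# (rounded to 4 decimals; only the entries near the thresholds are needed).
full_run = {122: 1.0, 111: 0.9992, 101: 0.9982, 92: 0.9958, 83: 0.9480}
sources_100 = {134: 1.0, 122: 0.9997, 111: 0.9995, 101: 0.9982, 92: 0.9937, 83: 0.9749}
sources_5 = {148: 1.0, 134: 0.9997, 122: 0.9997, 111: 0.9992, 101: 0.9950, 92: 0.9799, 83: 0.9645}


def min_edges_for(results, threshold):
    """Smallest evaluated edge count whose accuracy is at least `threshold`."""
    return min(n for n, acc in results.items() if acc >= threshold)


for name, run in [("100 sources", sources_100), ("5 sources", sources_5)]:
    for threshold in (1.0, 0.99):
        extra = min_edges_for(run, threshold) - min_edges_for(full_run, threshold)
        print(f"{name}: {extra} extra edges for accuracy >= {threshold}")
```

For the numbers above this gives 26 and 9 extra edges with 5 sources, and 12 and 0 extra edges with 100 sources.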
Todo:
Question: n_stochastic_sources to n_stochastic_sources_edges? Done.
Implement choice of stochastic sources on pos-only or hidden-dim-only. Done in #302
Tested on TinyStories. Stochastic sources seem to not make much of a difference:
Sources on out_hidden rather than out_pos (above).
Edit: Nix and Lucius found the bug!
I'm currently confused about the performance of the "both" case. I remember edge ablations performed well (checking again now), but the resulting edge value comparisons seem to fail badly (see failing test).
Yeah ablation tests seem fine. Maybe normalization is broken?
(compare to baseline above)
Huh, looks like I've added too many / too slow tests :(
Better interface: Have two n_ arguments rather than one n_ and one dim_ argument
DONE
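A rough sketch of what the two-argument interface could look like. The class name `StochasticBasisConfig` and both properties are hypothetical and not taken from the repo; the two field names are the ones listed under Changes, and `None` for both is assumed to select the non-stochastic path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StochasticBasisConfig:
    """Hypothetical container for the two n_ arguments (not the repo's class)."""

    n_stochastic_sources_basis_pos: Optional[int] = None
    n_stochastic_sources_basis_hidden: Optional[int] = None

    @property
    def mode(self) -> str:
        """Which output dimension(s) get stochastic sources."""
        pos = self.n_stochastic_sources_basis_pos is not None
        hidden = self.n_stochastic_sources_basis_hidden is not None
        if pos and hidden:
            return "both"
        if pos:
            return "out_pos"
        if hidden:
            return "out_hidden"
        return "none"  # both None: non-stochastic result

    @property
    def n_sources_total(self) -> Optional[int]:
        """Total number of sources; with both set, only the product matters."""
        n_pos = self.n_stochastic_sources_basis_pos
        n_hidden = self.n_stochastic_sources_basis_hidden
        if n_pos is None and n_hidden is None:
            return None
        return (1 if n_pos is None else n_pos) * (1 if n_hidden is None else n_hidden)


print(StochasticBasisConfig().mode)
print(StochasticBasisConfig(n_stochastic_sources_basis_pos=5).mode)
print(StochasticBasisConfig(2, 3).n_sources_total)
```

Compared with one n_ plus one dim_ argument, the mode is inferred from which counts are set, so there is no way to pass a count together with a dimension it does not apply to.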
Yeah ablation tests seem fine. Maybe normalization is broken?
The normalization for the "both" case seems all over the place
FIXED
Implement stochastic sources (for LMs) for the Jacobian basis, over out_pos, out_hidden, or both.
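For context, a minimal sketch of the general idea, assuming "stochastic sources" means replacing the sum over output dimensions with projections onto random sign vectors (a Hutchinson-style estimator). The function names and the toy Jacobian are made up for illustration, and the implementation in this PR over out_pos and out_hidden may well differ.

```python
import itertools
import random


def exact_gram(jac):
    """Exact J^T J, summing over every output dimension (non-stochastic path)."""
    n_out, n_in = len(jac), len(jac[0])
    return [[sum(jac[o][a] * jac[o][b] for o in range(n_out)) for b in range(n_in)]
            for a in range(n_in)]


def gram_from_sources(jac, sources):
    """Estimate J^T J as the mean over source vectors v of (J^T v)(J^T v)^T."""
    n_out, n_in = len(jac), len(jac[0])
    est = [[0.0] * n_in for _ in range(n_in)]
    for v in sources:
        # One backward pass per source: contract the output dimension with v.
        g = [sum(v[o] * jac[o][i] for o in range(n_out)) for i in range(n_in)]
        for a in range(n_in):
            for b in range(n_in):
                est[a][b] += g[a] * g[b] / len(sources)
    return est


# Toy Jacobian with 4 output dims and 2 input dims (made-up numbers).
jac = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 0.5]]

# Averaging over all 2^4 sign vectors gives mean(v v^T) = I exactly, so the
# estimate matches the exact result; a small random subset only approximates it.
all_signs = list(itertools.product((-1.0, 1.0), repeat=4))
rng = random.Random(0)
few_sources = [rng.choice(all_signs) for _ in range(5)]

print(exact_gram(jac))
print(gram_from_sources(jac, all_signs))
print(gram_from_sources(jac, few_sources))
```

The cost scales with the number of sources rather than the number of output dimensions, which is where the speedup in calculating Cs would come from under this assumption.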
Changes
Add n_stochastic_sources_basis_pos and n_stochastic_sources_basis_hidden to give the number of stochastic edges and where to apply them (out pos dim, out hidden dim, or both). Setting both to None (default) gives the non-stochastic result.
The individual n_ values do not matter; only their product matters.
Rename n_stochastic_sources to n_stochastic_sources_edges.
Rename stochastic edges to squared edges, and infer whether to use stochastic sources from n_stochastic_sources_edges is not None.
phi takes a bunch of GPU memory, but in_grads takes even more memory so don't worry about it now. There was no easy way to avoid it.
Other changes
Make ablate_edges_and_eval also return a dictionary of "n_edges_required" if BisectSchedule is used (otherwise empty).
Make load_bases_and_ablate return the results rather than just results["results"]. Adapted tests / places where it is called.
Change plot_ablations.py main() to take Path or str, and return out_file.
Fix a random plotting.py typo (fixed in main).
Notes
Before merge