Closed DavidUdell closed 8 months ago
Good idea, past self! Indeed, there is painful mode collapse when I run ablations and look only at greedily sampled continuations. I need to fix this; until I do, it screens off all my results so far, for better or worse.
TODO: Build out the logits comparison in that script.
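A rough sketch of what that logits comparison could look like, assuming I already have next-token logits from the base and the ablated model as plain lists of floats. The function names (`kl_divergence`, `top_logit_shifts`) and the toy logits are placeholders for illustration, not what the script currently contains.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, ablated_logits):
    """KL(base || ablated) between next-token distributions, in nats."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def top_logit_shifts(base_logits, ablated_logits, k=3):
    """(token_id, change in log-prob) pairs, largest absolute change first."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    diffs = [(i, lq - lp) for i, (lp, lq) in enumerate(zip(log_p, log_q))]
    return sorted(diffs, key=lambda pair: abs(pair[1]), reverse=True)[:k]


# Toy example (made-up numbers): the ablation suppresses token 1 only.
base = [2.0, 1.0, 0.1, -1.0]
ablated = [2.0, -1.0, 0.1, -1.0]
print(kl_divergence(base, ablated))
print(top_logit_shifts(base, ablated, k=2))
```

Comparing full distributions this way sidesteps the greedy-sampling problem: a targeted ablation should move a handful of tokens, while a mangled model should show a large KL with shifts smeared across the vocabulary.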
My credence in the mangling hypothesis is way up. I suspect I'll have to at least make my ablations more surgical, leaving more or most of each sequence intact to set up context. I may also have to mess with the hooks again, though I'm not quite sure what to do there except scale down the whole subtracted projection (equivalent to relatively upweighting the residual connection for output[0]).
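For reference, a minimal sketch of the "scale down the subtracted projection" idea on a single activation vector. `subtract_projection` and its `scale` parameter are hypothetical names; the real hook operates on tensors, and this only shows the arithmetic.

```python
def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def subtract_projection(activation, direction, scale=1.0):
    """Remove `scale` times the projection of `activation` onto `direction`.

    scale=1.0 is the full ablation (result is orthogonal to `direction`);
    scale=0.0 leaves the activation untouched; values in between are the
    gentler, partial ablations.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be nonzero")
    coeff = scale * dot(activation, direction) / norm_sq
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy example (made-up numbers).
activation = [3.0, 4.0]
direction = [1.0, 0.0]
print(subtract_projection(activation, direction, scale=1.0))
print(subtract_projection(activation, direction, scale=0.5))
```

Sweeping `scale` from 0 to 1 and watching where the logits start to fall apart would be one way to tell a mangled model from a genuinely ablated feature.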
As another pseudo-test, I'll write a `chat.py` script in `interp_tools/`. That way, I can check whether my logits make sense and whether the model is manifestly too dumb to have a circuit for something.