Write logits comparison in `chat.py`

DavidUdell / sparse_circuit_discovery

Circuit discovery in GPT-2 small, using sparse autoencoding

MIT License


Write logits comparison in `chat.py` #56

Closed: DavidUdell closed this 8 months ago

DavidUdell commented 8 months ago

As another pseudo-test, I'll write a `chat.py` script in `interp_tools/`. That way, I can check whether my logits make sense and whether the model is manifestly too dumb to have a circuit for something.

DavidUdell commented 8 months ago

Good idea, past self! Indeed, there is painful mode collapse when I do ablations and just look at greedily sampled continuations. I need to fix this; until it is fixed, it screens off all my results so far, for better or worse.

DavidUdell commented 8 months ago

TODO: Build out the logits comparison in that script.
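A minimal sketch of what such a logits comparison could look like, assuming the script already has next-token logits from a base run and an ablated run. It uses plain Python lists and toy values; `compare_logits`, the vocabulary, and the numbers are all hypothetical and not taken from the repo. The idea is to look at how the whole next-token distribution shifts, rather than only at the greedy continuation.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def compare_logits(base_logits, ablated_logits, vocab, top_k=3):
    """Summarize how an ablation moved the next-token distribution.

    Hypothetical helper: returns the KL divergence from base to ablated,
    plus the top_k tokens whose probability changed the most.
    """
    base_probs = softmax(base_logits)
    ablated_probs = softmax(ablated_logits)
    deltas = [a - b for a, b in zip(ablated_probs, base_probs)]
    order = sorted(range(len(vocab)), key=lambda i: abs(deltas[i]), reverse=True)
    top_shifts = [(vocab[i], base_probs[i], ablated_probs[i]) for i in order[:top_k]]
    return {"kl": kl_divergence(base_probs, ablated_probs), "top_shifts": top_shifts}


# Toy stand-ins for real model outputs (assumed values, not measured).
vocab = [" Paris", " London", " the", " a"]
base = [5.0, 2.0, 1.0, 0.5]
ablated = [1.0, 2.0, 3.0, 0.5]

report = compare_logits(base, ablated, vocab)
print(f"KL(base || ablated) = {report['kl']:.3f}")
for token, before, after in report["top_shifts"]:
    print(f"{token!r}: {before:.3f} -> {after:.3f}")
```

A large KL with probability mass piling onto a few generic tokens would be one sign of the mode collapse described above; a small KL with a targeted shift would suggest the ablation is behaving more surgically.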

My credence in the mangling hypothesis is way up. I suspect I'll have to at least make my ablations more surgical, leaving more or most of each sequence intact to set up context. I may also have to mess with the hooks again, though I'm not quite sure what to do there except to scale down the whole subtracted projection (equivalent to relatively upweighting the residual connection for `output[0]`).
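To make the "scale down the subtracted projection" idea concrete, here is a rough sketch. It assumes the ablation removes the component of an activation vector along some feature direction; the actual hooks are not shown in this thread, so `scaled_projection_ablation` and its `scale` parameter are hypothetical, and the code operates on plain lists rather than on the model's `output[0]` tensor.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scaled_projection_ablation(activation, direction, scale=1.0):
    """Subtract a scaled projection of `activation` onto `direction`.

    scale=1.0 removes the full component along `direction`;
    scale=0.0 leaves the activation untouched;
    values in between remove only part of it (assumed knob, not the repo's).
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be a nonzero vector")
    coeff = scale * dot(activation, direction) / norm_sq
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy activation and feature direction (assumed values).
activation = [3.0, 4.0]
direction = [1.0, 0.0]

full = scaled_projection_ablation(activation, direction, scale=1.0)
half = scaled_projection_ablation(activation, direction, scale=0.5)
print(full)  # component along `direction` fully removed
print(half)  # only half of that component removed
```

With a smaller `scale`, more of the original activation survives the ablation, which is the sense in which the untouched signal is relatively upweighted.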