What I've learned so far is that the simplest blunt approach doesn't work. If you just zero out a feature during every model forward pass, downstream features don't respond the way you'd semantically expect if the model had simply stopped thinking that thought. Of course, I can do more to confirm this theory, but that's my current conjecture based on my work so far.
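For concreteness, here is a minimal sketch of what I mean by the blunt approach, written against placeholder names (`model`, `autoencoder`, `input_ids`, `FEATURE_IDX`, `LAYER` are all assumptions, not the actual codebase's API): encode the residual stream into the autoencoder's feature basis, zero one feature at every position on every pass, and decode back.

```python
import torch

# Placeholder constants; real feature and layer indices come from the
# autoencoder training run, not this sketch.
FEATURE_IDX = 1234
LAYER = 6

def ablate_feature_hook(module, inputs, output):
    """Zero one autoencoder feature at every sequence position, on every pass."""
    activations = output[0] if isinstance(output, tuple) else output
    features = autoencoder.encode(activations)   # (batch, seq, n_features)
    features[..., FEATURE_IDX] = 0.0             # the blunt, whole-context ablation
    patched = autoencoder.decode(features)
    if isinstance(output, tuple):
        return (patched,) + output[1:]
    return patched

# `model`, `autoencoder`, and `input_ids` are assumed to already exist.
handle = model.layers[LAYER].register_forward_hook(ablate_feature_hook)
with torch.no_grad():
    logits = model(input_ids)
handle.remove()
```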
So, the next step, which is also part of replicating prior work anyway, is to more surgically ablate or scale features only at max-activating sequence positions or only at final sequence positions. If I can get clean graphs there, that's evidence that the above "mangled context" theory is on the right track.
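The surgical variant only changes where the zeroing happens. A sketch under the same placeholder assumptions as above, supporting either the final position or the per-example max-activating position:

```python
import functools
import torch

def ablate_at_position_hook(module, inputs, output, mode="final"):
    """Zero the feature only at the final position ("final") or at the
    position where it activates most strongly ("max")."""
    activations = output[0] if isinstance(output, tuple) else output
    features = autoencoder.encode(activations)   # (batch, seq, n_features)
    if mode == "final":
        positions = torch.full(
            (features.shape[0],), features.shape[1] - 1,
            dtype=torch.long, device=features.device,
        )
    else:  # "max": per example, where this feature fires hardest
        positions = features[..., FEATURE_IDX].argmax(dim=1)
    batch_idx = torch.arange(features.shape[0], device=features.device)
    features[batch_idx, positions, FEATURE_IDX] = 0.0   # targeted ablation
    patched = autoencoder.decode(features)
    if isinstance(output, tuple):
        return (patched,) + output[1:]
    return patched

# e.g., ablate only where the feature fires hardest:
handle = model.layers[LAYER].register_forward_hook(
    functools.partial(ablate_at_position_hook, mode="max")
)
```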
It's unfortunate that we can't just naively scale or ablate feature dimensions found using autoencoders and get semantically appropriate downstream effects. It would have been quite nice to get the model's "conceptual API" that easily. My biggest remaining glimmer of hope is that I haven't yet replicated the best causal graph results achieved, so I want to do that and then try all the incremental steps from an already successful starting point. Still, the most naive approach failing is a bad sign for alignment going well.
Arguably, ablating only the last sequence position is having effects with the right signs, for the very first thing I tried. But it's definitely not night and day: I can't explain why so many other, apparently unrelated, features are also affected, and more strongly at that. I'll try the max-activating sequence positions next.
Now fully replicating Cunningham et al.
In principle, all the causal graphing infrastructure is now in place! Before adding any further features, I need to be getting clean, uncluttered, sparse graphs back.
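For reference, here is roughly how I'm thinking about scoring one set of causal edges, again written against placeholder names (`downstream_autoencoder`, `DOWNSTREAM_LAYER`, the threshold value, and the reuse of `ablate_feature_hook` from the sketch above are all assumptions, not the repo's real API): compare downstream feature activations with and without the upstream feature ablated, then threshold small effects so the graph stays sparse.

```python
import torch

# Placeholder downstream layer with its own trained autoencoder.
DOWNSTREAM_LAYER = 10
captured = {}

def capture_downstream_hook(module, inputs, output):
    activations = output[0] if isinstance(output, tuple) else output
    captured["features"] = downstream_autoencoder.encode(activations).detach()

def downstream_features(with_ablation: bool) -> torch.Tensor:
    """Return downstream feature activations at the final sequence position,
    with or without the upstream feature ablated."""
    handles = [
        model.layers[DOWNSTREAM_LAYER].register_forward_hook(capture_downstream_hook)
    ]
    if with_ablation:
        handles.append(model.layers[LAYER].register_forward_hook(ablate_feature_hook))
    with torch.no_grad():
        model(input_ids)
    for handle in handles:
        handle.remove()
    return captured["features"][:, -1, :]

baseline = downstream_features(with_ablation=False)
ablated = downstream_features(with_ablation=True)
effects = (ablated - baseline).mean(dim=0)  # mean effect per downstream feature
effects[effects.abs() < 0.1] = 0.0          # arbitrary threshold, to keep the graph sparse
```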