UFO-101 / auto-circuit

A library for efficient patching and automatic circuit discovery.
https://UFO-101.github.io/auto-circuit

Integrated Gradients not faithful to original formulation #6

Open oliveradk opened 1 month ago

oliveradk commented 1 month ago

The current implementation of integrated gradients interpolates over the entire patch mask at once. However, this introduces dependencies between upstream and downstream nodes. Let f_k(x) denote the output of component k on input x, let x' denote the patch input, and let f_k_alpha(x) denote the output of component k when the outputs of the preceding components are interpolated with alpha. We want to replace f_k(x) with alpha * f_k(x) + (1-alpha) * f_k(x'), but instead we're replacing it with alpha * f_k_alpha(x) + (1-alpha) * f_k_alpha(x'). Sparse Feature Circuits addresses this in section 2:

This [IG] cannot be done in parallel for two nodes when one is downstream of another, but can be done in parallel for arbitrarily many nodes which do not depend on each other. Thus the additional cost of computing $\hat{\mathrm{IE}}_{\mathrm{ig}}$ over $\hat{\mathrm{IE}}_{\mathrm{atp}}$ scales linearly in N and the serial depth of m’s computation graph.

I don't think the edge case is any different, so I think the current implementation in prune_algos.mask_gradient is incorrect and should be adjusted to compute scores iteratively over source node layers.
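
To make the difference concrete, here is a minimal toy sketch, not the auto-circuit implementation. It assumes a two-node chain with scalar activations, node-level (rather than edge-level) masks, a mask convention where 0 keeps the current activation and 1 substitutes a cached patch activation, and finite-difference gradients. All names (`f1`, `f2`, `run`, `ig_simultaneous`, `ig_layerwise`) are hypothetical.

```python
# Toy chain: x -> node 1 -> node 2, with metric y = a1 + a2.
# Assumed convention: mask 0 = current activation, mask 1 = cached patch activation.

def f1(x):
    return 2.0 * x

def f2(h):
    return h * h

X_CLEAN, X_PATCH = 1.0, 2.0
P1 = f1(X_PATCH)   # cached patch activation of node 1
P2 = f2(P1)        # cached patch activation of node 2

def run(m1, m2):
    """Forward pass with interpolation masks on both nodes."""
    a1 = (1 - m1) * f1(X_CLEAN) + m1 * P1
    # Node 2's current output is computed from the already-interpolated a1,
    # which is where the upstream/downstream dependency comes from.
    a2 = (1 - m2) * f2(a1) + m2 * P2
    return a1 + a2

def grad(masks, k, eps=1e-4):
    """Central finite-difference gradient of the metric w.r.t. mask k."""
    hi, lo = list(masks), list(masks)
    hi[k] += eps
    lo[k] -= eps
    return (run(*hi) - run(*lo)) / (2 * eps)

def ig_simultaneous(n_steps=50):
    """Interpolate every mask together (midpoint rule over alpha)."""
    scores = [0.0, 0.0]
    for i in range(n_steps):
        alpha = (i + 0.5) / n_steps
        for k in range(2):
            scores[k] += grad([alpha, alpha], k) / n_steps
    return scores

def ig_layerwise(n_steps=50):
    """Interpolate one node at a time, leaving the other mask at 0."""
    scores = [0.0, 0.0]
    for k in range(2):
        for i in range(n_steps):
            alpha = (i + 0.5) / n_steps
            masks = [0.0, 0.0]
            masks[k] = alpha
            scores[k] += grad(masks, k) / n_steps
    return scores

if __name__ == "__main__":
    print("simultaneous:", ig_simultaneous())
    print("layer-wise:  ", ig_layerwise())
```

In this toy setup, the layer-wise score for each node matches the effect of fully patching that node alone, while the simultaneous scores only sum to the effect of patching both nodes together, so the two schemes attribute differently once one node is downstream of another.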

(This would change the time complexity from O(forward * N) to O(forward * n_layers * N), so it may be best added as an optional setting.)

oliveradk commented 1 month ago

Eh, nvm, I guess it's just a matter of what you want to "normalize" against, and normalizing against edges seems appropriate.

UFO-101 commented 1 month ago

This implementation predates the Sparse Feature Circuits paper by several months, so it was not supposed to be an implementation of their method.

However, I agree with their reasoning that it often makes more sense not to do it for different layers simultaneously. I'd be happy for you or someone else to reopen this issue. I do not plan to implement it myself in the immediate future.