Open dtch1997 opened 1 week ago
Assuming we implement #5, we could natively support SAE node attributions without splicing as follows:

Insight: SAE attributions can be found via path patching.
- `SAE feature act -> SAE output -> Dest` w.r.t. `Metric`
- `SAE act post -> Dest` w.r.t. `Metric`
- `Src -> SAE input -> SAE act post`. Since `SAE input` is a "backward blanket" of `SAE act post`, this reduces to attribution-patching the edge `Src -> SAE act post` w.r.t. `Metric`.

How do we do path attribution patching? I think it's just the chain rule applied to EAP.
- `A -> B -> C` w.r.t. `Metric`: `(dMetric/dC) * (dC/dB) * (dB/dA) * (A_clean - A_corrupt)`
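
To sanity-check the chain-rule idea, here is a minimal sketch in plain Python. It assumes a purely linear toy graph, so the Jacobians are just weight matrices. Every name in it (`W_ab`, `W_bc`, `W_ac`, `w_metric`, `path_attribution`) is hypothetical and not part of this repo. The extra direct edge `A -> C` is there only to show that the path score for `A -> B -> C` leaves it out.

```python
# Toy linear graph: A -> B -> C -> Metric, plus a direct edge A -> C
# that is NOT part of the path A -> B -> C. All names are illustrative.

def matvec(W, x):
    # Matrix-vector product: W maps an input vector to an output vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vecmat(g, W):
    # Vector-Jacobian product: pull the gradient g back through W.
    return [sum(g[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

W_ab = [[1.0, 2.0], [0.5, -1.0]]  # dB/dA
W_bc = [[2.0, 0.0], [1.0, 3.0]]   # dC/dB
W_ac = [[0.1, 0.2], [0.3, 0.4]]   # direct edge A -> C (off the path)
w_metric = [1.0, -2.0]            # dMetric/dC

def forward(a):
    # Full model: C receives input from B and, directly, from A.
    b = matvec(W_ab, a)
    c = [x + y for x, y in zip(matvec(W_bc, b), matvec(W_ac, a))]
    return sum(w * ci for w, ci in zip(w_metric, c))

def path_attribution(a_clean, a_corrupt):
    # (dMetric/dC) * (dC/dB) * (dB/dA) * (A_clean - A_corrupt)
    g_b = vecmat(w_metric, W_bc)  # gradient of Metric w.r.t. B along the path
    g_a = vecmat(g_b, W_ab)       # gradient of Metric w.r.t. A along the path
    delta = [x - y for x, y in zip(a_clean, a_corrupt)]
    return sum(g * d for g, d in zip(g_a, delta))

score = path_attribution([1.0, 2.0], [0.0, 1.0])
```

Because the toy graph is linear, the path score plus the direct-edge contribution adds up exactly to `forward(a_clean) - forward(a_corrupt)`. In a real model with nonlinearities this would only be a first-order approximation.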