Weighted PCA

Description

$$G_{ij} = \sum_{x,s,t} \left[f^l_{si}(x) \sum_k \left(\frac{\partial L}{\partial f^l_{sk}}(x) \frac{\partial L}{\partial f^l_{tk}}(x)\right) f^l_{tj}(x)\right]$$

The reason to do this is to have a better approximation to the overall orthogonalisation of $\frac{\partial L}{\partial f^l_i}(x) f^l_j(x)$ by diagonalising each independently.
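As a rough illustration of the formula for $G_{ij}$ above, the contraction can be written with two `einsum` calls. This is a minimal NumPy sketch, not the code in this PR: the function name `weighted_gram`, the `(batch, position, hidden)` array layout, and the random inputs are all assumptions made for the example.

```python
import numpy as np


def weighted_gram(f, g):
    """Gradient-weighted Gram matrix (hypothetical helper, not from the PR).

    f: activations f^l,       shape (batch x, position s, hidden i)
    g: gradients dL/df^l,     shape (batch x, position s, hidden k)
    """
    # Metric induced by the gradients: M[x, s, t] = sum_k g[x, s, k] * g[x, t, k]
    metric = np.einsum("xsk,xtk->xst", g, g)
    # G[i, j] = sum_{x, s, t} f[x, s, i] * M[x, s, t] * f[x, t, j]
    return np.einsum("xsi,xst,xtj->ij", f, metric, f)


# Random stand-ins for real activations and gradients (assumed shapes).
rng = np.random.default_rng(0)
f = rng.normal(size=(4, 3, 5))
g = rng.normal(size=(4, 3, 5))

G = weighted_gram(f, g)
# G is symmetric, so eigh gives the directions for the weighted PCA.
eigvals, eigvecs = np.linalg.eigh(G)
```

Building the per-datapoint metric first keeps the two factors of the formula visible; for long sequences it may be cheaper to contract `f` with `g` over positions first and avoid materialising the `(position, position)` matrix.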

Related Issue

Motivation and Context

The hope was that this would do better than normal PCA at identifying the right directions, because it assumes less about the independence of the gradient term and the function term in the large 4-tensor we are trying to orthogonalise. In practice, we ran a short experiment on TinyStories comparing weighted SVD to normal SVD, then did a node ablation, and found that the ablation curve was better (loss stays lower for longer) with unweighted SVD.

How Has This Been Tested?

Checked that it agrees with unweighted PCA (in the sense of giving the same node ablation curve) when we substitute an identity matrix for the metric induced by the gradients.
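A simpler, matrix-level version of this sanity check can be sketched as follows. It is not the node ablation comparison described above, and it assumes that "identity matrix" means an identity over sequence positions, that activations have shape `(batch, position, hidden)`, and that no centering is applied. The helper name `gram_with_metric` is hypothetical.

```python
import numpy as np


def gram_with_metric(f, metric):
    """G[i, j] = sum_{x, s, t} f[x, s, i] * metric[x, s, t] * f[x, t, j]."""
    return np.einsum("xsi,xst,xtj->ij", f, metric, f)


rng = np.random.default_rng(0)
batch, pos, hidden = 4, 3, 5
f = rng.normal(size=(batch, pos, hidden))

# Replace the gradient-induced metric with an identity over positions.
identity_metric = np.broadcast_to(np.eye(pos), (batch, pos, pos))
G_identity = gram_with_metric(f, identity_metric)

# Ordinary (uncentered) Gram matrix over all datapoints and positions.
flat = f.reshape(-1, hidden)
G_unweighted = flat.T @ flat

matches = np.allclose(G_identity, G_unweighted)
```

With the identity metric the sum over `s` and `t` collapses to a single sum over positions, so the two matrices should coincide and `matches` should be true.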

Does this PR introduce a breaking change?

It breaks when centering is present.

ApolloResearch / rib

Implementation of PCA weighted by the size of the gradient on each datapoint #341