Thanks for pointing this out. It's a very new idea and has not been fully fleshed out yet.
I'm happy to try other things that work more consistently and are better justified theoretically. For example, we could try to estimate the density of the ESD using a KDE, which is what is done when running the MP fits.
I think making a histogram the same way you do for plotting the ESD should work, except that you would need the bins to be the same each time you make the histogram, rather than dynamically sizing the bins the way matplotlib does under the hood.
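For illustration, here is a minimal sketch of that fixed-bin idea; the random matrices and the bin range are placeholders, and only numpy and scipy are assumed:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Toy stand-ins for two weight matrices whose ESDs we want to compare.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((500, 300)) / np.sqrt(500)
W2 = rng.standard_normal((500, 300)) / np.sqrt(500)

ev1 = np.linalg.eigvalsh(W1.T @ W1)
ev2 = np.linalg.eigvalsh(W2.T @ W2)

# Fix the bin edges once and reuse them for both ESDs, instead of letting
# matplotlib choose different edges for each histogram.
edges = np.linspace(0.0, max(ev1.max(), ev2.max()), 101)
h1, _ = np.histogram(ev1, bins=edges, density=True)
h2, _ = np.histogram(ev2, bins=edges, density=True)

# scipy normalizes each input to sum to 1, so densities work directly here.
js_distance = jensenshannon(h1, h2)
```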
By the way, do you think that the heavy-tailed distribution fits with the manifold and information-bottleneck interpretations of how deep nets generalize? You show that heavy-tailed matrices are often lower rank. That is equivalent to saying they are projecting onto a lower-dimensional manifold, or that they are throwing away information. The idea is that the information being lost must be the noise, since the loss is a result of training.
I know there are issues with the manifold and information-bottleneck interpretations.
In quantum physics, the generalized matrix entropy (i.e., the von Neumann entropy) can be expressed directly in terms of the eigenvalues, and this is what we are doing in our JMLR paper:
https://www.jmlr.org/papers/volume22/20-410/20-410.pdf
See the discussion in Section 2 and Figure 1.
So it's not clear to me why we can't just do this with an arbitrary cross-entropy / divergence as well. I'm open to trying other things if they work well from an engineering point of view, but, theoretically, I don't see the problem.
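For concreteness, here is a minimal sketch of the eigenvalue form of the matrix entropy; this is a simplified version, and the paper's exact normalization (e.g., dividing by the log of the rank) may differ:

```python
import numpy as np

def matrix_entropy(W):
    """Von Neumann-style entropy computed from the ESD of X = W^T W:
    normalize the eigenvalues into a probability vector p and take
    the Shannon entropy of p."""
    eigvals = np.linalg.eigvalsh(W.T @ W)
    eigvals = np.clip(eigvals, 0.0, None)   # guard against round-off negatives
    p = eigvals / eigvals.sum()
    p = p[p > 0]                            # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))
```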
I imagine rand_distance could use some more work to make it more accurate and robust. (Again, we do have a KDE estimator for the densities; however, this is not so easy to do in a general, automated way for every possible ESD.)
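A sketch of what such a KDE-based comparison could look like, using scipy.stats.gaussian_kde; the shared evaluation grid and its range are arbitrary choices:

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import jensenshannon

def esd_kde(W, grid):
    """KDE estimate of the ESD of X = W^T W, evaluated on a fixed grid."""
    eigvals = np.linalg.eigvalsh(W.T @ W)
    return gaussian_kde(eigvals)(grid)

rng = np.random.default_rng(0)
W = rng.standard_normal((500, 300)) / np.sqrt(500)
W_rand = rng.standard_normal((500, 300)) / np.sqrt(500)

# Both densities must be evaluated on the same support to be comparable.
grid = np.linspace(0.0, 5.0, 512)
distance = jensenshannon(esd_kde(W, grid), esd_kde(W_rand, grid))
```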
The heavy-tailed weight matrices do not lose (hard) rank. In fact, we specifically show this in our JMLR paper; see Figure 28. They do lose soft rank, but that is different from losing hard rank.
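To make the distinction concrete, here is a toy comparison; it uses the stable rank as one common notion of soft rank, which may differ in detail from the paper's definition:

```python
import numpy as np

rng = np.random.default_rng(0)
# Heavy-tailed matrix: Student-t entries with 2 degrees of freedom.
W = rng.standard_t(df=2, size=(500, 300))

sv = np.linalg.svd(W, compute_uv=False)

# Hard rank: number of singular values above a numerical tolerance.
tol = sv.max() * max(W.shape) * np.finfo(W.dtype).eps
hard_rank = int(np.sum(sv > tol))                      # typically full rank (300)

# Stable ("soft") rank: total spectral mass relative to the top singular value.
stable_rank = float((sv ** 2).sum() / sv.max() ** 2)   # much smaller than 300
```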
However, we do find that we can get an estimate of the out-of-sample test accuracy by evaluating the training accuracy on a low-rank approximation of each layer. This is described in our latest paper, where we call it SVDSmoothing: https://arxiv.org/abs/2106.00734
WeightWatcher includes an SVDSmoothing function, which generates a low-rank approximation of a pre-trained model. This seems to work in some cases, but not all, and we don't know why yet.
I recommend trying it and seeing how it works for you.
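Something along these lines should do it; the torchvision model is just a placeholder, and the exact keyword arguments may vary between weightwatcher releases:

```python
import weightwatcher as ww
import torchvision.models as models

# Any pre-trained pytorch (or keras) model will do as a test case.
model = models.resnet18(pretrained=True)
watcher = ww.WeightWatcher(model=model)

# Replace each layer's weight matrix with a low-rank approximation.
smoothed_model = watcher.SVDSmoothing(model=model)

# Evaluating smoothed_model on the *training* data then gives the estimate
# of out-of-sample accuracy described in the paper.
```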
Thanks. Now I see that you discussed the information bottleneck in your JMLR paper.
This is also like the way JPEG does denoising. The quantization coefficients are analogous to the eigenvalues, and the DCT transform is like the eigenvectors. The smaller quantization coefficients correspond to the high-frequency components of the DCT, because that is where the noise is. So JPEG is also losing soft rank rather than hard rank.
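A toy sketch of the analogy; the 8x8 block, the random data, and the thresholds are all arbitrary, and scipy.fft supplies the DCT:

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
block = rng.standard_normal((8, 8))       # stand-in for an 8x8 image block

# JPEG-style: small (mostly high-frequency) DCT coefficients are zeroed out,
# a crude stand-in for quantization.
coeffs = dctn(block, norm="ortho")
coeffs[np.abs(coeffs) < 0.5] = 0.0
jpeg_like = idctn(coeffs, norm="ortho")

# SVD-style: small singular values are zeroed out (soft-rank reduction).
U, s, Vt = np.linalg.svd(block)
s[s < 0.5] = 0.0
svd_like = (U * s) @ Vt
```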
If the smaller eigenvalues correspond to overfitting, then that would explain why SVDSmoothing gives an estimate of the test accuracy.
> If the smaller eigenvalues correspond to overfitting, then that would explain why SVDSmoothing gives an estimate of the test accuracy.
In our most recent paper, we motivate SVDSmoothing as a data-dependent shape metric.
On line 1306 of weightwatcher.py, the sorted eigenvalues are plugged into the Jensen-Shannon divergence just as if the eigenvalues were counts in a histogram. I understand that this is still a divergence, in the sense that the number will be bigger for a less similar set of eigenvalues, and you have shown that this works empirically. But it isn't making theoretical sense to me: when you plot the ESD, you use bins, as I would expect. This may be an ambiguity in the documentation rather than a code issue.
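For reference, here is a minimal sketch of the behavior being described (not the actual weightwatcher code; the matrices are toy stand-ins):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
W = rng.standard_t(df=2, size=(500, 300))   # "trained-like" heavy-tailed matrix
W_rand = rng.standard_normal((500, 300))    # randomized comparison matrix

ev = np.sort(np.linalg.eigvalsh(W.T @ W))
ev_rand = np.sort(np.linalg.eigvalsh(W_rand.T @ W_rand))

# The sorted eigenvalue vectors go straight into the divergence, as if each
# eigenvalue were a bin count. scipy normalizes each vector to sum to 1, so
# this is a valid JS distance between two probability vectors -- just not a
# distance between two binned ESDs.
d_raw = jensenshannon(ev, ev_rand)
```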