facebookresearch / fmmax

Fourier modal method with Jax
MIT License

Variations to the Lorentzian Broadening #105

Closed · opened by smartalecH, closed 1 month ago

smartalecH commented 5 months ago

To condition the manual vjp of the eigenvectors, fmmax uses a Lorentzian broadening term (like many other FMM/RCWA implementations):

https://github.com/facebookresearch/fmmax/blob/a509ffa99afd325673483893bf76efcf96ddba0d/src/fmmax/utils.py#L164-L168

That being said, the current formulation looks particularly sensitive to phase differences between eigenvalues: the convolution with the Lorentzian test function should be complex-valued. One possible change could be:

$$ f_0 = \frac{\bar{Z}}{\bar{Z}Z + \epsilon} $$

where

$$ Z = \lambda_i-\lambda_j $$
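For concreteness, a minimal JAX sketch of this proposed form (the function name, broadcasting convention, and default eps value are illustrative, not the fmmax API):

import jax.numpy as jnp

def broadened_inverse_difference(eigenvalues, eps=1e-6):
    # Z[..., i, j] = lambda_i - lambda_j; return conj(Z) / (conj(Z) * Z + eps).
    z = eigenvalues[..., :, jnp.newaxis] - eigenvalues[..., jnp.newaxis, :]
    return jnp.conj(z) / (jnp.abs(z) ** 2 + eps)

Note that $\bar{Z}Z = |Z|^2$, so the denominator is real and bounded below by $\epsilon$.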

mfschubert commented 5 months ago

This new expression is now live in v0.8.0. We should follow up with a derivation of the proper expression.

elinorbgr commented 2 months ago

Following this change, it may be beneficial to reduce the value of the EPS_EIG constant there:

https://github.com/facebookresearch/fmmax/blob/1c8618b6d9b4d68fb2794b8f59489c8c1c5a9bd5/src/fmmax/utils.py#L12

The previous implementation did not actually fully prevent the computation from being unstable: if $z$ is a complex number, $z^2+\epsilon$ can still be close to 0 when $z$ is close to $i \sqrt{\epsilon}$, whereas $|z|^2+\epsilon$ cannot. I suspect this is why you had $\epsilon$ set to such a large value?
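A quick numerical illustration of that instability (values chosen purely for illustration):

import jax.numpy as jnp

eps = 1e-6
z = 1j * jnp.sqrt(eps)        # an eigenvalue difference close to i * sqrt(eps)
print(z ** 2 + eps)           # approximately 0: the old denominator can vanish
print(jnp.abs(z) ** 2 + eps)  # about 2e-6: the new denominator is bounded below by eps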

In any case, from my experience using FMMAX, changing this constant to a smaller value like 1e-20 is very beneficial for gradient descent. Without that change, the gradient estimation is very noisy, and convergence is much slower, when there is convergence at all.

mfschubert commented 2 months ago

Hi @elinorbgr thanks for reporting this.

Can you say more about the situations in which you are encountering issues? The value of 1e-6 was not carefully chosen, so it's good to revisit the selection.

Ultimately, it would be great to add a test (e.g. here) which motivates the change.

smartalecH commented 2 months ago

Thanks for your input!

I suspect this is why you had $\epsilon$ set to such a large value

I think the current value stems from the fact that by default, jax (and thus, fmmax) loads in single precision. So any arithmetic operation with a scalar below 1e-7 is subject to roundoff. (Although it would be nice, @mfschubert, if we had a function that picked the right tolerance for the specified precision dynamically -- passing it manually through the chain is a bit cumbersome).
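(For what it's worth, such a helper might look roughly like the sketch below; this is not an existing fmmax function, just an illustration based on the machine epsilon of the active precision.)

import jax.numpy as jnp

def default_eps(dtype=jnp.float32, scale=1e2):
    # Hypothetical helper: pick a conditioning epsilon a couple of orders of magnitude
    # above the machine epsilon of the given dtype (roughly 1e-5 for float32, 2e-14 for float64).
    return scale * float(jnp.finfo(dtype).eps)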

Without that change, the gradient estimation is very noisy

Have you checked the gradient for your problem? We've done some finite-difference checks. We see pretty good accuracy with double precision, and usually pretty good accuracy with single precision (as always, it depends on the problem). @lidongzh can you comment on your recent tests?
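For reference, a finite-difference check of this kind can be sketched as follows (a generic illustration, not the actual fmmax test code; loss_fn and params are placeholders for a real-valued objective and its parameters):

import jax
import jax.numpy as jnp

def finite_difference_check(loss_fn, params, step=1e-3):
    # Compare jax.grad against a central finite difference along a random direction.
    grad = jax.grad(loss_fn)(params)
    direction = jax.random.normal(jax.random.PRNGKey(0), params.shape)
    fd = (loss_fn(params + step * direction) - loss_fn(params - step * direction)) / (2 * step)
    return fd, jnp.vdot(direction, grad)  # the two values should roughly agree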

and convergence is much slower, when there is convergence at all.

Optimization convergence, right? (This can get confusing because we have lots of other threads discussing simulation convergence and even gradient convergence.) If so, I wonder if you are actually injecting more noise into the gradient, which is why the optimizer converges so quickly (albeit to a worse local minimum/maximum). By setting your tolerance so low, you're effectively omitting that term from the broadening. Maybe a better question is: are you able to get deeper convergence with this tolerance? What kind of optimizer are you using?

To be fair, I'm not sure this is actually what's happening, but it'd be good to get a few more details behind the heuristics.

elinorbgr commented 2 months ago

Indeed, this is about optimization convergence, and I'm getting significantly deeper convergence by lowering the value of $\epsilon$.

The optimization I'm working on already has some explicit stochasticity (we're averaging over nuisance parameters), and I'm optimizing with adamax. The difference in optimization behaviour between the two settings of $\epsilon$ is very stark: I can barely optimize at all without lowering that $\epsilon$.

I'd need to check more deeply, but maybe in my problem the absolute magnitude of the eigenvalues is rather small, so the added $\epsilon$ is not a "small" term compared to the overall magnitude of the matrix?

If this is the problem, maybe it could be fixed by making $\epsilon$ relative to the magnitude of the eigenvalues? I'm thinking of something like $\epsilon_{ij} = 10^{-6} \max(|\lambda_i|^2, |\lambda_j|^2)$; I'll need to test that.
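In array form, that heuristic would be roughly (an untested sketch; eigenvalues stands for the 1-D array of eigenvalues):

import jax.numpy as jnp

# eps_ij = 1e-6 * max(|lambda_i|^2, |lambda_j|^2), built by broadcasting.
abs_sq = jnp.abs(eigenvalues) ** 2
eps_ij = 1e-6 * jnp.maximum(abs_sq[..., :, jnp.newaxis], abs_sq[..., jnp.newaxis, :])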

mfschubert commented 2 months ago

@elinorbgr This does seem to be an issue which may be specific to your problem. I just ran a small experiment, comparing different eps values in an optimization context (colab link).

It uses adamax optimization on the invrs-gym metagrating challenge. Here are the optimization trajectories for different eps values:

[plot: optimization trajectories for several eps values]

This doesn't definitively answer the question about which eps value is best, but it suggests that the default 1e-6 is reasonable for this problem.

Can you say more about (or plot) the optimization trajectory for your problem? Does the objective not improve at all? If so, there may be an opportunity to improve your loss function.

elinorbgr commented 1 month ago

I'm not sure how much I can share here, but that gives me a few angles for studying what happens. I'll come back when I have a clearer idea, thanks.

elinorbgr commented 1 month ago

After some close monitoring, it seems my earlier assumption was correct: in my specific problem, the eigenvalues appear to be pretty small, and the largest one has a magnitude of the order of $|\lambda|^2 \approx 10^{-5}$.

Relative to that, setting $\epsilon = 10^{-6}$ is not small at all, and introduces a significant bias in the computation of the gradient (overall, the computed magnitude of $1/(\lambda_i-\lambda_j)$ is underestimated by around 10%, or more for smaller eigenvalues).
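To make that figure concrete: for a difference with $|\lambda_i - \lambda_j|^2 \approx 10^{-5}$ and $\epsilon = 10^{-6}$, the broadened factor is scaled relative to the exact $1/(\lambda_i - \lambda_j)$ by

$$ \frac{|\lambda_i - \lambda_j|^2}{|\lambda_i - \lambda_j|^2 + \epsilon} \approx \frac{10^{-5}}{1.1 \times 10^{-5}} \approx 0.91, $$

i.e. an underestimate of roughly 10%, and proportionally worse for smaller differences.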

I think a simple correction would be to make $\epsilon$ relative to the scale of the starting matrix. I've run a quick test, and a heuristic like $\epsilon = 10^{-6} \max_i(|\lambda_i|^2)$ seems to behave well on my problem, but there may be something smarter to do here.

mfschubert commented 1 month ago

Thanks for this additional information. I will spend some time to make the fix.

elinorbgr commented 1 month ago

I've experimented a bit more with heuristics, and in my specific setup, setting a different value of $\epsilon$ per component of the matrix, $\epsilon_{ij} = 10^{-6} |\lambda_i| |\lambda_j|$, seems to work very well, possibly even better than what I got with $\epsilon = 10^{-20}$.

I've been testing that by changing the Lorentzian broadening to this:

import jax.numpy as jnp  # needed if this snippet is used outside fmmax.utils

# eigenvalues_i and eigenvalues_j are the broadcast eigenvalue arrays; eps is the base tolerance.
delta_eig = eigenvalues_i - eigenvalues_j
effective_eps = eps * jnp.abs(eigenvalues_i) * jnp.abs(eigenvalues_j)
f_broadened = delta_eig.conj() / (jnp.abs(delta_eig) ** 2 + effective_eps)

So, I'm not sure if that's the definitive best way to stabilize that, but from my understanding of what this broadening is meant to achieve, I guess it's probably not a bad starting point.

mfschubert commented 1 month ago

@elinorbgr Could you try the new scheme implemented in https://github.com/facebookresearch/fmmax/pull/125?

It uses an eps value that is the product of eps_relative and the maximum eigenvalue difference. By the way, scaling eps by the individual eigenvalue magnitudes would be problematic, since one can shift all eigenvalues by an arbitrary constant (changing their magnitudes) without changing the maximum difference.
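Roughly, the scheme has the following shape (a sketch of the described behavior only; eigenvalues and eps_relative are placeholder names, see the PR for the actual implementation):

import jax.numpy as jnp

# eps scales with the largest pairwise eigenvalue difference, so it is invariant
# to a uniform shift of all eigenvalues.
delta = eigenvalues[..., :, jnp.newaxis] - eigenvalues[..., jnp.newaxis, :]
eps = eps_relative * jnp.amax(jnp.abs(delta), axis=(-2, -1), keepdims=True)
f_broadened = jnp.conj(delta) / (jnp.abs(delta) ** 2 + eps)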

I also introduced some new tests for gradient accuracy. If you have a different scheme in mind, you could implement it and run the tests, just to ensure they will pass.

mfschubert commented 1 month ago

Closed by #125