flaport / inverse_design

https://flaport.github.io/inverse_design
Apache License 2.0

Convergence behavior sub-par #19

Open jan-david-fischbach opened 1 year ago

jan-david-fischbach commented 1 year ago

Hey @ianwilliamson, @Dj1312, sorry to bother you. We do not seem to be able to match the incredible pace at which the optimizations converge in the paper. Possible problems under investigation include:

The latent design is randomly initialized with a bias so that the first feasible design is fully solid

Running the optimization loop, including the loss from above, for the mode converter but ignoring the generator:

[Screenshot 2023-02-10 at 20 06 11]

When comparing that to the paper:

[Screenshot 2023-02-10 at 20 08 18]

I have a hard time believing that the loss decreases more strongly with the fabrication constraints included than without them.

_Originally posted by @Jan-David-Black in https://github.com/flaport/inverse_design/issues/18#issuecomment-1426227307_

The settings for the Adam optimizer are as in the paper: `adam(0.01, b1=0.667, b2=0.9)`.
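
For reference, this is roughly how those settings look when expressed with optax (an assumption on my side; the repo may construct its Adam optimizer differently):

```python
import jax.numpy as jnp
import optax

# Adam hyperparameters as quoted from the paper: step size 0.01, b1=0.667, b2=0.9.
optimizer = optax.adam(learning_rate=0.01, b1=0.667, b2=0.9)

latent = jnp.zeros((100, 100))      # placeholder latent design; the shape is arbitrary here
opt_state = optimizer.init(latent)
```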

Ideas for further improvement:

Am I missing something? Regards JD

lucasgrjn commented 1 year ago

(I am also pinging @mfschubert, he may have some information we are missing!)

Your message is very thorough! I think it sums up all the problems we have encountered when trying to reproduce the results... Let me list some thoughts from my various tests.

The loss seems to be the correct one. The calculated gradient matches the evolution of the shape (step 2 of Fig. 5). But it seems the generator does not reproduce the design so well. Since the evolution is well documented (the Adam optimizer and its parameters are given), the gradient update should not be the problem.

Even though the losses without the generator differ by some factor, the shapes seem to be similar to those obtained with the generator. Hence, I don't think this is the most important point right now.

In none of my tests was I able to obtain an almost monotonic decrease of the loss (like the noisy one you showed). The best runs I found quickly reach a good solution (though not the optimum reported in the article), but then either saturate and stall, or blow up a few steps later.

Even with a finer mesh (10 nm) I still have convergence issues, so I think it is reasonable to put this hypothesis aside.

To conclude, I think our problem lies inside the conditional generator. Your optimized algorithm, @Jan-David-Black, and the Rust implementation by @flaport seem to be robust enough. But there must be a problem with the reward / the way we handle the reward... Maybe there are some subtleties of the paper that we missed in the implementation?

jan-david-fischbach commented 1 year ago

One more thought: what is the notion of centering a brush on a pixel for a brush with an even pixel width (e.g. 10, as used in the paper)?
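
For concreteness, a small sketch (plain NumPy, not the brush code from this repo) of where the ambiguity comes from:

```python
import numpy as np

def circular_brush(diameter: int) -> np.ndarray:
    """Boolean footprint of a circular brush on the pixel grid."""
    # Geometric center of the footprint in pixel coordinates. For an odd
    # diameter this is an integer (an actual pixel); for an even diameter it
    # falls between two pixels, so "centering the brush on a pixel" requires
    # picking one of the two neighbouring pixels as the anchor.
    c = (diameter - 1) / 2.0
    y, x = np.mgrid[:diameter, :diameter]
    return (x - c) ** 2 + (y - c) ** 2 <= (diameter / 2.0) ** 2

brush11 = circular_brush(11)  # center is exactly pixel 5
brush10 = circular_brush(10)  # center lies between pixels 4 and 5
```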

jan-david-fischbach commented 1 year ago

Ah and another one: does an iteration involve a single pair of (forward and adjoint) simulations, or are multiple adjoint simulations counted as a single iteration because the geometry does not change in between?

lucasgrjn commented 1 year ago

> Ah and another one: does an iteration involve a single pair of (forward and adjoint) simulations, or are multiple adjoint simulations counted as a single iteration because the geometry does not change in between?

To my understanding, an iteration is defined as:

  1. Forward
  2. Adjoint (and so, grads)
  3. Update

And this is done each time. If the binarized epsilon_r does not change after the update, it may be possible to skip the simulation with some kind of cache, but I think that is a little over-engineered. After thinking about it, the adjoint simulation is what propagates the gradient, so I think it should be calculated at each iteration anyway!
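
A minimal sketch of that loop (toy stand-ins for the generator and the forward solve, and optax for the Adam update; none of this is the repo's actual API):

```python
import jax
import jax.numpy as jnp
import optax

def generator(latent):
    return jax.nn.sigmoid(latent)        # stand-in for the conditional generator

def simulate(design):
    return jnp.mean(design ** 2)         # stand-in for the forward solve + figure of merit

def loss_fn(latent):
    return simulate(generator(latent))

optimizer = optax.adam(0.01, b1=0.667, b2=0.9)
latent = jnp.zeros((40, 40))
opt_state = optimizer.init(latent)

for iteration in range(10):
    # 1. forward + 2. adjoint: value_and_grad plays the role of the adjoint simulation
    loss, grads = jax.value_and_grad(loss_fn)(latent)
    # 3. update the latent design with Adam
    updates, opt_state = optimizer.update(grads, opt_state)
    latent = optax.apply_updates(latent, updates)
```
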
jan-david-fischbach commented 1 year ago

> If the binarized epsilon_r does not change after the update, it may be possible to skip the simulation with some kind of cache, but I think that is a little over-engineered.

That actually happens automatically if you set `cache=True` on the `@jaxit` decorator. My question was: does it also count as an iteration if the epsilon distribution does not change?

jan-david-fischbach commented 1 year ago

Strangely, I get much better convergence with a step size much larger than the one specified in the paper; the convergence in the paper is still better, though. E.g. 0.4 -> -0.3 dB and -35 dB after ca. 100 iterations for the wg-bend...

lucasgrjn commented 1 year ago

I also noticed that modifying the Adam parameters changes the convergence. (I try not to modify them, to stick to the article.)

What about the convergence on your side? I can barely get down to a loss of 1e-2 and the loss oscillates. I do not obtain the kind of "convergence" shown by the red curve above...