flaport / inverse_design

https://flaport.github.io/inverse_design
Apache License 2.0

Convergence behavior sub-par #19

Open jan-david-fischbach opened 1 year ago

jan-david-fischbach commented 1 year ago

Hey @ianwilliamson, @Dj1312, sorry to bother you. We do not seem to be able to match the incredible pace at which the optimizations converge in the paper. Possible problems under investigation include:

The latent design is randomly initialized with a bias so that the first feasible design is fully solid

Running the optimization loop, including the loss from above, for the mode converter but ignoring the generator:

[Screenshot 2023-02-10 at 20 06 11]

When comparing that to the paper:

[Screenshot 2023-02-10 at 20 08 18]

I have a hard time believing that the loss decreases more strongly with the fabrication constraints included than without them.

_Originally posted by @Jan-David-Black in https://github.com/flaport/inverse_design/issues/18#issuecomment-1426227307_

The settings for the Adam optimizer are as in the paper: `adam(0.01, b1=0.667, b2=0.9)`.
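
For reference, this is roughly how those settings look when expressed with optax (an assumption on my side; the repo may construct its Adam optimizer differently):

```python
import jax.numpy as jnp
import optax

# Adam hyperparameters as quoted from the paper: step size 0.01, b1=0.667, b2=0.9.
optimizer = optax.adam(learning_rate=0.01, b1=0.667, b2=0.9)

latent = jnp.zeros((100, 100))      # placeholder latent design; the shape is arbitrary here
opt_state = optimizer.init(latent)
```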

Ideas for further improvement:

Am I missing something? Regards JD

lucasgrjn commented 1 year ago

(I am also pinging @mfschubert, he may have some information we are missing!)

Your message is very thorough! I think it sums up all the problems we have encountered when trying to reproduce the results... Let me list some thoughts from my various tests.

The loss seems to be the correct one. The calculated gradient matches the evolution of the shape (step 2 of Fig. 5). But it seems the generator does not reproduce the design so well. Since the evolution is well documented (the Adam optimizer and its parameters are given), the gradient update should not be the problem.

Even though the losses without the generator differ by some factor, the shapes seem to be similar to those obtained with the generator. Hence, I don't think this is the most important point right now.

In none of my tests was I able to obtain an almost monotonic decrease of the loss (like the noisy one you showed). The best runs I found quickly reach a good solution (though not the optimum reported in the article), but then either saturate and stall, or blow up a few steps later.

Even with a finer mesh (10 nm) I still have convergence issues, so I think it is reasonable to put this hypothesis aside.

To conclude, I think our problem lies inside the conditional generator. Your optimized algorithm, @Jan-David-Black, and the Rust implementation by @flaport seem to be robust enough. But there must be a problem with the reward / the way we handle the reward... Maybe there are some subtleties of the paper that we missed in the implementation?

jan-david-fischbach commented 1 year ago

One more thought: what is the notion of centering a brush on a pixel for a brush with an even pixel width (e.g. 10, as used in the paper)?
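
For concreteness, a small sketch (plain NumPy, not the brush code from this repo) of where the ambiguity comes from:

```python
import numpy as np

def circular_brush(diameter: int) -> np.ndarray:
    """Boolean footprint of a circular brush on the pixel grid."""
    # Geometric center of the footprint in pixel coordinates. For an odd
    # diameter this is an integer (an actual pixel); for an even diameter it
    # falls between two pixels, so "centering the brush on a pixel" requires
    # picking one of the two neighbouring pixels as the anchor.
    c = (diameter - 1) / 2.0
    y, x = np.mgrid[:diameter, :diameter]
    return (x - c) ** 2 + (y - c) ** 2 <= (diameter / 2.0) ** 2

brush11 = circular_brush(11)  # center is exactly pixel 5
brush10 = circular_brush(10)  # center lies between pixels 4 and 5
```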

jan-david-fischbach commented 1 year ago

Ah and another one: does an iteration involve a single pair of (forward and adjoint) simulations, or are multiple adjoint simulations counted as a single iteration because the geometry does not change in between?

lucasgrjn commented 1 year ago

> Ah and another one: does an iteration involve a single pair of (forward and adjoint) simulations, or are multiple adjoint simulations counted as a single iteration because the geometry does not change in between?

To my understanding, an iteration is defined as:

  1. Forward
  2. Adjoint (and so, grads)
  3. Update

And this is done each time. If the binarized epsilon_r does not change after the update, it may be possible to skip the simulation with some kind of cache, but I think that is a little over-engineered. After thinking about it, the adjoint simulation is what propagates the gradient, so I think it should be calculated at each iteration anyway!
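
A minimal sketch of that loop (toy stand-ins for the generator and the forward solve, and optax for the Adam update; none of this is the repo's actual API):

```python
import jax
import jax.numpy as jnp
import optax

def generator(latent):
    return jax.nn.sigmoid(latent)        # stand-in for the conditional generator

def simulate(design):
    return jnp.mean(design ** 2)         # stand-in for the forward solve + figure of merit

def loss_fn(latent):
    return simulate(generator(latent))

optimizer = optax.adam(0.01, b1=0.667, b2=0.9)
latent = jnp.zeros((40, 40))
opt_state = optimizer.init(latent)

for iteration in range(10):
    # 1. forward + 2. adjoint: value_and_grad plays the role of the adjoint simulation
    loss, grads = jax.value_and_grad(loss_fn)(latent)
    # 3. update the latent design with Adam
    updates, opt_state = optimizer.update(grads, opt_state)
    latent = optax.apply_updates(latent, updates)
```
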
jan-david-fischbach commented 1 year ago

> If the binarized epsilon_r does not change after the update, it may be possible to skip the simulation with some kind of cache, but I think that is a little over-engineered.

That actually happens automatically if you set `cache=True` on the `@jaxit` decorator. My question was: does it also count as an iteration if the epsilon distribution does not change?

jan-david-fischbach commented 1 year ago

Strangely, I get much better convergence with a step size much larger than the one specified in the paper; the convergence in the paper is still better, though. E.g. 0.4 -> -0.3 dB and -35 dB after ca. 100 iterations for the wg-bend...

lucasgrjn commented 1 year ago

I also noticed that modifying the Adam parameters changes the convergence. (I try not to modify them, to stick to the article.)

What about the convergence on your side? I can barely get down to a loss of 1e-2 and the loss oscillates. I do not obtain the kind of "convergence" shown by the red curve above...