NanoComp / meep

free finite-difference time-domain (FDTD) software for electromagnetic simulations
GNU General Public License v2.0

Optimization failure with adjoint simulations #1471

Open yaugenst opened 3 years ago

yaugenst commented 3 years ago

Dear Meep devs,

I am playing around with the adjoint routines and have run into some issues. I am not sure whether I am doing something wrong or if these are bugs, so here goes my little report.

The optimization setup is as follows: a single waveguide going from left to right with a design region in the middle, an eigenmode source on the left, and an eigenmode monitor on the right. Coupling from the first-order mode to the first-order mode should give a (more or less) straight waveguide. I am not doing any material interpolation/penalization, just to keep things as simple as possible. All gradients used to update the material come straight from meep.adjoint. Everything is running in a fresh conda environment: conda create -n meep_latest -c conda-forge pymeep=1.17.1 nlopt python=3.7.

Here is the code in question:

#!/usr/bin/env python3

import autograd.numpy as anp
import matplotlib.pyplot as plt
import meep as mp
import meep.adjoint as mpa
import nlopt
import numpy as np

res = 20
lcen = 1.5
dpml = 1.0
cell = mp.Vector3(8.0, 6.0)
n1 = mp.Medium(index=1.0)
n2 = mp.Medium(index=1.5)
x_design = 4.0
y_design = 3.0
d_wvg = 2.0
input_mode = 1
output_mode = 1
shape = (int(res * x_design), int(res * y_design))

design_variables = mp.MaterialGrid(
    mp.Vector3(res * x_design, res * y_design), n1, n2, grid_type="U_SUM"
)

design_region = mpa.DesignRegion(
    design_variables,
    volume=mp.Volume(
        center=mp.Vector3(0, 0),
        size=mp.Vector3(x_design, y_design),
    ),
)

geometry = [
    mp.Block(
        center=mp.Vector3(0, 0),
        size=mp.Vector3(cell.x, d_wvg),
        material=n2,
    ),
    mp.Block(
        center=design_region.center,
        size=design_region.size,
        material=design_variables,
    ),
]

source = [
    mp.EigenModeSource(
        mp.GaussianSource(wavelength=lcen, width=0.2),
        center=mp.Vector3(-cell.x / 2 + dpml, 0),
        size=mp.Vector3(0, cell.y),
        direction=mp.NO_DIRECTION,
        eig_kpoint=mp.Vector3(1, 0, 0),
        eig_parity=mp.EVEN_Y + mp.ODD_Z,
        eig_band=input_mode,
    )
]

sim = mp.Simulation(
    cell_size=cell,
    boundary_layers=[mp.PML(dpml)],
    sources=source,
    geometry=geometry,
    resolution=res,
    eps_averaging=False,
)

obj_args = [
    mpa.EigenmodeCoefficient(
        sim,
        mp.Volume(center=mp.Vector3(cell.x / 2 - dpml, 0), size=mp.Vector3(0, cell.y)),
        mode=output_mode,
    ),
]

def meep_objective(*alpha):
    p = anp.abs(alpha) ** 2
    return p[0]

op = mpa.OptimizationProblem(
    simulation=sim,
    objective_functions=[meep_objective],
    objective_arguments=obj_args,
    design_regions=[design_region],
    frequencies=[1 / lcen],
    decay_by=1e-4,
    decay_fields=[mp.Ez],
)

fig, ax = plt.subplots(1, 2, figsize=(9, 3))
plt.ion()
plt.show()
fom = []
x0 = np.ones(design_variables.num_params) / 2

def nlopt_obj(x, gd):
    v, g = op([x])
    if gd.size > 0:
        gd[:] = g

    fom.append(v)
    if mp.am_really_master():
        ax[0].cla()
        op.plot2D(True, ax=ax[0])
        ax[1].cla()
        ax[1].plot(fom)
        plt.tight_layout()
        plt.pause(0.01)

    return v

opt = nlopt.opt(nlopt.LD_LBFGS, x0.size)
opt.set_max_objective(nlopt_obj)
opt.set_lower_bounds(0)
opt.set_upper_bounds(1)
opt.set_maxeval(10)
opt.optimize(x0)

Running this code as-is gives an optimization that looks as follows,

[plot: bfgs_grad_sign]

where we can clearly see that the optimization does not work. Also, nlopt fails after a few more iterations. This is caused by the sign of the gradient, i.e. setting gd[:] = -g in nlopt_obj() fixes the issue. The optimization then looks like this:

[plot: bfgs_16_its]

The optimization seems to work fine initially, and there is some change in the material; however, it then just flatlines after a few more iterations (and nlopt fails). Using MMA, we get a similar result, just without the nlopt failure:

[plot: mma_20_its]

Initially I thought that this might be due to the optimization hitting its bounds, however this seems unlikely looking at the "optimized" material distribution.

I have tried this with different resolutions, standard topology optimization parametrizations, and also different geometries, and it seems to me like this behavior can be reproduced consistently.

My question is now: Am I using meep.adjoint incorrectly, or is there a more fundamental issue?

Thanks for your time!

yaugenst commented 3 years ago

A small update - doing naive gradient descent works much better, surprisingly. I replaced the optimization with the following:

for _ in range(100):
    v, g = op([x0])
    x0 -= g / np.linalg.norm(g)
    x0[x0 > 1] = 1  # enforce constraints
    x0[x0 < 0] = 0

This produces pretty much the expected result:

[plot: gd_100_its]

I am not sure what to make of this. To me it seems that there is something going on with the gradients that breaks BFGS, maybe something weird with the gradient magnitudes between iterations that breaks the Hessian approximation...
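For what it's worth, the projected normalized-gradient scheme above can be reproduced on a toy bounded problem (toy quadratic objective, no Meep involved; the toy uses ascent with the analytic gradient, whereas the snippet above subtracts g because of the sign flip reported earlier in this thread):

```python
import numpy as np

def toy_fom(x):
    # Toy concave objective with its maximum at x = 0.75, inside the [0, 1] box
    return -np.sum((x - 0.75) ** 2)

def toy_grad(x):
    return -2.0 * (x - 0.75)

x = np.full(10, 0.5)  # same uniform 0.5 initialization as the script above
for _ in range(200):
    g = toy_grad(x)
    n = np.linalg.norm(g)
    if n < 1e-12:
        break
    # Fixed-length step along the normalized gradient direction
    x += 0.01 * g / n
    x = np.clip(x, 0.0, 1.0)  # project back onto the box constraints

print(np.allclose(x, 0.75, atol=1e-2))  # → True
```

Because every step has the same length regardless of the gradient's magnitude, this scheme is insensitive to poor gradient scaling, which may be part of why it behaves better here than a curvature-based method like BFGS.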

ianwilliamson commented 3 years ago

Just a drive-by comment here...

Coupling from the first to the first order mode should give a (more or less) straight waveguide.

In general, I don't know if this would be my expectation, especially if you're optimizing at only a single frequency... although you managed to produce something like that in your follow-up result.

I'd say it can be hard to reason about the magnitudes of the coefficients that you're optimizing (and by extension where the optimizer is getting stuck). You might consider optimizing a normalized quantity so that your figure of merit corresponds to something like % of power transmission. Taking a look at the field distribution can also be useful for physically debugging these situations. Sometimes things like this happen because you accidentally have a source or a monitor oriented in the wrong direction (e.g. + vs - direction along an axis).
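To make the normalization concrete: one could add a second EigenmodeCoefficient monitor just behind the source and divide output power by input power, so the figure of merit reads as fractional transmission. A minimal sketch of such a two-argument objective (illustrative only; plain numpy stands in here for the autograd.numpy that a real meep.adjoint objective would need):

```python
import numpy as np  # a real meep.adjoint objective would use autograd.numpy

def normalized_objective(alpha_out, alpha_in):
    # Fractional power transmission: bounded near [0, 1], so optimizer
    # progress (and stalls) are much easier to interpret
    return np.abs(alpha_out[0]) ** 2 / np.abs(alpha_in[0]) ** 2

# Dummy eigenmode coefficients standing in for Meep's monitor output
a_in = np.array([1.0 + 0.0j])
a_out = np.array([0.6 + 0.8j])  # |0.6 + 0.8j|^2 = 1.0
print(normalized_objective(a_out, a_in))  # → 1.0
```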

yaugenst commented 3 years ago

Thanks for your comment! You are of course correct that normalized quantities are better in general. In any case, I think I found the mistake, and it seems to be the classic FDTD screwup... bad pulse width. Changing the pulse from width=0.2 to width=0.3 (i.e. roughly 0.2 * lcen) fixes the issue. I don't fully understand why the pulse width would screw with the gradients so much, as we are only checking the center frequency here and the convergence criterion is the same...

smartalecH commented 3 years ago

Thanks for taking the time to post this. We've seen this before, but haven't had the time (or immediate need) to debug.

The naive first step to debug this would be to check that the "narrowband" simulation's gradient matches the finite difference approximation (just like we do in the tutorial). Even if the narrowband gradient matches, however, that doesn't mean it's the correct gradient! The finite difference itself might be estimating the gradient for an inaccurate calculation of your cost function.
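The FD check itself is generic. A minimal central-difference version, with an analytic gradient of a toy function standing in for both the adjoint gradient and the forward run:

```python
import numpy as np

def fom(x):
    # Toy stand-in for the forward simulation's figure of merit
    return np.sum(np.sin(x) ** 2)

def adjoint_grad(x):
    # Analytic gradient, playing the role of meep.adjoint's output
    return 2.0 * np.sin(x) * np.cos(x)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=8)
g = adjoint_grad(x)

# Central finite differences, perturbing one design variable at a time
h = 1e-5
g_fd = np.empty_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = h
    g_fd[i] = (fom(x + e) - fom(x - e)) / (2.0 * h)

rel_err = np.linalg.norm(g - g_fd) / np.linalg.norm(g)
print(rel_err < 1e-8)  # → True
```

In the real check, fom would be a fully converged forward run, which is exactly why an unconverged simulation can pass this test while still computing the wrong physics.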

For example, if you don't run until the Fourier transformed fields (reasonably) converge, then the resultant eigenmode coefficients will be (physically) incorrect but the adjoint gradient can still match the FD gradient in this case (assuming they converged to the same relative value).

In other words, the big problem here might be due to field convergence. From your code above, it seems that you use the same convergence threshold (1e-4) for both experiments. But our current method to check convergence isn't necessarily foolproof.

Consequently, it's better to first check whether the narrowband cost function value matches the broadband cost function (obviously looking only at the center frequency). Then, do a convergence check with smaller tolerances. If everything looks good, it's finally worth doing an FD gradient comparison at this point. After all of that, we should have enough to figure out what's going on (I hope)!

In summary, I think the issue is inconsistent field convergence, which aligns with what you are seeing -- the smaller-bandwidth simulations run for a longer period of time. In theory, the convergence routine shouldn't care, as it just measures the relative change in the FT fields after the specified time elapses, but there may be a bug here.
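The field-convergence point is easy to illustrate outside of FDTD: if the running Fourier transform stops while the pulse is still ringing, the amplitude extracted at the center frequency is biased. A toy sketch (pure numpy; the pulse parameters loosely mirror the script above):

```python
import numpy as np

fcen = 1.0 / 1.5   # center frequency (1 / lcen)
width = 0.2        # Gaussian pulse width, as in the original source
dt = 0.01
t = np.arange(0.0, 40.0, dt)
envelope = np.exp(-(((t - 5.0) / (2.0 * width)) ** 2))
signal = envelope * np.cos(2.0 * np.pi * fcen * t)

def dft_amp(sig, tt):
    # Accumulated DFT at the center frequency, as an FT monitor would build it
    return np.abs(np.sum(sig * np.exp(-2j * np.pi * fcen * tt)) * dt)

full = dft_amp(signal, t)                     # pulse fully decayed
early = dft_amp(signal[t < 5.0], t[t < 5.0])  # stopped at the pulse peak
print(early / full)  # substantially less than 1: the extracted amplitude is biased
```

Stopping at the pulse peak is an extreme case, but the same bias appears in milder form whenever the decay_by threshold triggers before the DFT fields have really settled.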

I'm going to reopen since we've actually seen this ourselves (and it's not just user error).