NanoComp / meep

free finite-difference time-domain (FDTD) software for electromagnetic simulations
GNU General Public License v2.0

Different results in different runs when divide_parallel_processes is used #1724

Closed mawc2019 closed 2 years ago

mawc2019 commented 3 years ago

When using the Near2FarFields or FourierFields adjoint solver with divide_parallel_processes, I get different results in different runs. The differences may become obvious after a number of optimization steps.

smartalecH commented 3 years ago

Are you using your version of the DFT fields that uses the same IndexSrc as the FarFields adjoint source? I'm guessing the issue lies there, since I've never seen this with Eigenmode adjoint sources (and I run with multiple groups quite regularly).

mawc2019 commented 3 years ago

I was using two different versions of the FourierFields adjoint solver. One of them is the version in the master branch that does not use the IndexedSource, and the other is my modified version that uses IndexedSource. This problem occurs in both versions.

smartalecH commented 3 years ago

One of them is the version in the master branch that does not use the IndexedSource

Hmm that's odd. Maybe try two things:

First, swap out the FourierFields quantity for an Eigenmode quantity. If you still get a discrepancy, then it's probably not related to the adjoint solver itself.

Also, can you post an MWE? (I think you posted some code on a different issue with a similar problem.) It would be interesting to see if you are doing any rank-dependent processing (e.g., with am_master()).
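(By rank-dependent processing I mean a hypothetical pattern like the one below, where state is set only on a master rank and never broadcast, so different processes or groups end up simulating slightly different things:)

import meep as mp
import numpy as np

x = np.zeros(10)
if mp.am_master():              # this branch does not run on every rank
    x[:] = np.random.rand(10)   # ranks that skip it never see the update
# without an explicit broadcast, the ranks (or subgroups) now disagree
# about x, and any geometry or source built from it will differ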

smartalecH commented 3 years ago

I was using two different versions of the Near2FarFields adjoint solver. One of them is the version in the master branch that does not use the IndexedSource

Just reread this. Are you sure you are using the master branch for Near2Far adjoints? The master branch does use the IndexedSource. Did you mean to say you have a custom branch that somehow does not use an IndexedSource, and it still fails?

mawc2019 commented 3 years ago

I am sorry, that was a typo: I meant FourierFields. It is corrected now. Although I mentioned this problem in my reply in #1676, the issue there seems to be unrelated.

mawc2019 commented 3 years ago

Hi @smartalecH , this problem also occurs in EigenmodeCoefficient. A small example is as follows. (I have revised my previous posts in this issue in case they are misleading.)

import meep as mp
import meep.adjoint as mpa
import numpy as np
from autograd import numpy as npa

Si = mp.Medium(index=3.48)
SiO2 = mp.Medium(index=1.45)
Air = mp.Medium(index=1)

num_groups = 4
n = mp.divide_parallel_processes(num_groups)

resolution,design_resolution = 20,20
design_region_width,design_region_height = 2.0,0.2
Lx,Ly = design_region_width+0.0,4
dpml = 1.5

monitor_height,monitor_width = design_region_height/2,Lx
pml_layers = [mp.PML(dpml, direction=mp.Y)]
cell_size = mp.Vector3(Lx,Ly+2*dpml)

fcen,fwidth = 1/1.55,0.01/1.55
k = mp.Vector3(fcen*1.45).rotate(mp.Vector3(z=1), np.radians(100))
source_center,source_size  = mp.Vector3(0,-Ly/4),mp.Vector3(Lx,0)
src = mp.GaussianSource(frequency=fcen,fwidth=fwidth)
source = [mp.EigenModeSource(src,eig_band = 1,direction=mp.NO_DIRECTION,eig_kpoint=k,size = source_size,center=source_center)]

Nx,Ny = int(round(design_resolution*design_region_width)),1

design_variables = mp.MaterialGrid(mp.Vector3(Nx,Ny),Air,Si,grid_type='U_MEAN')
design_region = mpa.DesignRegion(design_variables,volume=mp.Volume(center=mp.Vector3(0,design_region_height/2),size=mp.Vector3(design_region_width,design_region_height)))

geometry = [mp.Block(center=mp.Vector3(0,-Ly/4-dpml/2), material=SiO2, size=mp.Vector3(Lx,Ly/2+dpml, 0)), # substrate
    mp.Block(center=design_region.center, size=design_region.size, material=design_variables)] # design region

sim = mp.Simulation(
    cell_size=cell_size,boundary_layers=pml_layers,geometry=geometry,sources=source,k_point=k,eps_averaging=False,resolution=resolution)
TE0 = mpa.EigenmodeCoefficient(sim,mp.Volume(center=mp.Vector3(0,monitor_height),size=mp.Vector3(monitor_width,0)),mode=1)
ob_list = [TE0]

def J(alpha):
    return npa.abs(alpha)**2

opt = mpa.OptimizationProblem(
    simulation = sim,objective_functions = J,objective_arguments = ob_list,design_regions = [design_region],
    fcen=fcen,df = 0,nf = 1,decay_fields=[mp.Ez],maximum_run_time = 2000)

x = 0.5*np.ones(Nx*Ny)
f_single, dJ_single = opt([x])
f_multiple = mp.merge_subgroup_data(f_single)
dJ_multiple = mp.merge_subgroup_data(dJ_single)
np.savetxt("emc_value_0819.out",f_multiple) # save the function values for comparison
np.savetxt("emc_grad_0819.out",dJ_multiple) # save the gradients for comparison

When I run the code multiple times, the function values f_multiple do not always stay the same from run to run. In addition, all entries of f_multiple should be identical, but sometimes they are not. Likewise, the gradient dJ_multiple does not always stay the same from run to run, and all of its columns should be identical, but sometimes they are not.
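A quick consistency check, appended to the script above as a sketch, makes the disagreement easy to spot:

# every entry of f_multiple and every column of dJ_multiple should be
# identical across the subgroups
print("f_multiple consistent:", np.allclose(f_multiple, f_multiple[0]))
print("dJ_multiple consistent:", np.allclose(dJ_multiple, dJ_multiple[:, [0]]))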

mawc2019 commented 3 years ago

The indeterminacy is not caused by the adjoint solver, because it also happens in a simulation without the adjoint solver. An example similar to the previous one is as follows.

import meep as mp
import numpy as np

num_groups = 4
n = mp.divide_parallel_processes(num_groups)

resolution = 20
Lx,Ly = 2,4
dpml = 1.5
monitor_height,monitor_width = 0.1,Lx
pml_layers = [mp.PML(dpml, direction=mp.Y)]
cell_size = mp.Vector3(Lx,Ly+2*dpml)

fcen,fwidth = 1/1.55,0.01/1.55
theta_in = np.radians([90,100,110,120])
k = mp.Vector3(fcen*1.45).rotate(mp.Vector3(z=1), theta_in[n])

source_center,source_size  = mp.Vector3(0,-Ly/4),mp.Vector3(Lx,0)
src = mp.GaussianSource(frequency=fcen,fwidth=fwidth)
source = [mp.EigenModeSource(src,eig_band = 1,direction=mp.NO_DIRECTION,eig_kpoint=k,size = source_size,center=source_center)]

geometry = [mp.Block(center=mp.Vector3(), material=mp.Medium(index=1), size=mp.Vector3(Lx,Ly+2*dpml, 0))]

sim = mp.Simulation(
    cell_size=cell_size,boundary_layers=pml_layers,geometry=geometry,sources=source,k_point=k,eps_averaging=False,resolution=resolution)

mode_monitor = sim.add_mode_monitor(fcen,0,1,mp.ModeRegion(center=mp.Vector3(0,monitor_height),size=mp.Vector3(monitor_width,0)),yee_grid=True)
sim.run(until_after_sources=100)

emc = sim.get_eigenmode_coefficients(mode_monitor, [1], direction=mp.AUTOMATIC)[0]
emc_abs = np.abs(mp.merge_subgroup_data(emc)).flatten()
np.savetxt("emc_test_0820.out",emc_abs)
smartalecH commented 3 years ago

Looks like your mode source has a different k_point on each processor group, right?

k = mp.Vector3(fcen*1.45).rotate(mp.Vector3(z=1), theta_in[n])

That will launch a different mode source in each sim, which should explain the discrepancy.
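(As a reproducibility check, one could instead give every group the same angle, e.g.:)

# sketch: use one angle for all groups so each subgroup runs an identical
# simulation and the merged results should then agree exactly
k = mp.Vector3(fcen*1.45).rotate(mp.Vector3(z=1), theta_in[0])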

smartalecH commented 3 years ago

But if I run your adjoint code enough times, I do start to see some discrepancies in f_multiple among the processor groups, although it's pretty small:

[98319.49482893 98319.49676238 98319.49676238 98319.49676238]

Interestingly, it's the am_really_master group that is slightly different. Not sure where the discrepancy could be coming from (especially since the first few times I run it, the results are identical -- perhaps there's some weird caching going on under the hood that is not initially broadcasted to all the groups?)

cc @stevengj

mawc2019 commented 3 years ago

Looks like your mode source has a different k_point on each processor group, right?

Yes, this is the case in the second example, and the results for different k_point values are not expected to be the same. I also tried using the same k_point for every parallel process, as in the first example, but I did not see varying results across multiple runs. (The difference might appear if I ran the code more times.) That is why I used a different k_point for each parallel process in the second example.

smartalecH commented 3 years ago

Yes, this is the case in the second example, and the results for different k_point values are not expected to be the same. I also tried using the same k_point for every parallel process, as in the first example, but I did not see varying results across multiple runs. (The difference might appear if I ran the code more times.) That is why I used a different k_point for each parallel process in the second example.

I'm a bit confused. Doesn't this mean that the divide_parallel_processes() function is correctly working, since you were unable to see a discrepancy outside of the adjoint solver? In other words, the discrepancy really is inside the adjoint solver itself?

mawc2019 commented 3 years ago

Doesn't this mean that the divide_parallel_processes() function is correctly working,

Up to now, I have only seen the indeterminacy when divide_parallel_processes() is used. I tried to reproduce the indeterminacy without divide_parallel_processes(), but it did not occur. Even so, that does not necessarily mean the indeterminacy is rooted in divide_parallel_processes().

since you were unable to see a discrepancy outside of the adjoint solver?

The second example shows that the indeterminacy also happens when the adjoint solver is not used.

In other words, the discrepancy really is inside the adjoint solver itself?

As the second example shows, the indeterminacy also happens when the adjoint solver is not used, so I think this issue is not caused by the adjoint solver. However, when optimization (using the adjoint solver) is performed, the effect of the tiny discrepancy may accumulate and become conspicuous. In other words, optimization with the adjoint solver may be more susceptible to the indeterminacy than a conventional simulation.

stevengj commented 3 years ago

I just noticed that the MPB eigensolver can re-initialize with pseudo-random numbers under some circumstances: https://github.com/NanoComp/mpb/blob/af31eb16e26e3e213b49c89bf68cd512fb29e18c/src/matrices/eigensolver.c#L342-L343

This could cause slight variations between runs. Maybe we could try replacing that with something deterministic? (This affects both eigenmode sources and eigenmode coefficients, but won't affect you if you aren't using eigenmodes.)

mawc2019 commented 3 years ago

I replaced the pseudo-random numbers with deterministic values, but the problem of indeterminacy remains. Moreover, the indeterminacy also happens when the Near2FarFields or FourierFields adjoint solvers are used without eigenmode sources.

mawc2019 commented 3 years ago

In the previous examples, the indeterminacy only caused small variations among different runs, but large variations may also appear. In the example below, the adjoint gradient g_adjoint can vary from normal values to enormous numbers or nan. Those abnormal values of g_adjoint might share the same origin as #1676. In addition, in the previous examples the indeterminacy was observed only when divide_parallel_processes was used, but at least in the current version it can also happen without divide_parallel_processes, as the example below shows.

import meep as mp
import meep.adjoint as mpa
import numpy as np
from autograd import numpy as npa
Si, SiO2 = mp.Medium(index=3.4), mp.Medium(index=1.44)

resolution,design_resolution = 20,20
Lx,Ly,pml_size = 2,4,1
cell_size = mp.Vector3(Lx,Ly+2*pml_size)
pml_layers = [mp.PML(pml_size, direction=mp.Y)]
design_width,design_height = 2,0.4

fcen,fwidth = 1/1.55,0.01/1.55
source_center,source_size = mp.Vector3(0,-Ly/4),mp.Vector3(Lx,0)

src = mp.GaussianSource(frequency=fcen,fwidth=fwidth)
source = [mp.Source(src,component=mp.Ez,center=source_center,size=source_size)]
Nx,Ny = int(round(design_resolution*design_width)),int(round(design_resolution*design_height))

design_variables = mp.MaterialGrid(mp.Vector3(Nx,Ny),SiO2,Si,grid_type='U_MEAN')
design_region = mpa.DesignRegion(design_variables,volume=mp.Volume(center=mp.Vector3(), size=mp.Vector3(design_width, design_height)))

geometry = [mp.Block(center=design_region.center, size=design_region.size, material=design_variables)] # design region
sim = mp.Simulation(
    cell_size=cell_size,boundary_layers=pml_layers,geometry=geometry,
    sources=source,default_material=SiO2,resolution=resolution)

far_pt = [mp.Vector3(0,20,0)]
NearRegions = [mp.Near2FarRegion(center=mp.Vector3(0,design_height+0.2), size=mp.Vector3(Lx,0), weight=+1)]
FarFields = mpa.Near2FarFields(sim, NearRegions,far_pt)
ob_list = [FarFields]

def J(alpha):
    return npa.abs(alpha[0,0,2])**2

opt = mpa.OptimizationProblem(
    simulation = sim,objective_functions = J,objective_arguments = ob_list,
    design_regions = [design_region],fcen=fcen,df = 0,nf = 1,decay_by = 1e-5)

x0 = 0.5*np.ones(Nx*Ny)
opt.update_design([x0])
f0, g_adjoint = opt()

np.random.seed(1905)
g_discrete, idx = opt.calculate_fd_gradient(num_gradients=5,db=1e-4)
g_discrete = np.array(g_discrete).flatten()

print("Randomly selected indices: ",idx)
print("Finite-difference gradient: ",g_discrete)
print("Adjoint gradient: ",g_adjoint[idx])
smartalecH commented 3 years ago

It seems like the next step is to carefully trace where in the gradient calculation the results start to diverge between groups. Since you have a test case where the results are so different, the process should be a lot simpler (or at least our error should be more obvious than before). Here are the key places in the code I would check:

  1. The forward run (here, here, and here)
  2. The calculation of the scaling for the adjoint sources (here)
  3. The adjoint fields themselves (here)
  4. The recombination step (this one is really ugly, so let's hope we find the issue above).

My guess is that it's independent of the adjoint solver, so hopefully we track the issue down in step 1. But if not, we have a way to carefully see where things are going wrong.
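One way to do that tracing (a rough sketch, assuming a divide_parallel_processes setup like the first example; group_checksum is just a hypothetical helper) is to collapse each intermediate array to a single number per group and merge the results, so the first step where the groups disagree becomes visible:

import numpy as np
import meep as mp

def group_checksum(arr, label=""):
    # collapse an intermediate array to one number per process group
    s = np.array([np.sum(np.abs(arr))])
    merged = mp.merge_subgroup_data(s)   # shape (1, num_groups)
    print(label, merged)                 # all entries should be identical
    return merged

Calling it on, e.g., the forward monitor values, the adjoint-source scale factors, and the final gradient should show at which of the steps above the groups first diverge.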

mawc2019 commented 3 years ago

In this example, the forward run does not produce abnormal values, and the results of the forward run do not change even when the results of the adjoint run change violently. I checked the scaling for the adjoint sources by printing the results here; those results do not change, either. The problem appears in the adjoint fields: I saw the indeterminacy and abnormal values when printing the results of get_dft_array here.

My guess is that it's independent of the adjoint solver

Probably, as shown in the example on August 20th.

smartalecH commented 3 years ago

Hmm so the adjoint scaling is consistent, but the fields that are recorded are different.

So now we need to see if the adjoint sources themselves are identical between groups, or if there is a bug in the dft monitors between groups.

As a quick test, maybe take a snapshot of the adjoint simulation (with the fields plotted) on each group a few seconds after the simulation starts. Hopefully, we see a big difference. plot2D() will plot the fields/simulation domain for each group:

import matplotlib.pyplot as plt

opt.prepare_forward_run()
opt.forward_run()
opt.prepare_adjoint_run()
opt.sim.reset_meep()
opt.sim.change_sources(opt.adjoint_sources[0])
opt.sim.run(until=5)
if mp.am_master():
    plt.figure()
    opt.sim.plot2D(fields=mp.Hz)
    # my_group is the group index returned by divide_parallel_processes()
    plt.savefig("fields_{}.png".format(my_group))
plt.show()

mawc2019 commented 3 years ago

Before plotting figures, I changed fields=mp.Hz to fields=mp.Ez in order to be consistent with our setup in the above example. When running opt.sim.plot2D(fields=mp.Ez), I got

  File "/home/gridsan/mawc/install/lib/python3.7/site-packages/meep/simulation.py", line 4151, in plot2D
    **kwargs)
  File "/home/gridsan/mawc/install/lib/python3.7/site-packages/meep/visualization.py", line 564, in plot2D
    ax = plot_sources(sim,ax,output_plane=output_plane,labels=labels,source_parameters=source_parameters)
  File "/home/gridsan/mawc/install/lib/python3.7/site-packages/meep/visualization.py", line 442, in plot_sources
    vol = Volume(center=src.center,size=src.size)
AttributeError: 'IndexedSource' object has no attribute 'center'

So I disabled the call to plot_sources in visualization.py. Then I ran the code multiple times, but the figures did not change between runs. I also changed opt.sim.run(until=5) to opt.sim.run(until=500), and the situation remained the same.
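A possible alternative, if the installed meep version exposes plot2D's plot_sources_flag option, would be to skip source plotting without editing visualization.py:

opt.sim.plot2D(fields=mp.Ez, plot_sources_flag=False)  # skip plotting the sources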

smartalecH commented 3 years ago

Okay as a true check, we should pull the raw time-domain fields (at any arbitrary time) and make sure they are consistent across all groups.

But it seems that something is going on with the recording of the fields themselves (within the DFT monitors).
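A sketch of that check, continuing from the snippet above (the field component and sampling region are just examples):

# after running each group's simulation to the same time, grab a raw
# time-domain field snapshot and compare it across the process groups
opt.sim.run(until=5)
ez = opt.sim.get_array(component=mp.Ez, center=mp.Vector3(), size=cell_size)
merged = mp.merge_subgroup_data(ez)
# if the groups are consistent, every slice along the last axis is identical
print(np.max(np.abs(merged - merged[..., [0]])))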

smartalecH commented 3 years ago

Any updates on this?

mawc2019 commented 2 years ago

I have not seen this problem again when running this example on the current master branch.

In the previous examples, the indeterminacy only caused small variations among different runs, but large variations may also appear. In the example below, the adjoint gradient g_adjoint can vary from normal values to enormous numbers or nan. Those abnormal values of g_adjoint might share the same origin as #1676. In addition, in the previous examples the indeterminacy was observed only when divide_parallel_processes was used, but at least in the current version it can also happen without divide_parallel_processes, as the example below shows.
