flexcompute / tidy3d

Fast electromagnetic solver (FDTD) at scale.
https://www.flexcompute.com/tidy3d/solver/
GNU Lesser General Public License v2.1
165 stars · 40 forks

Continue a complete FDTD run #874

Open · tomflexcompute opened this issue 1 year ago

tomflexcompute commented 1 year ago

For simulations that require a long run_time (high-Q devices, very long devices, etc.), it can be hard to estimate a good run_time in advance. As a result, we sometimes run into the issue that at the end of the time stepping, field_decay is not as low as we would like it to be. In principle, we could provide a feature that allows a finished simulation to continue running so the fields can decay further.

tylerflex commented 1 year ago

in principle this is what the Simulation.shutoff parameter does. But maybe there could be some setting in Simulation.run_time, e.g. if Simulation.run_time=None, then we just run until shutoff is achieved? I think there are a couple of possible complications to consider to avoid infinite runs:

  1. we should auto shutoff if we detect divergence
  2. how do we set the frequency at which we check field decay? It could make sense to pick a default value based on the optical period corresponding to the sources' source times (e.g. every 20 optical periods?). In fact, that might be nice to have in general, because having to be careful about not setting run_time too long and overshooting the field decay is definitely a somewhat annoying aspect of running tidy3d simulations. A default, safe interval for checking the field decay would be nice to have.

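The default-interval idea in point 2 can be sketched in a few lines. This is not the tidy3d implementation, just an illustration of deriving a field-decay check interval from a source's central frequency; the function name and the "20 periods" default are assumptions taken from the suggestion above.

```python
def decay_check_interval(freq0_hz, periods_per_check=20):
    """Time between field-decay checks: N optical periods of the source.

    freq0_hz: central frequency of the source (Hz).
    periods_per_check: hypothetical default of 20 optical periods.
    """
    optical_period = 1.0 / freq0_hz
    return periods_per_check * optical_period

# e.g. a 200 THz (~1.5 um) source checked every 20 optical periods:
interval = decay_check_interval(200e12)
print(f"{interval:.2e} s")  # 1.00e-13 s
```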
momchil-flex commented 1 year ago

> in principle this is what the Simulation.shutoff parameter does.

The problem is that in some cases the run_time may not be long enough, while on the other hand it's not great for users (or for us) to set a run time that is way too long (several orders of magnitude larger than needed). From our perspective this makes resource allocation less efficient, and from the user's perspective, they get charged a minimum of 10% of the expected run time to account for this overhead on our side.

> But maybe there could be some setting in Simulation.run_time, e.g. if Simulation.run_time=None, then we just run until shutoff is achieved? I think there are a couple of possible complications to consider to avoid infinite runs:
>
>   1. we should auto shutoff if we detect divergence
>   2. how do we set the frequency at which we check field decay? It could make sense to pick a default value based on the optical period corresponding to the sources' source times (e.g. every 20 optical periods?). In fact, that might be nice to have in general, because having to be careful about not setting run_time too long and overshooting the field decay is definitely a somewhat annoying aspect of running tidy3d simulations. A default, safe interval for checking the field decay would be nice to have.

Re the frequency of checking: the field-decay reduction can add a noticeable slowdown if done too often, which is why we only do it every 4% of the run time. We could consider doing it every 2%, or even 1% (taking a bit of a hit on speed), but we may not want to do it more often than that... Some experimenting is needed, but your suggestion is definitely interesting.
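A toy calculation makes the trade-off concrete: spacing the checks by a percentage of run_time fixes the number of checks per run, so the spacing in time steps grows with the length of the simulation. The numbers below are illustrative assumptions, not measured tidy3d behavior.

```python
def checks_per_run(percent_interval):
    """Number of decay checks when checking every x% of run_time."""
    return int(100 / percent_interval)

def steps_between_checks(total_steps, percent_interval):
    """Spacing of checks in time steps; grows with the run length."""
    return total_steps * percent_interval / 100

# For a hypothetical 100,000-step run at 4%, 2%, and 1% intervals:
for pct in (4, 2, 1):
    print(pct, checks_per_run(pct), steps_between_checks(100_000, pct))
```

The same 1% setting gives 1,000-step spacing on a 100,000-step run but 10,000-step spacing on a 1,000,000-step run, which is the instability being discussed.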

Generally you are presenting an interesting alternative solution to allowing a simulation to continue, which is what the issue is originally about. I guess each has its pros and cons. The main problem with continuing a simulation is that it would require a lot of data to be stored and transferred to and from S3 (cost). So probably the only feasible way to have something like this is for the user to actively set a kwarg e.g. web.run(sim, allow_continue=True), otherwise most simulations, which will not need to be continued, will still be incurring extra costs to us and the user.
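The opt-in workflow floated above could look roughly like the sketch below. Neither `allow_continue` nor `continue_task` exists in the tidy3d web API today; the `FakeWebAPI` class is a stand-in used only to show the intended flow and why the checkpoint cost is opt-in.

```python
class FakeWebAPI:
    """Toy stand-in for a web API with hypothetical continuation support."""

    def __init__(self):
        self.checkpoints = {}  # task_id -> saved field state

    def run(self, sim, task_id, allow_continue=False):
        # Only store the (large) final field state when the user opts in,
        # so typical runs incur no extra storage/transfer cost.
        if allow_continue:
            self.checkpoints[task_id] = "field-state-snapshot"
        return "results"

    def continue_task(self, task_id, extra_run_time):
        if task_id not in self.checkpoints:
            raise ValueError("run was not started with allow_continue=True")
        return f"resumed from checkpoint for {extra_run_time} more seconds"

web = FakeWebAPI()
web.run("sim", "task-1", allow_continue=True)
print(web.continue_task("task-1", 1e-12))
```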

Your suggestion is definitely convenient from a user perspective when it works right, but there are indeed some edge cases where it may be tricky, either for us or for the user. Also, from our perspective it makes resource allocation harder (we can do it purely based on the number of grid points, but then some long-run_time simulations may be put on small workers and take quite long, for example). But anyway, going back to the edge cases, e.g.

tylerflex commented 1 year ago

What I always found confusing was why the checking is done every "x"% of the total run time instead of every "n" time steps (for example). I assume the former keeps the checking time low compared to the total simulation time, but the latter seems much more stable and better able to resolve decay. It also seems that if you check every n time steps, the added percentage increase in total simulation time is roughly constant with respect to the number of time steps, right? (And could therefore be contained to a reasonable level, like < 1%.) I think if we can get that part down, allowing a simulation to run continuously (until decay) seems relatively straightforward.
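The constant-overhead point is simple back-of-the-envelope arithmetic: if one decay check costs roughly as much as `k` regular time steps, checking every `n` steps adds a fixed fractional overhead `k/n`, independent of the total number of time steps. The cost numbers below are illustrative assumptions.

```python
def check_overhead_fraction(k_steps_per_check, n_steps_between_checks):
    """Fractional runtime overhead from decay checks.

    Independent of total step count: (N/n checks * k cost) / (N steps) = k/n.
    """
    return k_steps_per_check / n_steps_between_checks

# e.g. a check costing ~1 step's worth of work, done every 200 steps:
print(check_overhead_fraction(1, 200))  # 0.005, i.e. < 1% extra time
```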

momchil-flex commented 1 year ago

> What I always found confusing was why the checking is done every "x"% of the total run time instead of every "n" time steps (for example).

Yeah, that's probably deep legacy, maybe from before we even had "shutoff", when we just wanted to log progress in % done.

> I assume the former keeps the checking time low compared to the total simulation time, but the latter seems much more stable and better able to resolve decay. It also seems that if you check every n time steps, the added percentage increase in total simulation time is roughly constant with respect to the number of time steps, right? (And could therefore be contained to a reasonable level, like < 1%.)

Yeah, I think this could be nice; we just need to experiment to find a good way to set it. I do wonder now whether it should be a fixed number of time steps, as opposed to something that depends on the simulation details like the source. The number of time steps per optical cycle depends, for example, on the spatial resolution, which forces the time step to be short compared to the optical period. So you could have two simulations with the same number of cells and the same number of time steps, but checking at very different intervals. The most constant thing, then, is to just define a fixed number of time steps?
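The resolution dependence above can be illustrated numerically: with a Courant-limited time step dt ~ C · dx / c0, refining the grid shrinks dt, so the same source period spans more time steps. The 1D Courant form and the numbers are simplifying assumptions, not tidy3d's exact time stepping.

```python
C0 = 299792458.0  # speed of light in vacuum (m/s)

def steps_per_optical_cycle(dx_m, freq0_hz, courant=0.9):
    """Time steps per source period under a 1D Courant-limited dt."""
    dt = courant * dx_m / C0   # stability-limited time step
    period = 1.0 / freq0_hz    # source optical period
    return period / dt

# Same 200 THz source, two grid resolutions:
coarse = steps_per_optical_cycle(50e-9, 200e12)
fine = steps_per_optical_cycle(10e-9, 200e12)
print(round(coarse), round(fine))  # ~33 vs ~167 steps per cycle
```

Checking every fixed number of optical periods would therefore check at very different step intervals in these two runs, while a fixed step count would not.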

> I think if we can get that part down, allowing a simulation to run continuously (until decay) seems relatively straightforward.

What do you mean? What about the edge cases where this could lead to an unnecessarily long run time?

fhernandez93 commented 5 months ago

Hello, is this actually going to be implemented? I would like to test the convergence of the transmittance through a huge structure at different simulation times without having to run the simulation from scratch each time.

momchil-flex commented 5 months ago

We want to have some version of this in the long run, but it is not on our road map currently.

For what you want to do, can't you just use a FluxTimeMonitor at some specific times?

Alternatively, if you're not looking for the instantaneous transmission but rather for the frequency-domain transmission up to a given point in time, you could set multiple flux monitors and use ApodizationSpec to exclude a part of the simulation run time. However, note that I'm really not sure whether this approach has a meaningful physical interpretation.
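The apodization idea can be sketched in plain Python (this is not the tidy3d `ApodizationSpec` implementation; the Gaussian edge shape and parameter names are assumptions): weight the time signal so that only times after a chosen start contribute to the frequency-domain result, with a smooth turn-on rather than a hard cutoff.

```python
import math

def apodization_window(t, t_start, width):
    """Weight applied to the time signal: ~0 well before t_start,
    rising smoothly (Gaussian edge of the given width) to 1 after it."""
    if t >= t_start:
        return 1.0
    return math.exp(-0.5 * ((t - t_start) / width) ** 2)

print(apodization_window(0.0, 1e-13, 2e-14))    # ~0: early times excluded
print(apodization_window(2e-13, 1e-13, 2e-14))  # 1.0: late times kept
```

Two monitors with different `t_start` values would then give frequency-domain results that effectively "start" at different points of the run.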

fhernandez93 commented 5 months ago

Thanks for your reply.

Although I understand that my approach may not align with physical accuracy, I find it necessary: I use two monitors positioned on either side of a structure with a bandgap to efficiently determine when there is satisfactory agreement with Poynting's theorem, at which point we can say the simulation has converged well. We take this approach because we are looking to determine the best run time for our simulations (which can be huge and costly), so we don't waste resources unnecessarily.

Also, one thing we have observed using Meep is that although the transmission measured at the two monitors should give exactly the same results (Poynting's theorem), when there is a bandgap the difference between the two monitors can reach several orders of magnitude even for very long runs. I suppose this is an effect of discretization. I'd like to know if you are aware of this effect (I didn't find anything in the literature, but didn't search too extensively) and how it could be cured, if possible at all.

Best,

Francisco.


momchil-flex commented 5 months ago

> Thanks for your reply. Although I understand that my approach may not align with physical accuracy, I find it necessary: I use two monitors positioned on either side of a structure with a bandgap to efficiently determine when there is satisfactory agreement with Poynting's theorem, at which point we can say the simulation has converged well. We take this approach because we are looking to determine the best run time for our simulations (which can be huge and costly), so we don't waste resources unnecessarily.

I see, so my suggestion to have one long simulation with multiple monitors with different apodization specs doesn't really work for you; you want to keep running the time stepping after examining the current results. We may have this capability at some point, but not currently, unfortunately.

> Also, one thing we have observed using Meep is that although the transmission measured at the two monitors should give exactly the same results (Poynting's theorem), when there is a bandgap the difference between the two monitors can reach several orders of magnitude even for very long runs. I suppose this is an effect of discretization. I'd like to know if you are aware of this effect (I didn't find anything in the literature, but didn't search too extensively) and how it could be cured, if possible at all.

Actually, I don't really understand the setup and the issue you are describing. If you want me to think about it, I'd need a more concrete example of what you are simulating, what you expect to happen, and why.

momchil-flex commented 5 months ago

The images are not showing for some reason...

fhernandez93 commented 5 months ago

Sorry, let me explain in a bit more detail. I have a setup like this (the structure is a disordered dielectric network):

[image attachment]

In 3d the structure is something like this:

[image attachment]

And according to Poynting's theorem the flux should be conserved across the volume, so we should obtain the same results on both monitors, but for this kind of structure we always obtain something like this:

[image attachment]

Let me know if this makes sense.

momchil-flex commented 5 months ago

I see. The only thing that comes to mind is long-lived resonances. It does seem like the flux is conserved as expected at the short- and long-wavelength ends, where presumably there isn't much light bouncing back and forth? So it may be a matter of having to wait a really long time...
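One way to quantify the monitor disagreement being discussed is the per-wavelength mismatch between the two flux spectra, expressed in orders of magnitude. The function below is a simple sketch; the spectra values are made up for illustration and only mimic the qualitative shape described above (agreement at the band edges, a large gap mid-band).

```python
import math

def flux_mismatch_orders(flux_in, flux_out):
    """log10 ratio of incident-side to transmitted-side flux, per point.
    0 means the monitors agree; 3 means three orders of magnitude apart."""
    return [math.log10(a / b) for a, b in zip(flux_in, flux_out)]

# toy spectra: agreement at the band edges, a big gap mid-band
flux_front = [1.0, 0.9, 0.8]
flux_back = [0.99, 9e-4, 0.79]
print([round(m, 2) for m in flux_mismatch_orders(flux_front, flux_back)])
```

Tracking this mismatch versus run_time (rather than a single long run) would show directly whether waiting longer closes the gap or whether it saturates at a discretization-limited floor.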