NanoComp / meep

free finite-difference time-domain (FDTD) software for electromagnetic simulations
GNU General Public License v2.0

Feature Request: Parallel near2far calculation #130

Closed. erikshipton closed this issue 6 years ago

erikshipton commented 6 years ago

The near2far calculation for 3D geometries takes a very long time (often longer than the simulation itself). It would be nice to get this part of the code parallelized. This would be especially useful for metasurface antenna type problems.

Another option might be to export the DFT'd near fields in a readable format so that we could write our own parallel near2far transformation in MATLAB/Python. Currently we need to save the full near field at every timestep to do the transform, which results in a lot of data and slows the calculation.
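
For reference, the "write our own near2far transformation" idea corresponds to the textbook frequency-domain surface-equivalence transform sketched below, applied per DFT frequency with wavenumber k in the surrounding homogeneous medium; Meep's built-in near2far is based on the same kind of Green's-function surface integral, though its exact conventions may differ.

```math
\mathbf{J}_s = \hat{n}\times\mathbf{H}, \qquad \mathbf{M}_s = -\hat{n}\times\mathbf{E}, \qquad
\mathbf{A}(\mathbf{x}) = \int_S \mathbf{J}_s \, \frac{e^{ik|\mathbf{x}-\mathbf{x}'|}}{4\pi|\mathbf{x}-\mathbf{x}'|} \, dS', \qquad
\mathbf{F}(\mathbf{x}) = \int_S \mathbf{M}_s \, \frac{e^{ik|\mathbf{x}-\mathbf{x}'|}}{4\pi|\mathbf{x}-\mathbf{x}'|} \, dS'
```

The far-zone E and H then follow from the potentials A and F, so the cost is one surface integral over the near-field surface S per far-field point per frequency, which is why evaluating many far-field points dominates.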

stevengj commented 6 years ago

The calculation is parallelized. However, it may not be load-balanced very well.

The problem is that Meep chops up the geometry into roughly equal chunks. The near2far calculation is performed in parallel, with each chunk computing its contribution to the far field. However, if most of the near2far source region is located in a single chunk, then it is all computed on that chunk.

It would be nice if Meep could be smarter about how it divides up the geometry among the processors.
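
For illustration, a recent Meep Python build can show this chunk division directly; the sketch below assumes Simulation.visualize_chunks() is available in your version, and all sizes, resolutions, and frequencies are placeholder values rather than anything from this issue.

```python
# Sketch: inspect how Meep divides the cell among MPI processes, to see
# whether a 2D near2far surface in a 3D cell ends up mostly in one chunk.
# Assumes a Meep version providing Simulation.visualize_chunks() (matplotlib).
import meep as mp

sim = mp.Simulation(cell_size=mp.Vector3(8, 8, 8),
                    resolution=20,
                    boundary_layers=[mp.PML(1.0)])

# a 2D near-field plane in a 3D cell, as in the use case above
n2f = sim.add_near2far(1.0, 0, 1,
                       mp.Near2FarRegion(center=mp.Vector3(0, 0, 2),
                                         size=mp.Vector3(6, 6, 0)))

sim.init_sim()          # force the chunk decomposition to be computed
sim.visualize_chunks()  # plot chunk ownership; run under mpirun to see >1 chunk
```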

erikshipton commented 6 years ago

It sounds like my near2far geometries are always ending up in a single chunk (a 2D slice of a 3D geometry). Is there any way I can place my near2far surface intelligently such that it's divided evenly among the chunks?

Or is there any way to "re-chunk" or re-divide the problem for this calculation?

oskooi commented 6 years ago

We recently added an FAQ which explains why parallelization may not speed up calculations involving near2far and other parallelized routines.

stevengj commented 6 years ago

Currently there is no way to customize how the computational cell is divided among the processes.

erikshipton commented 6 years ago

Is there any way to dechunk the problem after running? Somehow reset the number of processors to 1? If that were possible, we could use Python's multiprocessing library to parallelize the near2far transformation by calling a bunch of get_farfield() invocations in parallel. It's similar to MATLAB's parfor or GNU parallel, and Python's multiprocessing gives really nice speedups. The problem is that when using multiprocessing along with MPI, I oversubscribe the processors (each Python subprocess launches a get_farfield() that uses the number of processes specified in mpirun).
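
For illustration only, the kind of post-processing being described might look like the sketch below. It assumes Meep was run on a single process (np=1), so that forked workers each inherit a complete copy of the fields; the geometry, source, and frequency values are placeholders, and combining this with mpirun -np > 1 is exactly the oversubscription problem noted above.

```python
# Hypothetical sketch: after a serial (np=1) Meep run, parallelize the
# get_farfield() calls with Python's multiprocessing. All numbers are placeholders.
import multiprocessing
import numpy as np
import meep as mp

fcen = 1.0
sim = mp.Simulation(cell_size=mp.Vector3(4, 4, 4),
                    resolution=15,
                    boundary_layers=[mp.PML(1.0)],
                    sources=[mp.Source(mp.GaussianSource(fcen, fwidth=0.2),
                                       component=mp.Ez,
                                       center=mp.Vector3())])
n2f = sim.add_near2far(fcen, 0, 1,
                       mp.Near2FarRegion(center=mp.Vector3(0, 0, 1.5),
                                         size=mp.Vector3(3, 3, 0)))
sim.run(until_after_sources=50)

def farfield_at(point):
    # each call performs its own surface integral over the near-field plane
    return sim.get_farfield(n2f, point)

if __name__ == "__main__":
    r = 1000.0  # far-field radius (placeholder)
    points = [mp.Vector3(r * np.sin(t) * np.cos(p),
                         r * np.sin(t) * np.sin(p),
                         r * np.cos(t))
              for t in np.linspace(0, 0.5 * np.pi, 25)
              for p in np.linspace(0, 2 * np.pi, 50, endpoint=False)]
    # "fork" lets workers inherit sim/n2f without pickling them (POSIX only)
    with multiprocessing.get_context("fork").Pool(processes=8) as pool:
        fields = pool.map(farfield_at, points)  # 6*nfreq complex components per point
```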

stevengj commented 6 years ago

"dechunking" would involve changing the data distribution. While anything is technically possible, this is slow and potentially dangerous because the simulation may require more memory than is available on a single processor.

If you are willing to store the entire simulation in the memory of a single compute node, then a better approach might be to support something like OpenMP (shared-memory threading) as an alternative to (or in addition to) MPI parallelism. In principle, this could be pretty straightforward (just adding #pragma directives in front of key loops), and OpenMP will automatically load balance.

erikshipton commented 6 years ago

Are there any wonky workarounds that I might be able to use? Maybe saving the near fields, closing Meep, re-importing serial Meep (or parallel Meep with np=1), opening the near-fields file, and calling get_farfield() from Python's multiprocessing?

Currently I am willing to store the entire simulation on a single compute node (I assume one node still has many cores for parallelization) and use OpenMP. There might be future problems where the memory requirements make this impractical, but I'm not there yet. All my problems at the moment are compute-limited, not memory-limited.

Thanks for thinking about this problem.

Erik

stevengj commented 6 years ago

No, short of implementing OpenMP support yourself or something similarly coding-intensive.

soamaven commented 5 years ago

I'm struggling with near2far performance also. I am looping over phi, theta, and r at a reasonable resolution. In 2D simulations this is not a significant limitation. For 3D simulations and radiation fields on a spherical surface, this can take nearly as long as the FDTD simulation itself (or longer for higher angular resolution). Unfortunately, my specific geometry has broken symmetry, so I can't be clever there.

get_farfield natively returns results for all nfreq frequencies that the near2far flux plane was instantiated to record, so that part is handled at a low level. However, repeated calls in such a loop to different points on a hemisphere could conceivably be adding significant overhead. Could the repeated calls to get_farfield be contributing significantly to this bottleneck? I am using the Python interface.
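
For concreteness, the loop being described looks roughly like the sketch below; it reuses the hypothetical `sim` and `n2f` from the multiprocessing sketch earlier in this thread, and the angular resolution and Poynting-flux post-processing are illustrative rather than taken from this comment. Each get_farfield() call performs its own surface integral, so the cost grows linearly with the number of (theta, phi) samples.

```python
# Sketch of the theta/phi far-field loop, reusing the hypothetical `sim` and
# `n2f` defined in the earlier sketch (nfreq = 1 there).
import numpy as np
import meep as mp

r = 1000.0
ntheta, nphi = 100, 200  # angular resolution (placeholder)
thetas = np.linspace(0, 0.5 * np.pi, ntheta)
phis = np.linspace(0, 2 * np.pi, nphi, endpoint=False)

pattern = np.zeros((ntheta, nphi))
for i, t in enumerate(thetas):
    for j, p in enumerate(phis):
        rhat = np.array([np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)])
        ff = sim.get_farfield(n2f, mp.Vector3(*(r * rhat)))  # Ex,Ey,Ez,Hx,Hy,Hz per frequency
        E, H = np.array(ff[0:3]), np.array(ff[3:6])          # first (only) frequency
        # radial Poynting flux ~ Re(E x H*) . rhat gives the radiation pattern
        pattern[i, j] = np.real(np.cross(E, np.conj(H))).dot(rhat)
```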

Perhaps I am doing something wrong or inefficiently. Any recommendations? Does it make sense to ask for get_farfield to be extended to accept an array of points and handle them at a lower level?

I've messed around with adding OpenMP pragmas, but could never figure out which loops would benefit most as the source stands. I am supposing that wrapping OpenMP around a loop over far-field points could be a good route here.

stevengj commented 5 years ago

The main loops that need to be parallelized are this one and this one.

Unfortunately the OpenMP pragmas don't work that well for nested loops like the one created by the LOOP_OVER_IVECS macro, so you'd need something like #565.

Alternatively, our upcoming work on better load balancing will help here for the MPI case.