NanoComp / meep

free finite-difference time-domain (FDTD) software for electromagnetic simulations
GNU General Public License v2.0

Improved MPI performance #2445

Open smartalecH opened 1 year ago

smartalecH commented 1 year ago

I think there are a few ways we could improve the performance of our chunk halo exchanges (which are done with MPI).

First, we could use remote memory access (RMA), i.e. "one-sided" communication, to perform the halo exchange. This feature has been supported by most MPI implementations for quite some time now, and the literature seems to suggest substantial speedups over our current two-sided communication approach.
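Just to make the idea concrete, a fenced exchange of a packed boundary buffer could look something like the sketch below. The `pack_boundary`/`unpack_boundary` helpers and the single-neighbor structure are placeholders for illustration, not our actual chunk API:

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical sketch: each rank packs its outgoing boundary values into a
// contiguous buffer, exposes a receive buffer through an RMA window, and the
// neighbor writes into it with MPI_Put instead of posting a matching receive.
void rma_halo_exchange(double *field, int halo_count, int neighbor_rank) {
  std::vector<double> sendbuf(halo_count), recvbuf(halo_count);

  // Expose the receive buffer as a one-sided window (collective call).
  MPI_Win win;
  MPI_Win_create(recvbuf.data(), (MPI_Aint)(halo_count * sizeof(double)),
                 sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  // pack_boundary(field, sendbuf.data(), halo_count);  // gather strided boundary values (assumed helper)

  MPI_Win_fence(0, win);  // open the access/exposure epoch
  MPI_Put(sendbuf.data(), halo_count, MPI_DOUBLE, neighbor_rank,
          /*target_disp=*/0, halo_count, MPI_DOUBLE, win);
  MPI_Win_fence(0, win);  // close the epoch; recvbuf now holds the neighbor's halo

  // unpack_boundary(recvbuf.data(), field, halo_count);  // scatter into ghost cells (assumed helper)

  MPI_Win_free(&win);
}
```

In a real implementation the window would presumably be created once at setup (window creation is collective and not cheap) and reused every timestep, possibly with passive-target synchronization (`MPI_Win_lock_all`/`MPI_Win_flush`) instead of fences.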

Second, are we currently forced to perform halo updates even for chunks that live on the same process? If so, is there a more efficient approach, e.g. "indexing into" a global array that those chunks share? (I realize this gets hairy when we try to use the same iterator for all of the fields in a chunk's timestepping kernel, and that different chunks will have different kernels requiring multiple iterators per chunk... but maybe it's worth it?) I probably don't grasp all the nuances here...

stevengj commented 1 year ago

We can definitely use one-sided access functions. One wrinkle is that we don't want to read the boundary values point by point — to cut down on the latency, I think we really need to collect the boundary values into a single buffer before communicating them (as we do now). Once you do that, I'm not sure how much you'll gain over MPI_Sendrecv. Should be reasonable to attempt, though.
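For comparison, a buffered two-sided exchange along these lines (same placeholder helper names as above; not necessarily the exact calls meep makes today) looks like:

```cpp
#include <mpi.h>
#include <vector>

// Rough shape of a buffered two-sided exchange: pack the boundary values into
// one contiguous buffer so only a single message per neighbor is exchanged,
// then swap buffers with a blocking MPI_Sendrecv.
void sendrecv_halo_exchange(double *field, int halo_count, int neighbor_rank) {
  std::vector<double> sendbuf(halo_count), recvbuf(halo_count);
  // pack_boundary(field, sendbuf.data(), halo_count);  // assumed helper

  MPI_Sendrecv(sendbuf.data(), halo_count, MPI_DOUBLE, neighbor_rank, /*sendtag=*/0,
               recvbuf.data(), halo_count, MPI_DOUBLE, neighbor_rank, /*recvtag=*/0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  // unpack_boundary(recvbuf.data(), field, halo_count);  // assumed helper
}
```

Once the boundary values are packed into a single buffer per neighbor, both variants send the same amount of data, so any gain from RMA would have to come from reduced synchronization/latency overhead.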

Halo updates for chunks that live on the same process don't use MPI; they just do a memcpy, so the main cost is collecting the boundary values to/from buffers. In principle we could have shared arrays for the fields, instead of multiple chunks, and only keep separate arrays for the PML auxiliary fields, but this gets a bit hairy… seems like a big pain?
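Concretely, the same-process case is essentially the following (placeholder strides and helpers, for illustration only):

```cpp
#include <cstring>
#include <vector>

// Sketch of a same-process halo update: no MPI at all, just gather the
// (possibly strided) boundary values of the source chunk into a contiguous
// buffer and scatter them into the neighboring chunk's ghost cells. The
// gather/scatter is the dominant cost, not the copy itself.
void local_halo_copy(const double *src_chunk, double *dst_chunk,
                     int halo_count, int src_stride, int dst_stride) {
  std::vector<double> buf(halo_count);
  for (int i = 0; i < halo_count; ++i)  // gather from the source chunk boundary
    buf[i] = src_chunk[i * src_stride];
  for (int i = 0; i < halo_count; ++i)  // scatter into the destination ghost cells
    dst_chunk[i * dst_stride] = buf[i];
  // If both sides were contiguous, the whole thing would collapse to:
  // std::memcpy(dst_chunk, src_chunk, halo_count * sizeof(double));
}
```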