It would be good to revive the OpenMP work (see #228 and #1234), and in particular to support hybrid parallelism — use shared-memory threads (via OpenMP) within a node, and MPI across nodes, in a parallel cluster.
One big motivation is to save memory for the case of 3d design grids. We replicate the material grid (as well as several similar-size data structures for gradients and other vectors internal to NLopt) for every MPI process — this would allow us to have only one copy per node of a cluster.
In particular, @mochen4 is now doing 3d topology optimization for a 3d-printed structure, and we are finding that we are starting to run out of memory on machines with 40 CPUs per node, since the 3d material grid (which can be gigabytes in size) gets replicated 40 times.
@smartalecH, did you have a branch that could be rebased here?
I actually implemented hybrid OpenMP/MPI support for a class project last year. My branch is here.
In short, I updated all the LOOP macros to support OpenMP. I had to make several other changes to prevent things like false sharing. Unfortunately, the performance was almost always worse with OpenMP. Here is a report describing everything I did, along with several numerical experiments, although I'm sure a more careful approach would yield better results.
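For context, here is a generic illustration (not Meep's actual LOOP macros or field arrays) of the pattern being described: each update loop gets an OpenMP pragma, and shared accumulators are replaced by reductions so that threads keep private partial sums instead of writing to adjacent shared memory (one common source of false sharing).

```cpp
// Illustrative sketch only: OpenMP-parallelizing a field-update-style loop.
// Compile with, e.g., g++ -O2 -fopenmp example.cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
  const int n = 1 << 20;
  std::vector<double> e(n, 1.0), h(n, 0.5);
  double energy = 0.0;

  // Each thread handles its own range of i; the reduction clause gives every
  // thread a private accumulator, so no two threads update neighboring
  // memory locations for the running sum.
  #pragma omp parallel for reduction(+ : energy)
  for (int i = 0; i < n; ++i) {
    e[i] += 0.1 * h[i];
    energy += e[i] * e[i];
  }

  std::printf("energy = %g (max threads = %d)\n", energy, omp_get_max_threads());
  return 0;
}
```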
this would allow us to have only one copy per node of a cluster.
I think we discussed implementing this feature when we add the backprop step to the new subpixel-averaging feature. We need to revamp quite a few things for this to work (e.g. storing the forward-run fields efficiently even after we reset Meep to do an adjoint run). Allocating the initial material grid on just one node in Python is a bit hairy too (note that cluster job managers typically allocate memory per process -- it's somewhat tricky to let one process's memory allocation spill over into another's, even within the same node). Would the optimizer just run on one process? Either way, memory management gets tricky with that approach too.
But I agree, it's an important feature.
The point is that if we use OpenMP to multi-thread within the node, and only use MPI across nodes, then it will automatically allocate the material grid once per node and have one NLopt instance per node — from Meep's perspective, it is really just a single process except that some of the loops are faster.
I think this is worth having even if it is currently somewhat slower than pure MPI, because it allows vastly better memory efficiency (for large material grids) with minimal effort.
(An alternative would be to use explicit shared-memory allocation, e.g. with MPI.Win.Create() in mpi4py, plus a whole bunch of additional synchronization to ensure that we only use one NLopt instance per node. (NLopt also allocates data proportional to the number of DOF!) Worse, this synchronization would leak into user code — the user Python scripts would have to be carefully written to avoid deadlocks if they are not doing the same things on all processes.)
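For reference, here is a minimal sketch of that explicit shared-memory alternative, written against the MPI-3 C API that mpi4py wraps (MPI_Comm_split_type plus MPI_Win_allocate_shared) rather than MPI.Win.Create() itself; the array size and variable names are made up for illustration.

```cpp
// Sketch: allocate one copy of a design-variable array per shared-memory node.
// Compile with, e.g., mpicxx example.cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Split COMM_WORLD into one communicator per shared-memory node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);
  int node_rank;
  MPI_Comm_rank(node_comm, &node_rank);

  const MPI_Aint ngrid = 1000000; // e.g. number of design variables (made up)
  MPI_Aint local_size = (node_rank == 0) ? ngrid * sizeof(double) : 0;

  // Only rank 0 on each node contributes memory; everyone gets a window handle.
  double *grid = nullptr;
  MPI_Win win;
  MPI_Win_allocate_shared(local_size, sizeof(double), MPI_INFO_NULL, node_comm,
                          &grid, &win);

  // The other ranks on the node query the base address of rank 0's segment.
  if (node_rank != 0) {
    MPI_Aint size;
    int disp_unit;
    MPI_Win_shared_query(win, 0, &size, &disp_unit, &grid);
  }

  // All ranks on the node can now read `grid` directly; any writes need
  // explicit synchronization (fences/barriers), which is exactly the
  // bookkeeping that would leak into user scripts.
  MPI_Barrier(node_comm);
  if (node_rank == 0) std::printf("shared grid allocated once per node\n");

  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```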
then it will automatically allocate the material grid once per node and have one NLopt instance per node
Got it. My branch should be able to do exactly what you are describing.
even if it is currently somewhat slower than pure MPI
I'd say it's worse than "somewhat slower". Note that for some test examples (on a single node with 40 cores) MPI achieved a 20x speedup, while OpenMP achieved only a 5x speedup. Plus, the initialization process was significantly slower for OpenMP (even though I wrapped all of those loops too).
That being said, maybe @mochen4 can try using my branch as an initial investigation. It only needs minor rebasing.
We can then (hopefully) identify the bottlenecks and resolve them after some experimentation. @oskooi it might be worth pulling in some extra resources toward investigating this, especially since I already have a rather mature starting point and we only need to fine-tune from here.
Worse, this synchronization would leak into user code — the user Python scripts would have to be carefully written to avoid deadlocks
I agree, this sounds painful...
On MIT Supercloud, configured with both "with-openmp" and "with-mpi", the following make check Python tests failed (setting RUNCODE accordingly in each case, per https://github.com/NanoComp/meep/issues/1671):
- 1 process and 1 thread: adjoint_jax, mode_coeffs, mode_decomposition, source: SegFault. simulation: under "with-mpi", it required more than 1 process (https://github.com/NanoComp/meep/blob/89c93498ead04290b42acbb35b2c09c9813e04eb/python/tests/test_simulation.py#L126-L128).
- 2 processes and 1 thread: same SegFault tests as above.
- 1 process and 2 threads: adjoint_jax, get_point, mode_decomposition, multilevel_atom, source, user_defined_material: SegFault. adjoint_solver, bend_flux, material_grid, simulation: AssertionError (i.e. results mismatch). mode_coeffs: singular matrix in inverse.
- 2 processes and 2 threads: same as above.
With the master branch, adjoint_jax, mode_coeffs, mode_decomposition, and source also failed with SegFault. All C++ tests passed in all cases.
If you add --enable-debug to the configure flags, can you get a stack trace from the segfault to find out where the crash is? See https://www.open-mpi.org/faq/?category=debugging#general-parallel-debugging
It might be possible to automatically print a stacktrace on a segfault: https://stackoverflow.com/questions/77005/how-to-automatically-generate-a-stacktrace-when-my-program-crashes
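For what it's worth, the usual recipe from that Stack Overflow thread is to install a SIGSEGV handler that dumps a backtrace via glibc; a minimal sketch (assumes Linux/glibc; compile with -g and link with -rdynamic so symbol names are readable):

```cpp
// Sketch: print a rough backtrace when the process segfaults.
#include <execinfo.h>
#include <csignal>
#include <unistd.h>

static void segfault_handler(int sig) {
  void *frames[64];
  int n = backtrace(frames, 64);
  // backtrace_symbols_fd writes directly to a file descriptor, so it is safe
  // to call from a signal handler (no malloc).
  backtrace_symbols_fd(frames, n, STDERR_FILENO);
  _exit(sig);
}

int main() {
  std::signal(SIGSEGV, segfault_handler);

  // ... run the code under test; the null dereference below just
  // demonstrates the handler.
  volatile int *p = nullptr;
  return *p;
}
```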
See if you can reproduce the problem on your local machine rather than on Supercloud.
Try the suggestion here: https://github.com/NanoComp/meep/pull/1628/files#r670728520
Hello there! Let me thank you so much for all the effort you have put into Meep; it enabled me to speed things up a lot on the HPC cluster I use, compared with the license-count limitations of Lumerical. I am presently working on topology optimizations with 3d grids using pymeep. I read something about a possible hybrid MPI/OpenMP support implementation in Meep; OpenMP support in addition to MPI would help enormously in the 3d problems I am treating now, since I am typically constrained to use a quarter of the available cores in each node before performance degradation and memory exhaustion set in. Is such an implementation already available in the nightly build of pymeep? I would really be glad to test it and report issues.
Is such an implementation already available in the nightly build of pymeep
No, but you can build off of the branch in #1628 to test the preliminary features there. While you'll save on memory, there's still quite a bit of improvement needed to get the OpenMP computational scalability up to where it is currently with pure MPI.
Does the branch in #1628 already include parallelization of the whole simulation, time-stepping included? If I can, I'll try to compile it, but the demanding situation I need to test it for (large-area 3d optimizations) requires an HPC cluster, and I don't have root access, which is why I asked about pymeep... BTW, from some job-scheduling tests, my large-area 3d optimizations look badly limited by local memory bandwidth within each node, to the point that I obtain the best runtime performance by limiting to 10-20 processes on a 64-core (HT), 2-socket node, even with local memory binding and process-to-core pinning; do you think your OpenMP implementation could help with that, if used in hybrid fashion with MPI? In any case, thank you for the fast answer!
Does the branch in #1628 already include parallelization of the whole simulation, time-stepping included
Yes, that's actually the core contribution of the PR. Other forms of parallelization (e.g. simulation initialization) are not yet implemented due to thread-safety issues.
I don't have root access
You don't need root access to compile meep. I've actually compiled meep on over a dozen HPC clusters myself (without root access), all for large-scale, 3D TO.
limited by local memory bandwidth
The FDTD algorithm is inherently memory bound, so there's only so much you can do here (unless you want to implement some fancy updating schemes that can cache multiple steps).
I obtain the best runtime performance by limiting to 10-20 processes on a 64-core (HT), 2-socket node
Choosing the proper hardware configuration is a longstanding issue, as you must carefully balance your computation and communication loads, which ultimately requires some extensive trial-and-error on the platform you are using.
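If it helps, here is a small standalone diagnostic (not part of Meep; purely illustrative) that reports how many MPI ranks landed on each node and how many OpenMP threads each rank will use, which makes that trial-and-error a little less blind:

```cpp
// Sketch: report the hybrid MPI x OpenMP layout a job actually received.
// Compile with, e.g., mpicxx -fopenmp layout.cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Group the ranks that share a node to count processes per node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);
  int ranks_per_node;
  MPI_Comm_size(node_comm, &ranks_per_node);

  if (world_rank == 0)
    std::printf("%d MPI ranks total, %d per node, %d OpenMP threads per rank\n",
                world_size, ranks_per_node, omp_get_max_threads());

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```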
do you think your OpenMP implementation could help with that
Maybe. Not with the current branch, though. As I said, a lot of effort is still needed to bring performance even to the level of the current MPI build. The biggest advantage of the hybrid approach right now is reducing the number of duplicate Python instances, each with its own set of variables (e.g. 3D design variables).