Closed: bhazelton closed this pull request 1 year ago.
Patch coverage: 100.00% and project coverage change: +0.04% :tada:
Comparison is base (83fa24a) 99.45% compared to head (584a7a8) 99.49%.
:umbrella: View full report in Codecov by Sentry.
@bhazelton overall the code looks good. However, I'd suggest something similar to @aelanman: I don't think the code that does the MPI sharing should be attached to any particular class that is being shared. To me, separation of concerns (or the single responsibility principle) would dictate making a class/function in the mpi module that can share an arbitrary object and that also maintains an internal "global" window list that can be released. Having all the MPI machinery inside the SkyModel and BeamList classes makes them less usable by other code.
I'm realizing that I don't understand how MPI knows which shared array is which. It seems like the tracking is tied to the objects, so I'm not sure the sharing can be moved out of the objects without breaking things. The freeing can be moved into a function as suggested in @aelanman's point 1. But I think you're asking for something more, @steven-murray: you want the sharing moved out of the objects as well?
@bhazelton yeah, I'd probably have to look at the details more closely to understand what's really possible. In my mind, the ideal situation would be an API like the following:
from pyuvsim import mpi, SkyModel, BeamList
sky = SkyModel()
beams = BeamList()
mpi.share(sky, beams)
do_stuff(sky, beams)
mpi.free(sky, beams)
It might not be possible to do something like that -- an alternative might be to provide a key:
from pyuvsim import mpi, SkyModel, BeamList
sky = SkyModel()
beams = BeamList()
mpi.share('sky_key', sky)
mpi.share('beam_key', beams)
do_stuff(sky, beams)
mpi.free('sky_key', 'beam_key')
One could imagine having an mpi.free_all() function as well.
@aelanman @mkolopanis I think you two are best placed to tell us what is most likely to work here. I really don't understand how the different shared arrays are being kept track of in MPI...
Maybe this is bad form, but could you have a module-level list that gets appended to every time a new window is created, then register a cleanup function at exit? e.g., in mpi.py:
import atexit
import numpy as np
from mpi4py import MPI

# Module-level registry of open shared-memory windows.
_OPEN_WINDOWS = []

def shared_mem_bcast():
    ...
    win = MPI.Win.Allocate_shared(nbytes, itemsize, comm=node_comm)
    buf, itemsize = win.Shared_query(0)
    sh_arr = np.ndarray(buffer=buf, dtype=dtype, shape=shape)
    _OPEN_WINDOWS.append(win)  # remember the window so it can be freed at exit
    ...

def _window_cleaning():
    for win in _OPEN_WINDOWS:
        win.Free()

atexit.register(_window_cleaning)
This leaves the responsibility of tracking open MPI windows to the mpi module, so the BeamList and SkyModelData classes don't need any modification.
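For what it's worth, the calling side would then reduce to something like this (illustrative only; I'm assuming shared_mem_bcast takes the array populated on the root rank and returns the shared copy, and that the usual MPI setup has already happened):

```python
import numpy as np

from pyuvsim import mpi

# Stand-in for a large array that only the root rank actually needs to build.
flux_array = np.ones((128, 1024))

# Every rank gets a view backed by node-shared memory; the caller does no
# window bookkeeping at all.
shared_flux = mpi.shared_mem_bcast(flux_array)

# ... use shared_flux in the simulation ...
# No explicit Free() needed: the atexit hook in the mpi module releases the windows.
```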
@aelanman's suggestion is approximately what I was thinking of. The only catch would be if different lists of windows needed to be freed at different times (e.g., if you had two simulations running, or something like that?). Then you'd need a way to keep track of which sub-list belongs to which simulation. If all you need to do is make sure everything is freed before exiting Python (and you don't care exactly when, as long as it's after the computation is done), then @aelanman's suggestion works perfectly.
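If that ever did become a real need, one low-effort way to get per-simulation groups on top of a single module-level list would be something like this (purely hypothetical, building on the _OPEN_WINDOWS sketch above):

```python
from contextlib import contextmanager


@contextmanager
def window_group():
    """Free only the windows opened inside this block (e.g. one simulation's worth)."""
    start = len(_OPEN_WINDOWS)
    try:
        yield
    finally:
        # Release and forget the windows created since the block was entered.
        for win in _OPEN_WINDOWS[start:]:
            win.Free()
        del _OPEN_WINDOWS[start:]
```

Each with mpi.window_group(): block would then clean up its own windows, while the atexit hook still catches anything allocated outside a group.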
Yeah, this would definitely be another way to do it. I don't expect we have much of a user base that runs simulations over and over in something like a Jupyter notebook. I'm personally more inclined to clean up resources as soon as you're done using them, but either way should work fine.
OK, we decided as a group on the telecon to move the shared window tracking to the mpi module and free it from there with an atexit cleanup call.
Description
The main problem we were seeing in some CI builds (and on my machine) seems to be related to not properly freeing shared memory resources. This PR adds code to keep track of those resources so they can be freed at the end of the simulation (or when errors are raised).
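As a generic Python illustration (not the pyuvsim code itself) of why an atexit-registered cleanup also covers runs that die with an exception:

```python
import atexit


def cleanup():
    # In pyuvsim this is where the tracked shared-memory windows would be freed.
    print("freeing shared-memory windows")


atexit.register(cleanup)

# cleanup() still runs when the interpreter exits because of this uncaught error
# (though not if the process is killed outright, e.g. by MPI_Abort or a signal).
raise RuntimeError("simulated failure partway through the run")
```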
Other minor fixes:
Motivation and Context
Some CI builds were breaking with MPI-related errors.