Can someone explain the purpose of `MPI.Init()`? Is it a bad idea to call this inside the `MultiArch` constructor rather than requiring users to write it out? Can we detect in the constructor whether it's already been called, and call it by default if it hasn't?
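For what it's worth, MPI.jl provides `MPI.Initialized()`, so a constructor could call `MPI.Init()` only when needed. A minimal sketch (the `MultiArch` struct here is just a stand-in, not the real definition):

```julia
using MPI

struct MultiArch{C}
    communicator :: C
end

function MultiArch(; communicator = MPI.COMM_WORLD)
    # Only initialize MPI if the user (or another package) hasn't already.
    MPI.Initialized() || MPI.Init()
    return MultiArch(communicator)
end
```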
I guess changing to `MultiArchitecture` makes sense, but I admit I do like `MultiArch`, since we do use `arch` in places and it has become pretty common, I think.
Do we need default ranks? We can always find the number of ranks in any run, so why not just check? Sorry if I am missing something.
To answer the question above, you can see here, but the short version is that it initializes MPI in the current process. I guess this tells each process that it should get ready for the possibility of sharing information; otherwise we probably get an error.
It sounds cleaner to me to hide it in `MultiArch`, but I wonder if there is a good reason why not. Sadly, I don't know it, but am happy to hear more from others.
Just to follow on from what @francispoulin wrote, we normally use `MPI_Init()` at the start and `MPI_Finalize()` at the end. All code that calls MPI needs to be between these statements. To your question about whether you can call `MPI_Init()` within `MultiArch` (or `MultiArchitecture`), I think the question is whether it would ever be useful to use any MPI routines before calling `MultiArch`. For example, you can't find out the local rank until you call `MPI_Init()`. Also, it is good practice to use both `MPI_Init()` and `MPI_Finalize()`, so if `MPI_Init()` is in `MultiArch`, it leaves the question of where `MPI_Finalize()` should be. This might be an argument for having them both outside of `MultiArch`, where they are visible to the user.
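In MPI.jl terms, the pattern @johnryantaylor describes looks roughly like this (a sketch, not Oceananigans code; MPI.jl can also finalize automatically at process exit):

```julia
using MPI

MPI.Init()                     # must come before any other MPI call

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)     # the local rank is only known after Init
nranks = MPI.Comm_size(comm)

# ... build the architecture, grid, model, and run the simulation here ...

MPI.Finalize()                 # no MPI calls may follow this
```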
I agree with @johnryantaylor: even though it might make for shorter code, there are good reasons for keeping `MPI_Init()` and `MPI_Finalize()` in each script. Maybe this should be considered good practice?
It does make sense, so we have to live with that necessary distinction for now. But let's continue to think about how to bring the syntax for distributed and single-process scripts as close together as possible.
@simone-silvestri suggests removing `topology` as an argument from the `MultiArch` constructor. `MultiArch` currently uses `topology` to construct `RankConnectivity` (i.e. the mapping between ranks). But in @simone-silvestri's words, "connectivity is a grid thing".
I'm not 100% sure how to solve it, though. One possibility is to 1) rebuild `MultiArch` in the grid constructor, adding connectivity information. Or we can 2) introduce a wrapper for grids on distributed domains, or 3) add `connectivity` to all grids, set to `nothing` when not distributed. I think 1) is easiest but maybe not the cleanest.
After some discussions I also think "Distributed" is a confusing name for this module. The module is really specific to "multi-process" execution with MPI, rather than distributed-memory execution (though it supports distributed-memory setups). `MultiRegion` also has distributed-memory features.
I think drawing a closer connection to MPI might help, something like:

`MultiArch` -> `MPIArchitecture`
`Distributed` -> `MPIArchitectures`

Or, if we want to be even more explicit, `MPIMultiProcess` or something. But I think "MPI" uniquely implies "multi-process".
(1) should already be implemented (rebuilding `arch` in the grid constructor). I.e., `grid.architecture` is not necessarily equal to the input architecture.
I think the way to go is to specify a partitioning object, which will define how you split the grid. The connectivity itself (i.e. which cores you communicate with) should go in the BCs.

I guess `partition` would take the place of `ranks`.
I think it makes sense to delay building `connectivity` (i.e. we only build connectivity when we build fields and their boundary conditions), because we can construct connectivities if we have 1) the grid and 2) the partition. But we still have to store the partition in either `grid` or `grid.architecture`, correct?
One approach could produce code like
```julia
arch = MPIArchitecture() # note: no arguments
grid = RectilinearGrid(arch, grid_kw..., partition=XPartition())
```
Internally, the constructors might take the form
```julia
function RectilinearGrid(arch, size, other_grid_args...; architecture_kwargs...)
    # rebuild the architecture with any architecture-related keyword arguments
    # (e.g. the partition)
    arch = rebuild_arch(arch; architecture_kwargs...)
    # etc
end
```
That's possibility (1).
Possibility (2) is to use a wrapper for MPI jobs, perhaps completely eliminating `MPIArchitecture` altogether:
```julia
arch = CPU()
grid = RectilinearGrid(arch, kw...)
grid = MultiProcessGrid(grid, partition=XPartition())
```
Possibility (3) is to change all the `struct` definitions for grids so they have additional information pertinent to multi-process stuff, which is set to `nothing` when unused.
The downside to (2) is that it introduces yet another level of indirection, on top of both `ImmersedBoundaryGrid` and `MultiRegionGrid`. If we wanted to do MPI-distributed, multi-region simulations in complex domains, we'd have 4 (!) levels of wrapping. Hmm...
I am in favor of option (3). It will allow us to embed a little more multi-process support in the code. On the other hand, option (1) can also just be

```julia
arch = MPIArchitecture(partition = XPartition())
grid = RectilinearGrid(arch, grid_kw...)
```

and then connectivity will be made explicit when you build boundary conditions.
> It will allow us to embed a little more multi-process support in the code.

Can you explain this point?
To me it looks like the main difference / improvement of (3) is the user interface rather than functionality, because every grid has an `architecture`. Thus in source code it's the difference between `grid.architecture.partition` and `grid.partition`. With a `process_partition(grid)` interface (which we should have...) even that distinction is lost.
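For concreteness, here is a sketch of what a `process_partition(grid)` interface could look like. The `MPIArchitecture` stub and its `partition` field are assumptions for illustration, not existing types:

```julia
using Oceananigans.Grids: AbstractGrid

# Stub standing in for the proposed MPI architecture with a partition field.
struct MPIArchitecture{P}
    partition :: P
end

# Hypothetical accessor: distributed architectures return their partition...
process_partition(arch::MPIArchitecture) = arch.partition

# ...single-process architectures (CPU, GPU, ...) return nothing...
process_partition(arch) = nothing

# ...and grids simply forward to their architecture.
process_partition(grid::AbstractGrid) = process_partition(grid.architecture)
```

With something like this, source code can call `process_partition(grid)` uniformly, and `nothing` signals a single-process run.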
The primary tradeoff against the user interface change is an increase in code complexity. New developers have to wrestle with, and users have to puzzle over, `grid.partition`, even though it's relevant only in a minority of cases (particularly because we are GPU-focused, and are providing features for multi-GPU). For example, with `MultiRegionGrid` we'd have `grid.partition` and `grid.region_grids[1].partition`.
But maybe I am missing something?
I'll convert this to a discussion since there are no immediate action items.
First of all... `MultiArch` to `MultiArchitecture`? It's not a big deal, but `MultiArch` does seem to violate our "verbosity" rules.

Some observations and questions:
- Do we need default `ranks`? Maybe `ranks = (1, MPI.Comm_size(MPI.COMM_WORLD), 1)` or something? Or is this a bad idea? I foresee boilerplate for scripts that are intended to automatically scale...
- Topology is currently needed by both `MultiArch` and `RectilinearGrid`; the grid constructor for `MultiArch` can grab topology from `arch`.
- `MultiArch` should take a similar approach as the grid constructor; i.e. we can't distribute a grid in `Flat` directions, so `ranks` should be a 2-tuple for 2D topology and a 1-element tuple for 1D topology.
- Utilities for `arch` (and also `grid`, `model`, `sim`): `communicator(grid)` returns e.g. `MPI.COMM_WORLD`, and `local_rank(grid)` returns e.g. `MPI.Comm_rank(communicator(grid))` (or maybe just `rank(grid)`, or something). A sketch of these follows the list.
- (This may belong in MPI.jl) but it does seem like it'd be nice to have a macro that a) prints normally for non-MPI and b) prints just from rank 0 if using MPI. That way we only have to change `arch` when switching from single-process runs to distributed runs (a sketch of such a macro follows the list).
- We need `Base.summary(::MultiArch)` so logs don't get mutilated.
- Should `mean` over all dims reduce everything, or should we do a local reduction (as we do now)? Perhaps we want a special `mean` for `DistributedField`, and `mean(interior(field))` can still be used for local reductions? We already have to redefine any reductions on `Field`, so it makes sense that we further extend them for `Field` on `MultiArch` (see the global-mean sketch after this list).
- `ReducedField` across the partition is also hard. We can reduce only locally, which could be fine for output in some cases. But if `ReducedField`s are used in `AbstractOperation`s, we clearly need to gather and scatter for that to work. Probably the right thing is to implement gather and scatter by default, and then to add features for "local `ReducedField`" that could maybe be used to optimize I/O performance for the biggest problems.

Note: a macro / logger manipulation that avoids "extra" logging for distributed simulations might actually be essential, because Oceananigans is pretty chatty:

That's for 2 ranks. Imagine we scale to thousands of ranks...
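As referenced in the list above, here is a rough sketch of the accessors and a rank-0-only logging macro. The names (`communicator`, `local_rank`, `@rootinfo`) and the fallback choices are hypothetical, not existing Oceananigans API:

```julia
using MPI

# Hypothetical accessors: a distributed architecture / grid / model would
# return its own communicator; here we just sketch the MPI case.
communicator(obj) = MPI.COMM_WORLD
local_rank(obj) = MPI.Comm_rank(communicator(obj))

# Hypothetical macro: log normally in single-process runs, and only from
# rank 0 when MPI has been initialized, so N ranks don't print N copies.
macro rootinfo(msg)
    return quote
        if !MPI.Initialized() || MPI.Comm_rank(MPI.COMM_WORLD) == 0
            @info $(esc(msg))
        end
    end
end

# Usage: @rootinfo "Starting time stepping..."
```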
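And for the reduction question: a partition-wide mean could combine local sums and counts with an `Allreduce`. This is a sketch assuming a `DistributedField`-like object whose local data is accessible via `interior`; `global_mean` is a hypothetical name:

```julia
using MPI
using Oceananigans: interior

# Hypothetical global mean across all ranks of the partition:
# reduce the local sum and element count, then divide.
function global_mean(field, comm = MPI.COMM_WORLD)
    local_data = interior(field)
    local_sum = sum(local_data)
    local_count = length(local_data)

    global_sum = MPI.Allreduce(local_sum, +, comm)
    global_count = MPI.Allreduce(local_count, +, comm)

    return global_sum / global_count
end

# A purely local reduction remains available as mean(interior(field)).
```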