Can someone explain the purpose of `MPI.Init()`? Is it a bad idea to call this inside the `MultiArch` constructor rather than requiring users to write it out? Can we detect in the constructor whether it's already been called, and call it by default if it hasn't?
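For what it's worth, MPI.jl provides `MPI.Initialized()`, so a constructor could call `MPI.Init()` only when needed. A minimal sketch (the `MultiArch` struct here is just a stand-in, not the real definition):

```julia
using MPI

struct MultiArch{C}
    communicator :: C
end

function MultiArch(; communicator = MPI.COMM_WORLD)
    # Only initialize MPI if the user (or another package) hasn't already.
    MPI.Initialized() || MPI.Init()
    return MultiArch(communicator)
end
```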
I guess changing to `MultiArchitecture` makes sense, but I admit I do like `MultiArch`, since we do use `arch` in places and it has become pretty common, I think.
Do we need default ranks? We can always find the number of ranks in any run, so why not just check? Sorry if I am missing something.
To answer the question above, you can see here, but the short version is that it initializes MPI in the current process. I guess this tells each process that it should get ready for the possibility of sharing information; otherwise we probably get an error.
It sounds cleaner to me to hide it in `MultiArch`, but I wonder if there is a good reason why not. Sadly, I don't know it, but am happy to hear more from others.
Just to follow on from what @francispoulin wrote, we normally use `MPI_Init()` at the start and `MPI_Finalize()` at the end. All code that calls MPI needs to be between these statements. To your question about whether you can call `MPI_Init()` within `MultiArch` (or `MultiArchitecture`), I think the question is whether it would ever be useful to use any MPI routines before calling `MultiArch`. For example, you can't find out the local rank until you call `MPI_Init()`. Also, it is good practice to use both `MPI_Init()` and `MPI_Finalize()`, so if `MPI_Init()` is in `MultiArch`, it leaves the question of where `MPI_Finalize()` should be. This might be an argument for having them both outside of `MultiArch`, where they are visible to the user.
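In MPI.jl terms, the pattern @johnryantaylor describes looks roughly like this (a sketch, not Oceananigans code; MPI.jl can also finalize automatically at process exit):

```julia
using MPI

MPI.Init()                     # must come before any other MPI call

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)     # the local rank is only known after Init
nranks = MPI.Comm_size(comm)

# ... build the architecture, grid, model, and run the simulation here ...

MPI.Finalize()                 # no MPI calls may follow this
```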
I agree with @johnryantaylor: even though it might make for shorter code, there are good reasons for keeping `MPI_Init()` and `MPI_Finalize()` in each script. Maybe this should be considered good practice?
It does make sense, so we have to live with that necessary distinction for now. But let's continue to think about how to bring the syntax for distributed and single-process scripts as close together as possible.
@simone-silvestri suggests removing `topology` as an argument from the `MultiArch` constructor. `MultiArch` currently uses `topology` to construct `RankConnectivity` (i.e. the mapping between ranks). But in @simone-silvestri's words, "connectivity is a grid thing".
I'm not 100% sure how to solve it, though. One possibility is to 1) rebuild `MultiArch` in the grid constructor, adding connectivity information. Or we can 2) introduce a wrapper for grids on distributed domains, or 3) add `connectivity` to all grids, set to `nothing` when not distributed. I think 1) is easiest but maybe not the cleanest.
After some discussions I also think "Distributed" is a confusing name for this module. The module is really specific to "multi-process" execution with MPI, rather than distributed-memory execution (though it supports distributed-memory setups). `MultiRegion` also has distributed-memory features.
I think drawing a closer connection to MPI might help, something like:

`MultiArch` -> `MPIArchitecture`
`Distributed` -> `MPIArchitectures`

Or, if we want to be even more explicit, `MPIMultiProcess` or something. But I think "MPI" uniquely implies "multi-process".
(1) should already be implemented (rebuilding `arch` in the grid constructor). I.e., `grid.architecture` is not necessarily equal to the input architecture.
I think the way to go is to specify a partitioning object, which will define how you split the grid. The connectivity itself (i.e. which cores you communicate with) should go in the BCs.

I guess `partition` would take the place of `ranks`.
I think it makes sense to delay building `connectivity` (i.e. we only build connectivity when we build fields and their boundary conditions), because we can construct connectivities if we have 1) the grid and 2) the partition. But we still have to store the partition in either `grid` or `grid.architecture`, correct?
One approach could produce code like
```julia
arch = MPIArchitecture() # note: no arguments
grid = RectilinearGrid(arch, grid_kw..., partition=XPartition())
```
Internally, the constructors might take the form
```julia
function RectilinearGrid(arch, size, other_grid_args...; architecture_kwargs...)
    # rebuild the architecture with any architecture-related keyword arguments
    # (e.g. the partition)
    arch = rebuild_arch(arch; architecture_kwargs...)
    # etc
end
```
That's possibility (1).
Possibility (2) is to use a wrapper for MPI jobs, perhaps completely eliminating `MPIArchitecture` altogether:
```julia
arch = CPU()
grid = RectilinearGrid(arch, kw...)
grid = MultiProcessGrid(grid, partition=XPartition())
```
Possibility (3) is to change all the `struct` definitions for grids so they have additional information pertinent to multi-process stuff, which is set to `nothing` when unused.
The downside to (2) is that it introduces yet another level of indirection, on top of both `ImmersedBoundaryGrid` and `MultiRegionGrid`. If we wanted to do MPI-distributed, multi-region simulations in complex domains, we'd have 4 (!) levels of wrapping. Hmm...
I am in favor of option (3). It will allow us to embed a little more multi-process support in the code. On the other hand, option (1) can also just be

```julia
arch = MPIArchitecture(partition = XPartition())
grid = RectilinearGrid(arch, grid_kw...)
```

and then connectivity will be made explicit when you build boundary conditions.
> It will allow us to embed a little more multi-process support in the code.

Can you explain this point?
To me it looks like the main difference / improvement of (3) is the user interface rather than functionality, because every grid has an `architecture`. Thus in source code it's the difference between `grid.architecture.partition` and `grid.partition`. With a `process_partition(grid)` interface (which we should have...) even that distinction is lost.
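For concreteness, here is a sketch of what a `process_partition(grid)` interface could look like. The `MPIArchitecture` stub and its `partition` field are assumptions for illustration, not existing types:

```julia
using Oceananigans.Grids: AbstractGrid

# Stub standing in for the proposed MPI architecture with a partition field.
struct MPIArchitecture{P}
    partition :: P
end

# Hypothetical accessor: distributed architectures return their partition...
process_partition(arch::MPIArchitecture) = arch.partition

# ...single-process architectures (CPU, GPU, ...) return nothing...
process_partition(arch) = nothing

# ...and grids simply forward to their architecture.
process_partition(grid::AbstractGrid) = process_partition(grid.architecture)
```

With something like this, source code can call `process_partition(grid)` uniformly, and `nothing` signals a single-process run.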
The primary tradeoff against the user interface change is an increase in code complexity. New developers have to wrestle with, and users have to puzzle over, `grid.partition`, even though it's relevant only in a minority of cases (particularly because we are GPU-focused, and are providing features for multi-GPU). For example, with `MultiRegionGrid` we'd have `grid.partition` and `grid.region_grids[1].partition`.
But maybe I am missing something?
I'll convert this to a discussion since there are no immediate action items.
First of all... `MultiArch` to `MultiArchitecture`? It's not a big deal, but `MultiArch` does seem to violate our "verbosity" rules.

Some observations and questions:
- Do we need default `ranks`? Maybe `ranks = (1, MPI.Comm_size(MPI.COMM_WORLD), 1)` or something? Or is this a bad idea? I foresee boilerplate for scripts that are intended to automatically scale...
- Topology is currently needed by both `MultiArch` and `RectilinearGrid`; the grid constructor for `MultiArch` can grab topology from `arch`.
- `MultiArch` should take a similar approach as the grid constructor; i.e. we can't distribute a grid in `Flat` directions, so `ranks` should be a 2-tuple for 2D topology and a 1-element tuple for 1D topology.
- Utilities for `arch` (and also `grid`, `model`, `sim`): `communicator(grid)` returns e.g. `MPI.COMM_WORLD`, and `local_rank(grid)` returns e.g. `MPI.Comm_rank(communicator(grid))` (or maybe just `rank(grid)`, or something). A sketch of these follows the list.
- (This may belong in MPI.jl) but it does seem like it'd be nice to have a macro that a) prints normally for non-MPI and b) prints just from rank 0 if using MPI. That way we only have to change `arch` when switching from single-process runs to distributed runs (a sketch of such a macro follows the list).
- We need `Base.summary(::MultiArch)` so logs don't get mutilated.
- Should `mean` over all dims reduce everything, or should we do a local reduction (as we do now)? Perhaps we want a special `mean` for `DistributedField`, and `mean(interior(field))` can still be used for local reductions? We already have to redefine any reductions on `Field`, so it makes sense that we further extend them for `Field` on `MultiArch` (see the global-mean sketch after this list).
- `ReducedField` across the partition is also hard. We can reduce only locally, which could be fine for output in some cases. But if `ReducedField`s are used in `AbstractOperation`s, we clearly need to gather and scatter for that to work. Probably the right thing is to implement gather and scatter by default, and then to add features for "local `ReducedField`" that could maybe be used to optimize I/O performance for the biggest problems.

Note: a macro / logger manipulation that avoids "extra" logging for distributed simulations might actually be essential, because Oceananigans is pretty chatty:

That's for 2 ranks. Imagine we scale to thousands of ranks...
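As referenced in the list above, here is a rough sketch of the accessors and a rank-0-only logging macro. The names (`communicator`, `local_rank`, `@rootinfo`) and the fallback choices are hypothetical, not existing Oceananigans API:

```julia
using MPI

# Hypothetical accessors: a distributed architecture / grid / model would
# return its own communicator; here we just sketch the MPI case.
communicator(obj) = MPI.COMM_WORLD
local_rank(obj) = MPI.Comm_rank(communicator(obj))

# Hypothetical macro: log normally in single-process runs, and only from
# rank 0 when MPI has been initialized, so N ranks don't print N copies.
macro rootinfo(msg)
    return quote
        if !MPI.Initialized() || MPI.Comm_rank(MPI.COMM_WORLD) == 0
            @info $(esc(msg))
        end
    end
end

# Usage: @rootinfo "Starting time stepping..."
```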
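And for the reduction question: a partition-wide mean could combine local sums and counts with an `Allreduce`. This is a sketch assuming a `DistributedField`-like object whose local data is accessible via `interior`; `global_mean` is a hypothetical name:

```julia
using MPI
using Oceananigans: interior

# Hypothetical global mean across all ranks of the partition:
# reduce the local sum and element count, then divide.
function global_mean(field, comm = MPI.COMM_WORLD)
    local_data = interior(field)
    local_sum = sum(local_data)
    local_count = length(local_data)

    global_sum = MPI.Allreduce(local_sum, +, comm)
    global_count = MPI.Allreduce(local_count, +, comm)

    return global_sum / global_count
end

# A purely local reduction remains available as mean(interior(field)).
```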