CliMA / ClimaCoupler.jl

ClimaCoupler: bringing atmosphere, land, and ocean together

O5.2.1 (coupler) Distributed unstructured sparse CPU & GPU matrix-vector multiply, supporting mixed precision #188

Open juliasloan25 opened 1 year ago

juliasloan25 commented 1 year ago

Purpose

We want to implement distributed matrix multiplication to enable parallel online regridding in the coupler.

Cost/Benefits/Risks

People and Personnel

Components

There are two major steps involved in regridding:

  1. Initialization of the sparse weight matrix
     a. This will be done on one process (without MPI) using TempestRemap.
  2. Construction of the target data via multiplication of the source data by the weight matrix
     a. This will be done using MPI, and is what we must implement here.

We can divide the process into more granular steps:

  1. Generate the weight matrix
     a. This will be done on one process using serial spaces. This is the only place in this implementation where serial spaces are used.
  2. Distribute the weights to the responsible processes (see the sketch after this list)
     a. Use MPI.Scatterv to send information from the root process to all processes (note that processes may be responsible for different numbers of weights, so we can't use MPI.Scatter).
     b. We can't exchange a sparse matrix object directly, so instead exchange 3 arrays: nonzero values, row indices, and column offsets.
     c. We may also need the root process to send each process the number of nonzero weights it is responsible for.
  3. Exchange the source data
     a. At first, send all source data to all processes.
     b. In the next iteration, determine which process needs each index of the source data, and send the data only to the correct process.
     c. Eventually, maximize the number of cases where corresponding source and target data are stored on the same process; no information exchange is needed in those cases, which minimizes communication.
  4. Perform the remapping
     a. Multiplication on the send side: calculate local row dot products and remote row dot products.
     b. Use the ClimaComms graph context to exchange the remote dot products and add them to the previously computed local sums.
  5. Project the remapped data back onto a ClimaCore space?
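A minimal sketch of step 2, using plain MPI.jl (v0.20-style keyword API) and SparseArrays rather than the actual ClimaCore types. For brevity it scatters COO triplets (values, row indices, column indices) instead of the CSC arrays in step 2b, and `owner_of_row` is a made-up round-robin assignment of target rows to ranks:

```julia
using MPI, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

# assumption (illustrative only): target rows are dealt out round-robin over the ranks
owner_of_row(i) = (i - 1) % nprocs

if rank == 0
    W = sprand(12, 8, 0.3)                 # stand-in for the TempestRemap weight matrix
    rows, cols, vals = findnz(W)
    order = sortperm(owner_of_row.(rows))  # group the nonzeros by owning rank
    rows, cols, vals = rows[order], cols[order], vals[order]
    counts = [count(==(r), owner_of_row.(rows)) for r in 0:nprocs-1]
else
    rows = cols = Int[]; vals = Float64[]
    counts = Int[]
end

# every rank needs to know how many nonzero weights it will receive (step 2c)
counts = MPI.bcast(counts, comm; root = 0)
nlocal = counts[rank + 1]

my_rows = Vector{Int}(undef, nlocal)
my_cols = Vector{Int}(undef, nlocal)
my_vals = Vector{Float64}(undef, nlocal)
MPI.Scatterv!(rank == 0 ? MPI.VBuffer(rows, counts) : nothing, my_rows, comm; root = 0)
MPI.Scatterv!(rank == 0 ? MPI.VBuffer(cols, counts) : nothing, my_cols, comm; root = 0)
MPI.Scatterv!(rank == 0 ? MPI.VBuffer(vals, counts) : nothing, my_vals, comm; root = 0)
```

Run with e.g. `mpiexec -n 2 julia script.jl`; Scatterv (rather than Scatter) handles the varying number of weights per rank.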

Implementation phases

1. Initial implementation of distributed regridding - DONE

2. Second implementation - store only local information in each process's LinearMap

3. Optimized implementation using only the necessary source data

Inputs

Results and deliverables

Functions for distributed matrix multiplication in ClimaCoreTempestRemap, and tests for these functions.

Tests include:

Current status

As of ClimaCore PR #1107 (distributed regridding v1), we are able to regrid from serial spaces to distributed spaces. Note that this regridding only works when the source and target meshes are collocated.

ClimaCore PR #1192 cleans up this implementation a bit by storing local indices in the LinearMap object itself (which is constructed only once), rather than computing them in the remap! function (which gets called multiple times). Also see https://github.com/CliMA/ClimaCore.jl/issues/1195 for more information.
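To illustrate that design change with hypothetical names (this is not the real ClimaCoreTempestRemap.LinearMap), a mock map object that resolves local indices once at construction and reuses them on every remap call might look like:

```julia
struct MockLinearMap{W, I}
    weights::W              # nonzero weight values
    source_local_idxs::I    # precomputed local indices into the source field data
    target_local_idxs::I    # precomputed local indices into the target field data
end

function mock_remap!(target::AbstractVector, lm::MockLinearMap, source::AbstractVector)
    fill!(target, 0)
    # no index computation here: the indices were resolved when the map was built
    for (w, is, it) in zip(lm.weights, lm.source_local_idxs, lm.target_local_idxs)
        target[it] += w * source[is]
    end
    return target
end
```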

A concrete example of the distributed regridding is partially implemented in ClimaCore PR #1259. This has been tested on 2 processes when remapping from 2 to 3 elements, and appears correct when compared to the serial regridding results. Future work could test this implementation with more than 2 processes and with more elements than just 2 -> 3.

The next steps are to rework our implementation so that each process uses only its local information and communicated information to perform the remapping. This differs from distributed regridding v1, which does most of the work on the root process and then broadcasts it. Some of the logic for this distributed approach using MPI can be found in the concrete example, such as performing the weight/source data multiplication on the source side, exchanging these products, and then recombining them on the receive side (sketched below). This next implementation should allow us to perform regridding from a distributed source space to a distributed target space.
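A self-contained sketch of that send-side strategy, with illustrative sizes and a round-robin column ownership rule. Every rank builds the same seeded random weight matrix only so the example runs standalone, and a dense Allreduce stands in for the targeted exchange of nonzero products:

```julia
using MPI, Random, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

n_target, n_source = 16, 12                              # illustrative global sizes
W = sprand(MersenneTwister(0), n_target, n_source, 0.3)  # identical on every rank (sketch only)
x = collect(1.0:n_source)                                # global source data, also replicated here

# source columns this rank is responsible for (round-robin, for illustration)
my_cols = [j for j in 1:n_source if (j - 1) % nprocs == rank]

# send-side multiply: partial products from locally owned source columns only
y_partial = zeros(n_target)
for j in my_cols, k in nzrange(W, j)
    y_partial[rowvals(W)[k]] += nonzeros(W)[k] * x[j]
end

# exchange the products and recombine: summing the partials reproduces W * x
y = MPI.Allreduce(y_partial, +, comm)
@assert y ≈ W * x
```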

Task Breakdown And Tentative Due Date

SDI Revision Log

simonbyrne commented 1 year ago

Write a function which takes a space and gives you an object which lets you map both:

a. the TempestRemap index (tidx) to the set of (global node index (gidx), i, j) tuples which correspond to that tidx
b. the (global node index, i, j) tuple to the TempestRemap index

b. can just be stored in an Nq*Nq*Nelem array; a. will need some sort of ragged array structure (or you could initially use a Dict of arrays until you figure it out).
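One possible shape for such an object, with hypothetical names; the real version would be built from a ClimaCore space rather than from a precomputed node-to-tidx array:

```julia
struct TempestIndexMap
    # a. tidx -> all (gidx, i, j) nodes sharing that tidx; ragged, so a Vector of
    #    Vectors (a Dict{Int,Vector{NTuple{3,Int}}} would work equally well at first)
    tidx_to_nodes::Vector{Vector{NTuple{3,Int}}}
    # b. (i, j, gidx) -> tidx, stored densely in an Nq x Nq x Nelem array
    node_to_tidx::Array{Int,3}
end

function TempestIndexMap(node_to_tidx::Array{Int,3})
    tidx_to_nodes = [NTuple{3,Int}[] for _ in 1:maximum(node_to_tidx)]
    Nq, _, Nelem = size(node_to_tidx)
    for gidx in 1:Nelem, j in 1:Nq, i in 1:Nq
        push!(tidx_to_nodes[node_to_tidx[i, j, gidx]], (gidx, i, j))
    end
    return TempestIndexMap(tidx_to_nodes, node_to_tidx)
end
```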

simonbyrne commented 1 year ago

To do the full distributed remap (see the sketch after these steps):

  1. Copy local -> remote values to send buffers
  2. start communication
  3. do local -> local matrix multiply
  4. end communication
  5. do remote -> local matrix multiply
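A runnable skeleton of these five steps, using plain MPI.jl nonblocking calls in place of the ClimaComms graph context. The weight matrix is replicated via a seeded RNG and ownership is round-robin purely so the sketch is self-contained; run on 2 or more ranks:

```julia
using MPI, Random, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

n_target, n_source = 16, 12
W = sprand(MersenneTwister(0), n_target, n_source, 0.3)  # same on every rank, sketch only
owner(k) = (k - 1) % nprocs                              # made-up ownership rule
my_rows = filter(i -> owner(i) == rank, 1:n_target)
my_cols = filter(j -> owner(j) == rank, 1:n_source)
x_local = Float64[j for j in my_cols]                    # made-up local source data: x[j] = j

# 1. & 2. copy local values into send buffers and start communication
#    (first-iteration strategy: every other rank gets all of this rank's source data)
send_reqs = [MPI.Isend(copy(x_local), comm; dest = r, tag = 0) for r in 0:nprocs-1 if r != rank]
recv_bufs = Dict(r => zeros(count(j -> owner(j) == r, 1:n_source)) for r in 0:nprocs-1 if r != rank)
recv_reqs = [MPI.Irecv!(recv_bufs[r], comm; source = r, tag = 0) for r in 0:nprocs-1 if r != rank]

# 3. local -> local matrix multiply while messages are in flight
y = zeros(length(my_rows))
row_of = Dict(g => l for (l, g) in enumerate(my_rows))
for (jl, j) in enumerate(my_cols), k in nzrange(W, j)
    i = rowvals(W)[k]
    haskey(row_of, i) && (y[row_of[i]] += nonzeros(W)[k] * x_local[jl])
end

# 4. end communication
MPI.Waitall(vcat(send_reqs, recv_reqs))

# 5. remote -> local matrix multiply using the received source values
for r in 0:nprocs-1
    r == rank && continue
    cols_r = filter(j -> owner(j) == r, 1:n_source)
    for (jl, j) in enumerate(cols_r), k in nzrange(W, j)
        i = rowvals(W)[k]
        haskey(row_of, i) && (y[row_of[i]] += nonzeros(W)[k] * recv_bufs[r][jl])
    end
end
# y now holds (W * x)[my_rows], where x[j] = j
```
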
LenkaNovak commented 1 year ago

Thanks for the revision, Julia! The plan looks great! It may not be a huge leap to include the data layout optimization to reduce the need for communication (i.e., set up the MPI distribution so that points from the same region of the target and source grids are looked after by the same PID), but maybe let's revisit this once we have the local-to-local map. 🚀

LenkaNovak commented 7 months ago

@sriharshakandala I've just changed the title of this to be consistent with the OKRs. Please feel free to modify the content, once you have a chance to take this over.

LenkaNovak commented 6 months ago

@Sbozzolo here is the issue that @juliasloan25 started, and @sriharshakandala agreed to take over in this Q. It is a bit out of date, so it might be more efficient to catch up offline. Hopefully we can share the regridding infrastructure when reading files and regridding model fields! It would be great to have your thoughts on this! 🙏

Sbozzolo commented 6 months ago

Scattered thoughts:

Do you have a sense of what the interface will look like?

juliasloan25 commented 6 months ago
> Right now, Land is fully using ClimaCoreTempestRemap, from generating weights (remap_weights) to applying them (apply_remap).

> Do you have a sense of what the interface will look like?

FWIW, ClimaCoupler also currently uses CCTR to apply the weights (see apply_remap call here).

I didn't make a plan for the interface - most of the work that was done so far was prototyping to try to get something working first.

LenkaNovak commented 6 months ago

Note that hdwrite_regridfile_rll_to_cgll (which uses TR's apply_remap) should only be used for lightweight input data, for example regridding stationary or infrequently updated (e.g., monthly) files, like we are doing in the coupler and used to do in land. I agree that for ILAMB (or remapping model fields on the fly) this is no longer sufficient.

For remapping fields on the fly (not done in our current AMIP, but we will need this for coupling with ClimaOcean), the plan is:

For ILAMB there are two possible pathways

Sbozzolo commented 6 months ago

I think we eventually want to follow the same steps you outlined for remapping on the fly in Land (at the moment the non-conservative remapping goes from spectral to rectangular, not the other way around).

I don't think we need this super soon. Getting to the point where we are reading everything from data while doing interesting global runs is not around the corner (but it is our target).

Given this, it would be good to have a description of what we envision the capabilities of remap! to be. What I think I would like is something like remap(weights, input_data) -> remapped_field, with MPI/GPU compatibility, where input_data is a rectangular array read from file (e.g., surface albedo). This would allow ClimaLand to spawn a different thread to keep reading input_data and remap it when needed.
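For concreteness, a hypothetical serial, CPU-only stand-in for that call (none of these names exist yet; the real version would return a ClimaCore field on a possibly distributed target space):

```julia
using SparseArrays

"""
    remap(weights, input_data) -> remapped_values

Apply precomputed regridding `weights` to `input_data` read from file
(e.g. surface albedo on a lat-lon grid). The real version would return a
ClimaCore field on a (possibly distributed) target space and work on GPU arrays.
"""
remap(weights::SparseMatrixCSC, input_data::AbstractArray) = weights * vec(input_data)
```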

LenkaNovak commented 6 months ago

Yeah, that's more or less what the plan was. :) And good to know the priorities and more use cases for this. @sriharshakandala I'll let you drive it from here.