CliMA / ClimaCoupler.jl

ClimaCoupler: bringing atmosphere, land, and ocean together

O5.2.1 (coupler) Distributed unstructured sparse CPU & GPU matrix-vector multiply, supporting mixed precision #188

Open juliasloan25 opened 1 year ago

juliasloan25 commented 1 year ago

Purpose

We want to implement distributed matrix multiplication to enable parallel online regridding in the coupler.

Cost/Benefits/Risks

People and Personnel

Components

There are two major steps involved in regridding:

  1. Initialization of the sparse weight matrix
     a. This will be done on one process (without MPI) using TempestRemap.
  2. Construction of the target data via multiplication of the source data by the weight matrix
     a. This will be done using MPI, and is what we must implement here.

We can divide the process into more granular steps:

  1. Generate the weight matrix
     a. This will be done on one process using serial spaces. This is the only place in this implementation where serial spaces are used.
  2. Distribute the weights to the responsible processes (see the sketch after this list)
     a. Use MPI.Scatterv to send information from the root process to all processes (note that processes may be responsible for different numbers of weights, so we can't use MPI.Scatter).
     b. We can't exchange a sparse matrix object directly, so instead exchange 3 arrays: nonzero values, row indices, and column offsets.
     c. We may also need the root process to send each process the number of nonzero weights it is responsible for.
  3. Exchange the source data
     a. At first, send all source data to all processes.
     b. In the next iteration, determine which process needs each index of the source data, and send the data only to the correct process.
     c. Eventually, maximize the number of cases where corresponding source and target data are stored on the same process; no information exchange is needed in those cases, which minimizes communication.
  4. Perform the remapping
     a. Multiplication on the send side: calculate local row dot products and remote row dot products.
     b. Use the ClimaComms graph context to exchange the remote dot products and add them to the previously computed local sums.
  5. Project the remapped data back onto a ClimaCore space?
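A minimal sketch of step 2, using plain MPI.jl (v0.20-style keyword API) and SparseArrays rather than the actual ClimaCore types. For brevity it scatters COO triplets (values, row indices, column indices) instead of the CSC arrays in step 2b, and `owner_of_row` is a made-up round-robin assignment of target rows to ranks:

```julia
using MPI, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

# assumption (illustrative only): target rows are dealt out round-robin over the ranks
owner_of_row(i) = (i - 1) % nprocs

if rank == 0
    W = sprand(12, 8, 0.3)                 # stand-in for the TempestRemap weight matrix
    rows, cols, vals = findnz(W)
    order = sortperm(owner_of_row.(rows))  # group the nonzeros by owning rank
    rows, cols, vals = rows[order], cols[order], vals[order]
    counts = [count(==(r), owner_of_row.(rows)) for r in 0:nprocs-1]
else
    rows = cols = Int[]; vals = Float64[]
    counts = Int[]
end

# every rank needs to know how many nonzero weights it will receive (step 2c)
counts = MPI.bcast(counts, comm; root = 0)
nlocal = counts[rank + 1]

my_rows = Vector{Int}(undef, nlocal)
my_cols = Vector{Int}(undef, nlocal)
my_vals = Vector{Float64}(undef, nlocal)
MPI.Scatterv!(rank == 0 ? MPI.VBuffer(rows, counts) : nothing, my_rows, comm; root = 0)
MPI.Scatterv!(rank == 0 ? MPI.VBuffer(cols, counts) : nothing, my_cols, comm; root = 0)
MPI.Scatterv!(rank == 0 ? MPI.VBuffer(vals, counts) : nothing, my_vals, comm; root = 0)
```

Run with e.g. `mpiexec -n 2 julia script.jl`; Scatterv (rather than Scatter) handles the varying number of weights per rank.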

Implementation phases

1. Initial implementation of distributed regridding - DONE

2. Second implementation - store only local information in each process's LinearMap

3. Optimized implementation using only the necessary source data

Inputs

Results and deliverables

Functions for distributed matrix multiplication in ClimaCoreTempestRemap, and tests for these functions.

Tests include:

Current status

As of ClimaCore PR #1107 (distributed regridding v1), we are able to regrid from serial spaces to distributed spaces. Note that this regridding only works when the source and target meshes are collocated.

ClimaCore PR #1192 cleans up this implementation a bit by storing local indices in the LinearMap object itself (which is constructed only once), rather than computing them in the remap! function (which gets called multiple times). Also see https://github.com/CliMA/ClimaCore.jl/issues/1195 for more information.
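To illustrate that design change with hypothetical names (this is not the real ClimaCoreTempestRemap.LinearMap), a mock map object that resolves local indices once at construction and reuses them on every remap call might look like:

```julia
struct MockLinearMap{W, I}
    weights::W              # nonzero weight values
    source_local_idxs::I    # precomputed local indices into the source field data
    target_local_idxs::I    # precomputed local indices into the target field data
end

function mock_remap!(target::AbstractVector, lm::MockLinearMap, source::AbstractVector)
    fill!(target, 0)
    # no index computation here: the indices were resolved when the map was built
    for (w, is, it) in zip(lm.weights, lm.source_local_idxs, lm.target_local_idxs)
        target[it] += w * source[is]
    end
    return target
end
```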

A concrete example of the distributed regridding is partially implemented in ClimaCore PR #1259. This has been tested on 2 processes when remapping from 2 to 3 elements, and appears correct when compared to the serial regridding results. Future work could test this implementation with more than 2 processes and with more elements than just 2 -> 3.

The next steps are to rework our implementation so that each process uses only its local information and communicated information to perform the remapping. This differs from distributed regridding v1, which does most of the work on the root process and then broadcasts it. Some of the logic for this distributed approach using MPI can be found in the concrete example, such as performing the weight/source data multiplication on the source side, exchanging these products, and then recombining them on the receive side (sketched below). This next implementation should allow us to perform regridding from a distributed source space to a distributed target space.
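A self-contained sketch of that send-side strategy, with illustrative sizes and a round-robin column ownership rule. Every rank builds the same seeded random weight matrix only so the example runs standalone, and a dense Allreduce stands in for the targeted exchange of nonzero products:

```julia
using MPI, Random, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

n_target, n_source = 16, 12                              # illustrative global sizes
W = sprand(MersenneTwister(0), n_target, n_source, 0.3)  # identical on every rank (sketch only)
x = collect(1.0:n_source)                                # global source data, also replicated here

# source columns this rank is responsible for (round-robin, for illustration)
my_cols = [j for j in 1:n_source if (j - 1) % nprocs == rank]

# send-side multiply: partial products from locally owned source columns only
y_partial = zeros(n_target)
for j in my_cols, k in nzrange(W, j)
    y_partial[rowvals(W)[k]] += nonzeros(W)[k] * x[j]
end

# exchange the products and recombine: summing the partials reproduces W * x
y = MPI.Allreduce(y_partial, +, comm)
@assert y ≈ W * x
```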

Task Breakdown And Tentative Due Date

SDI Revision Log

simonbyrne commented 1 year ago

Write a function which takes a space and gives you an object which lets you map both:

a. the TempestRemap index (tidx) to the set of (global node index (gidx), i, j) tuples which correspond to that tidx
b. the (global node index, i, j) tuple to the TempestRemap index

b. can just be stored in an Nq*Nq*Nelem array; a. will need some sort of ragged array structure (or you could initially use a Dict of arrays until you figure it out).
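One possible shape for such an object, with hypothetical names; the real version would be built from a ClimaCore space rather than from a precomputed node-to-tidx array:

```julia
struct TempestIndexMap
    # a. tidx -> all (gidx, i, j) nodes sharing that tidx; ragged, so a Vector of
    #    Vectors (a Dict{Int,Vector{NTuple{3,Int}}} would work equally well at first)
    tidx_to_nodes::Vector{Vector{NTuple{3,Int}}}
    # b. (i, j, gidx) -> tidx, stored densely in an Nq x Nq x Nelem array
    node_to_tidx::Array{Int,3}
end

function TempestIndexMap(node_to_tidx::Array{Int,3})
    tidx_to_nodes = [NTuple{3,Int}[] for _ in 1:maximum(node_to_tidx)]
    Nq, _, Nelem = size(node_to_tidx)
    for gidx in 1:Nelem, j in 1:Nq, i in 1:Nq
        push!(tidx_to_nodes[node_to_tidx[i, j, gidx]], (gidx, i, j))
    end
    return TempestIndexMap(tidx_to_nodes, node_to_tidx)
end
```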

simonbyrne commented 1 year ago

To do the full distributed remap (see the sketch after these steps):

  1. Copy local -> remote values to send buffers
  2. start communication
  3. do local -> local matrix multiply
  4. end communication
  5. do remote -> local matrix multiply
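A runnable skeleton of these five steps, using plain MPI.jl nonblocking calls in place of the ClimaComms graph context. The weight matrix is replicated via a seeded RNG and ownership is round-robin purely so the sketch is self-contained; run on 2 or more ranks:

```julia
using MPI, Random, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

n_target, n_source = 16, 12
W = sprand(MersenneTwister(0), n_target, n_source, 0.3)  # same on every rank, sketch only
owner(k) = (k - 1) % nprocs                              # made-up ownership rule
my_rows = filter(i -> owner(i) == rank, 1:n_target)
my_cols = filter(j -> owner(j) == rank, 1:n_source)
x_local = Float64[j for j in my_cols]                    # made-up local source data: x[j] = j

# 1. & 2. copy local values into send buffers and start communication
#    (first-iteration strategy: every other rank gets all of this rank's source data)
send_reqs = [MPI.Isend(copy(x_local), comm; dest = r, tag = 0) for r in 0:nprocs-1 if r != rank]
recv_bufs = Dict(r => zeros(count(j -> owner(j) == r, 1:n_source)) for r in 0:nprocs-1 if r != rank)
recv_reqs = [MPI.Irecv!(recv_bufs[r], comm; source = r, tag = 0) for r in 0:nprocs-1 if r != rank]

# 3. local -> local matrix multiply while messages are in flight
y = zeros(length(my_rows))
row_of = Dict(g => l for (l, g) in enumerate(my_rows))
for (jl, j) in enumerate(my_cols), k in nzrange(W, j)
    i = rowvals(W)[k]
    haskey(row_of, i) && (y[row_of[i]] += nonzeros(W)[k] * x_local[jl])
end

# 4. end communication
MPI.Waitall(vcat(send_reqs, recv_reqs))

# 5. remote -> local matrix multiply using the received source values
for r in 0:nprocs-1
    r == rank && continue
    cols_r = filter(j -> owner(j) == r, 1:n_source)
    for (jl, j) in enumerate(cols_r), k in nzrange(W, j)
        i = rowvals(W)[k]
        haskey(row_of, i) && (y[row_of[i]] += nonzeros(W)[k] * recv_bufs[r][jl])
    end
end
# y now holds (W * x)[my_rows], where x[j] = j
```
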
LenkaNovak commented 1 year ago

Thanks for the revision, Julia! The plan looks great! It may not be a huge leap to include the data layout optimization to reduce the need for communication (i.e., set up the MPI distribution so that points from the same region of the target and source grids are looked after by the same PID), but maybe let's revisit this once we have the local-to-local map. 🚀

LenkaNovak commented 7 months ago

@sriharshakandala I've just changed the title of this to be consistent with the OKRs. Please feel free to modify the content, once you have a chance to take this over.

LenkaNovak commented 6 months ago

@Sbozzolo here is the issue that @juliasloan25 started, and @sriharshakandala agreed to take over in this Q. It is a bit out of date, so it might be more efficient to catch up offline. Hopefully we can share the regridding infrastructure when reading files and regridding model fields! It would be great to have your thoughts on this! 🙏

Sbozzolo commented 6 months ago

Scattered thoughts:

Do you have a sense of what the interface will look like?

juliasloan25 commented 6 months ago
> Right now, Land is fully using ClimaCoreTempestRemap, from generating weights (remap_weights) to applying them (apply_remap).

> Do you have a sense of what the interface will look like?

FWIW, ClimaCoupler also currently uses CCTR to apply the weights (see apply_remap call here).

I didn't make a plan for the interface - most of the work that was done so far was prototyping to try to get something working first.

LenkaNovak commented 6 months ago

Note that hdwrite_regridfile_rll_to_cgll (which uses TR's apply_remap) should only be used for lightweight input data, for example regridding stationary or infrequently updated (e.g., monthly) files, like we are doing in the coupler and used to do in land. I agree that for ILAMB (or remapping model fields on the fly) this is no longer sufficient.

For remapping fields on the fly (not done in our current AMIP, but we will need this for coupling with ClimaOcean), the plan is:

For ILAMB there are two possible pathways

Sbozzolo commented 6 months ago

I think we eventually want to follow the same steps you outlined for remapping on the fly in Land (at the moment the non-conservative remapping goes from spectral to rectangular, not the other way around).

I don't think we need this super soon. Getting to the point where we are reading everything from data while doing interesting global runs is not around the corner (but it is our target).

Given this, it would be good to have a description of what we envision the capabilities of remap! to be. What I think I would like is something like remap(weights, input_data) -> remapped_field, with MPI/GPU compatibility, where input_data is a rectangular array read from file (e.g., surface albedo). This would allow ClimaLand to spawn a different thread to keep reading input_data and remap it when needed.
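For concreteness, a hypothetical serial, CPU-only stand-in for that call (none of these names exist yet; the real version would return a ClimaCore field on a possibly distributed target space):

```julia
using SparseArrays

"""
    remap(weights, input_data) -> remapped_values

Apply precomputed regridding `weights` to `input_data` read from file
(e.g. surface albedo on a lat-lon grid). The real version would return a
ClimaCore field on a (possibly distributed) target space and work on GPU arrays.
"""
remap(weights::SparseMatrixCSC, input_data::AbstractArray) = weights * vec(input_data)
```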

LenkaNovak commented 6 months ago

Yeah, that's more or less what the plan was. :) And good to know the priorities and more use cases for this. @sriharshakandala I'll let you drive it from here.