firedrakeproject / asQ

A library for implementing ParaDIAG timestepping algorithms using Firedrake
MIT License
4 stars 1 forks source link

`ManualEnsemble` class for specifying all comms in an `Ensemble` #189

Open JHopeCollins opened 4 months ago

JHopeCollins commented 4 months ago

This PR slightly changes the way we split a large Ensemble into multiple smaller Ensembles, either in the SliceJacobiPC or in the nonlinear Gauss-Seidel iterations. firedrake.Ensemble essentially has two responsibilities:

  1. Make sure we have three comms on each rank, a global_comm, a spatial_comm, and an ensemble_comm. Currently, firedrake.Ensemble does this by taking a global_comm and splitting it into a Cartesian product of comms, providing a spatial_comm and an ensemble_comm from this product.
  2. Wrap mpi4py calls so that we can send firedrake.Functions from one spatial_comm to another across the ensemble_comm with a simple API and some sanity checks.

Creating a split ensemble involves intercepting the logic in 1 to make sure the comms in the split ensemble relate properly to the comms in the larger ensemble. Specifically, the split ensemble needs three communicators:

  1. A global_comm split from the global_comm of the larger ensemble
  2. An ensemble_comm split from the global_comm or ensemble_comm of the larger ensemble
  3. A spatial_comm that is the same as the spatial_comm from the larger ensemble so we can use the same mesh with both Ensembles.

The main issue here is that we need the spatial_comm of the split ensemble to be the same comm as the spatial_comm of the large ensemble, not just congruent. This means that we can't just make the smaller global_comm for the split ensemble and reuse the existing firedrake.Ensemble.__init__.

Previously I made a new EnsembleConnector class (terrible name, it connects existing spatial_comms, it doesn't connect different Ensembles). This class inherited from firedrake.Ensemble but overrode __init__, taking a global_comm and a specific spatial_comm, then created a new ensemble_comm by splitting the provided global_comm. To go with this is a split_ensemble function that takes in a large ensemble, splits it's global_comm, and passes the split global_comm and the spatial_comm to the new EnsembleConnector. This works fine for our case but has a couple of issues (other than the naming issues that already plague Ensemble).

  1. The main issue is that it mixes up responsibilities. The split_ensemble function does some, but not all of task 1, making sure we have three viable comms. It sorts out the global_comm and spatial_comm but not the ensemble_comm, which is left to the EnsembleConnector.
  2. It is specific to the case of already having the spatial_comm but not the ensemble_comm (what about the case where I already have the ensemble_comm but not the spatial_comm, or already have both).

This PR changes to having a ManualEnsemble class that inherits from firedrake.Ensemble but just takes three comms and checks that they look like a global/spatial/ensemble comm set (i.e. they look like a cartesian product of comms). It essentially is only doing task 2, wrapping mpi4py calls, and trusts that the three provided comms are a valid set to use. The split_ensemble function now does all of task 1, taking in a larger Ensemble, splitting the global_comm and the ensemble_comm, and passing these plus the original spatial_comm to ManualEnsemble.

ManualEnsemble is simpler and more general than EnsembleConnector was (also more of a footgun, but I've tried to add enough checks), and split_ensemble now takes care of all of the logic of splitting an Ensemble, rather than just some of it.

JHopeCollins commented 4 months ago

@JDBetteridge this PR only involves the split_ensemble function and the ManualEnsemble class, the create_ensemble is just a convenience function not used here.