Subspace reduce was implemented in the past for the case without parallelization within the grids, i.e., without process groups.
To do subspace reduce with process groups, one could "vote" within each process group on which subspaces are present. A set of process groups that shares exactly the same subspaces (while all others do not) could then have its own reduce communicator; the reduction would iterate over all reduce communicators.
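The grouping logic described above can be sketched in a few lines. This is a minimal illustration, not DisCoTec code: the function name, the representation of a process group as a set of subspace indices, and the example data are all hypothetical.

```python
# Sketch: group process groups by their exact set of subspaces.
# All groups that share the same subspace set would share one reduce
# communicator; the reduction then loops over all communicators.
# (Hypothetical data layout; DisCoTec's real classes differ.)

def build_reduce_communicators(subspaces_per_group):
    """Map each distinct subspace set to the list of process group ids
    that hold exactly that set (one reduce communicator per key)."""
    communicators = {}
    for group_id, subspaces in enumerate(subspaces_per_group):
        communicators.setdefault(frozenset(subspaces), []).append(group_id)
    return communicators

# Four process groups "vote" with their subspace sets:
groups = [{0, 1, 2}, {0, 1, 2}, {0, 3}, {0, 1, 2}]
comms = build_reduce_communicators(groups)
for subspace_set, members in sorted(comms.items(), key=lambda kv: kv[1]):
    print(sorted(subspace_set), "->", members)
```

Here groups 0, 1, and 3 would share one reduce communicator, while group 2 gets its own.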
A challenge we have already identified:
The (potential) number of communicators can become excessively large: O(process group size * 2^(number of subspaces)), where the 2^ term comes from the power set of subspaces. The process group size can be up to 2^13 to 2^14 for current scenarios, and the number of subspaces can be in the 100,000s.
-> We are not sure whether a created communicator uses memory only on the ranks it contains, or whether the information is also collected globally somewhere. How do MPI implementations handle this?
-> This could maybe be remedied by a good scenario splitting where partitions = process groups, as in https://github.com/SGpp/DisCoTec-combischeme-utilities . Then there should be many groups that share the exact same set of subspaces.
-> There could be a trade-off between sparse grid reduce and subspace reduce (if some subspaces are allocated in addition to the ones that are strictly required).
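A back-of-the-envelope check of the bound above, using the numbers from these notes. The worst-case count is far too large to represent directly, so it is compared in log2 terms; the number of process groups and the reading that each distinct subspace set needs one communicator per rank slot are assumptions for illustration.

```python
from math import log2

group_size = 2**13        # ranks per process group (current scenarios)
num_subspaces = 100_000   # number of subspaces can be in the 100,000s

# Worst-case bound from the notes: group_size * 2^(num_subspaces).
# 2^100000 cannot be printed sensibly, so compare the log2 instead:
worst_case_log2 = log2(group_size) + num_subspaces
print("log2 of worst-case communicator count:", worst_case_log2)

# In practice only subspace sets that actually occur need communicators:
# at most one distinct set per process group, so with a good scenario
# splitting the count is bounded by num_groups * group_size rather than
# by the power set. (num_groups = 64 is a made-up example value.)
num_groups = 64
print("practical upper bound:", num_groups * group_size)
```

This is why the scenario splitting above matters: the power-set term only bites if many different subspace sets actually occur.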
To save memory, we could move from sparse grid reduce to subspace reduce, cf. https://ebooks.iospress.nl/doi/10.3233/978-1-61499-381-0-564; this was implemented in the past for the case without parallelization within the grids.
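The memory difference between the two schemes can be sketched without MPI: sparse grid reduce keeps a buffer covering all subspaces on every process group, while subspace reduce only allocates the subspaces a group actually holds. All subspace sizes and group assignments below are made up for illustration.

```python
# Hypothetical subspace sizes (number of coefficients per subspace):
subspace_sizes = {0: 4096, 1: 2048, 2: 2048, 3: 1024, 4: 512}

# Sparse grid reduce: every process group allocates the full sparse
# grid buffer, i.e. the sum over *all* subspaces.
sparse_grid_buffer = sum(subspace_sizes.values())

# Subspace reduce: each group only allocates the subspaces it holds.
group_subspaces = [{0, 1, 2}, {0, 3, 4}]
per_group_buffers = [sum(subspace_sizes[s] for s in g) for g in group_subspaces]

print("sparse grid reduce buffer per group:", sparse_grid_buffer)
print("subspace reduce buffers per group:", per_group_buffers)
```

The gap between the two buffer sizes is the memory saving; allocating extra subspaces beyond the strictly required ones (the trade-off noted above) shrinks it.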