billsacks opened 5 years ago
A similar redesign would be to completely rework the data structures so that a given object only holds arrays for a single landunit. For variables that exist on multiple landunits, we would have (e.g.) one object for natural veg, another for crop, etc. Then all science routines that need to operate across multiple landunits would be called multiple times: once for each landunit over which the routine needs to operate.
I'm imagining we'd have something similar to what is proposed in #293 - having a number of high-level objects for the different landunits. (If we also do #293, then we could have multiple top-level objects - one per thread - each of which contains an object for each landunit type; it would probably be easiest to do #293 before this one.) Then imagine we have some science routine that needs to run over both the nat veg and crop landunits. This could be done as follows:
```fortran
do i = 1, numveg
   ltype = filter_veg(i)
   call myroutine(x_inst(ltype), y_inst(ltype))
end do
```
Note that we use something filter-like here, but now it just tells you which landunit indices make up the "veg" filter – 1 and 2 in this case.
I think this would let us do away with most filters in the science code (see also #236). This would require a fairly major rewrite of a lot of code, but this could be worthwhile both to make the code more understandable and to allow vectorization.
A few more thoughts on this, some getting a bit tangential (which should maybe become their own issue(s) at some point):
In terms of averaging across subgrid levels, one possibility would be: in initialization, set up pointers to each of the landunit-level arrays; then, when the p2g call comes, the averaging routine would operate on that array of arrays.
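To make the "array of arrays" idea concrete, here is a minimal Fortran sketch (all names here are hypothetical, not actual CTSM identifiers) of wrapping each landunit-level array in a derived type so that an averaging routine can loop over them:

```fortran
program array_of_arrays_sketch
  implicit none

  ! Wrapper type: Fortran doesn't allow an array of bare pointers, so each
  ! landunit-level array is wrapped in a derived type
  type :: array_ptr
     real, pointer :: data(:) => null()
  end type array_ptr

  integer, parameter :: num_ltypes = 2
  real, target    :: x_natveg(5), x_crop(3)   ! hypothetical landunit-level arrays
  type(array_ptr) :: x_ptrs(num_ltypes)
  integer :: l

  x_natveg = 1.0
  x_crop   = 2.0

  ! In initialization: point each entry at that landunit's array
  x_ptrs(1)%data => x_natveg
  x_ptrs(2)%data => x_crop

  ! In a p2g-like averaging routine: loop over the array of arrays
  do l = 1, num_ltypes
     print *, 'landunit type', l, 'mean =', sum(x_ptrs(l)%data) / size(x_ptrs(l)%data)
  end do
end program array_of_arrays_sketch
```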
One challenge with the idea in https://github.com/ESCOMP/CTSM/issues/829#issuecomment-594128300 (having a separate object per landunit) would be how to handle urban (and anything else) that has some columns with one set of physics and some with another (for urban, pervious roads are hydrologically active). (Maybe we could just live with conditionals for that case, or maybe we'd want to subdivide objects finer than landunits in that case, or maybe we should rework the subgrid structure as it relates to urban.)
Another challenge with this, and with other ideas that involve doing away with filters, is how to distinguish between active and inactive points. Running code over inactive points is inefficient, but having a conditional at the top of every loop may interfere with vectorization just as much as filters do. It may be worth revisiting the idea of having arrays contain only currently-active points, reallocating annually as needed (note that, for performance reasons, we would not want to allow new points to become active at arbitrary times, but once a year seems reasonable). Doing this would require a method to migrate all data from one index to another. We could do that via generic data structures and routines that operate on all variables or, maybe easier, by giving each data structure a routine that explicitly performs this copy for all variables that need it. Note that, if we do this at the right point in the driver loop, we could probably restrict it to variables that are needed for restart. Then we'd want some test that would fail if we missed a variable in this operation.
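As a sketch of the per-data-structure copy routine idea (the type and field names below are hypothetical stand-ins, not actual CTSM code), each instance type could get a method that copies all restart-relevant fields from one index to another:

```fortran
module migrate_sketch
  implicit none

  ! Hypothetical instance type with two restart-relevant fields
  type :: temperature_type
     real, allocatable :: t_grnd_col(:)
     real, allocatable :: t_soisno_col(:,:)
   contains
     procedure :: copy_column
  end type temperature_type

contains

  ! Copy all fields needed for restart from index c_old to index c_new;
  ! called when compressing arrays to contain only currently-active points
  subroutine copy_column(this, c_old, c_new)
    class(temperature_type), intent(inout) :: this
    integer, intent(in) :: c_old, c_new

    this%t_grnd_col(c_new)      = this%t_grnd_col(c_old)
    this%t_soisno_col(c_new, :) = this%t_soisno_col(c_old, :)
    ! ... one assignment per field; a test should fail if a field is missed
  end subroutine copy_column

end module migrate_sketch

program demo_migrate
  use migrate_sketch
  implicit none
  type(temperature_type) :: temp

  allocate(temp%t_grnd_col(10), temp%t_soisno_col(10, 5))
  temp%t_grnd_col = 0.0 ; temp%t_soisno_col = 0.0
  temp%t_grnd_col(3) = 280.0

  call temp%copy_column(3, 7)   ! migrate column 3's data to index 7
  print *, temp%t_grnd_col(7)
end program demo_migrate
```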
I have also been thinking about how to do the rework in this issue incrementally, because I think it would be a bad idea to try to do this all at once without intermediate testing. I wonder if we could do this one data structure at a time; in the limbo period, perhaps we could have indirect indexing arrays that translate from (e.g.) an overall column index to an index within a given landunit. So we could have:
```fortran
do fc = 1, num_filterX   ! num_filterX: number of points in this filter
   c = filterX(fc)
   c_within_l = col_to_new(li, c)
   ! ... operate on landunit-local arrays indexed by c_within_l ...
end do
```
(where `li` is the landunit index used in this subroutine call). To facilitate this, maybe start by having calls done multiple times with filters that each operate over just one landunit.
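As a sketch of how that indirect indexing array could be built in initialization (all names and index values hypothetical), assuming the columns of each landunit type occupy a contiguous range:

```fortran
program col_to_new_sketch
  implicit none
  integer, parameter :: num_ltypes = 2
  ! Hypothetical contiguous column ranges for each landunit type
  integer, parameter :: begc_lun(num_ltypes) = [1, 6]
  integer, parameter :: endc_lun(num_ltypes) = [5, 8]
  integer :: col_to_new(num_ltypes, 8)
  integer :: l, c

  ! Translate an overall column index into an index within its landunit
  col_to_new = 0   ! 0 marks columns outside the given landunit
  do l = 1, num_ltypes
     do c = begc_lun(l), endc_lun(l)
        col_to_new(l, c) = c - begc_lun(l) + 1
     end do
  end do

  print *, col_to_new(2, 7)   ! column 7 is the 2nd column of landunit type 2
end program col_to_new_sketch
```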
However, thinking about how to do this incrementally makes me think it may be best not to jump straight to having separate data structures for each landunit. Instead, we could keep the data structures and arrays as they currently are, and start by getting rid of filters, operating on subsets of arrays instead (see the sketch below). I think it will be easier to do things incrementally like this. Then we could later take the next step of reworking the data structures so that each object holds only a single landunit. (Once the subroutine calls have been reworked to not use filters, so that each subroutine call operates over only a single landunit, this data structure rework should be easier to do incrementally... though I'm not sure it would actually add significant benefit at that point.)
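Here is a sketch of what "operating on subsets of arrays" could look like (names and bounds hypothetical): each call passes just the contiguous slice of each array belonging to one landunit, so the routine body needs no filter:

```fortran
program subset_calls_sketch
  implicit none
  real :: x_col(8), y_col(8)
  ! Hypothetical contiguous column bounds for each landunit
  integer, parameter :: begc_natveg = 1, endc_natveg = 5
  integer, parameter :: begc_crop   = 6, endc_crop   = 8

  x_col = 1.0

  ! One call per landunit, passing only that landunit's slice of each array
  call myroutine(x_col(begc_natveg:endc_natveg), y_col(begc_natveg:endc_natveg))
  call myroutine(x_col(begc_crop:endc_crop),     y_col(begc_crop:endc_crop))
  print *, y_col

contains

  subroutine myroutine(x, y)
    real, intent(in)  :: x(:)
    real, intent(out) :: y(:)
    y = 2.0 * x   ! plain whole-array operation: no filter indirection
  end subroutine myroutine

end program subset_calls_sketch
```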
If we ever want to consider porting the code to GPUs or similar architectures, this compression of the arrays could be worthwhile or maybe even essential to avoid being killed by the memory transfer costs. This is similar to the benefit we'd get even on traditional architectures of not having to fetch as much from memory, but the importance is bigger on accelerators, where the memory bandwidth between the CPU and the accelerator is even more limiting.
Naive question here, but one whose answer would help me understand the context. Is there an expectation, plan, or need to make all or part of the CTSM code GPU-friendly?
Sorry, not at all a naive question; I should have explained the context for my thinking a bit: I have always assumed that CTSM would NOT be a target for GPUs, partly based on my understanding that ELM (the E3SM land model) is not a target for DOE's massive GPUification effort. However, in recent conversations with @briandobbins he was wondering if the land model might provide some low-hanging fruit for GPUification, given that it doesn't have any communication between processors to worry about. I think this is still very much an open question.
I think the key point here, though, is that we can potentially get a lot of different performance benefits from working on the underlying memory architecture of CTSM, though this would certainly need more investigation before saying anything definitive.
There are various variables in the model that are only needed over certain landunits - e.g., patch-level variables only needed for crop patches, or variables only needed for urban. We could save memory by only allocating these variables over the landunits where they apply.
This should be possible, given the way memory is currently allocated for subgrid arrays: Each landunit is grouped together (rather than each gridcell being grouped together). However, note that this approach would no longer be feasible if we went back to the way this was done until a few years ago, where all landunits for gridcell 1 appeared first, then all landunits for gridcell 2, etc.; this tie-in to the current memory allocation is a possible argument against doing this.
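A minimal sketch of what this would look like (bounds values hypothetical), relying on the fact that each landunit's points occupy one contiguous index range:

```fortran
program restricted_alloc_sketch
  implicit none
  ! Hypothetical patch bounds: crop patches occupy the contiguous range 4:6
  integer, parameter :: begp = 1, endp = 10
  integer, parameter :: begp_crop = 4, endp_crop = 6
  real, allocatable :: t_veg_patch(:)    ! needed over all patches
  real, allocatable :: grainc_patch(:)   ! needed only over crop patches

  allocate(t_veg_patch(begp:endp))
  allocate(grainc_patch(begp_crop:endp_crop))   ! saves memory over non-crop patches

  ! This works only because crop patches form one contiguous block; with the
  ! old gridcell-grouped ordering, crop patches would be scattered through
  ! begp:endp and this restricted allocation would be impossible
  print *, lbound(grainc_patch), ubound(grainc_patch)
end program restricted_alloc_sketch
```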
I don't think anything would need to change in the body of science code to accommodate this: as long as code operates over filters that only include applicable points, things should work the same if arrays are allocated with more restrictive bounds. However, care will be needed when specifying the lower bounds of arrays that are arguments to subroutines. One way to assist with this would be to add to the current variable suffix (`_patch`, `_col`, etc.) a designator specifying the first and last landunits over which the variable applies (if not landunit 1): e.g., `_patch2_2` for crop-only variables, `_lun7_9` for urban-only variables (landunit 7 is the first urban landunit), etc. The `bounds` object would be extended to have beg & end bounds for each landunit type.

One challenging part of this is handling output to history files and restart files, along with infrastructure like p2g that expects a full-extent array. I'm not sure off-hand how to handle that, but we could probably make this work by adding optional arguments to generic routines that give the starting and ending bounds of arrays.
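A sketch of the extended `bounds` object and a restricted-bounds allocation (all names and index values hypothetical):

```fortran
module bounds_sketch
  implicit none

  ! Hypothetical extension of the bounds type: alongside the usual
  ! full-extent bounds, carry begin/end indices per landunit type
  type :: bounds_type
     integer :: begc, endc                 ! full column-level extent
     integer, allocatable :: begc_lun(:)   ! first column of each landunit type
     integer, allocatable :: endc_lun(:)   ! last column of each landunit type
  end type bounds_type

end module bounds_sketch

program use_bounds_sketch
  use bounds_sketch
  implicit none
  type(bounds_type) :: bounds
  real, allocatable :: wtd_lun7_9(:)   ! hypothetical urban-only variable

  bounds%begc = 1 ; bounds%endc = 100
  allocate(bounds%begc_lun(9), bounds%endc_lun(9))
  bounds%begc_lun = 0 ; bounds%endc_lun = 0
  bounds%begc_lun(7) = 41 ; bounds%endc_lun(9) = 55   ! hypothetical urban range

  ! An urban-only variable spans landunits 7 through 9 only
  allocate(wtd_lun7_9(bounds%begc_lun(7):bounds%endc_lun(9)))
  print *, lbound(wtd_lun7_9), ubound(wtd_lun7_9)
end program use_bounds_sketch
```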
I'm not sure if this is worth doing: it will take a fair amount of effort, so it only seems worthwhile if we have a lot of variables like this and/or memory limitations become a concern. This might have some small benefit in terms of time-efficiency, because our performance often seems memory-bound, but since we already order memory by landunit, I expect the time benefit to be small.