NCAS-CMS / cfa-conventions

NetCDF Climate and Forecast Aggregation (CFA) Conventions
https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md

Format of the location AggregationInstruction term #38

Open nmassey001 opened 2 years ago

nmassey001 commented 2 years ago

I struggled a bit with implementing the location AggregationInstruction as a variable in a netCDF file. Can we remember why we decided on the current format? Here's an example (example 3):

    dimensions:
      // Extra dimensions
      i = 4 ;
      j = 2 ;
    variables:
      // Aggregation definition variables
      int location(i, j) ;
    data:
      location = 6, 6,
                 1, _,
                 73, _,
                 144, _ ;

Here, the size of i is the number of dimensions of the aggregated variable, and the size of j is the maximum number of fragments along any one of those dimensions.
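For reference, here is a minimal sketch of how a parser might recover the per-dimension fragment sizes from this layout (the array below is hand-built with numpy to mirror example 3; a reader such as netCDF4 would return the padded values as a masked array):

    import numpy as np

    # Hand-built equivalent of example 3's location(i, j) variable:
    # one row per aggregated dimension, padded with a fill value (-1
    # here) where a dimension has fewer than j fragments.
    location = np.ma.masked_equal([[6, 6], [1, -1], [73, -1], [144, -1]], -1)

    # Per-dimension fragment sizes: drop the padding from each row.
    sizes = [row.compressed().tolist() for row in location]
    print(sizes)  # [[6, 6], [1], [73], [144]]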

From an implementation point of view, it makes more sense for location to follow the dimensions of the other AggregationInstruction variables, with an extra dimension (k) that is always 2 and holds start/end pairs (the i dimension below runs over the variable's dimensions). The location then consists of actual locations (rather than spans), so the file contains the information you need to index the parent array, whereas the current method requires the parser to build that information itself. I'm also a bit concerned that the current method does not really work for sparse arrays. Here is how I'd change the specification (an indexing sketch follows the example):

    dimensions:
      // Fragment dimensions
      f_time = 2 ;
      f_level = 1 ;
      f_latitude = 1 ;
      f_longitude = 1 ;
      // Extra dimensions
      i = 4 ;
      k = 2 ;
    variables:
      // Aggregation definition variables
      int location(f_time, f_level, f_latitude, f_longitude, i, k) ;
    data:
      location = 0, 6,
                 0, 1,
                 0, 73,
                 0, 144,

                 6, 12,
                 0, 1,
                 0, 73,
                 0, 144 ;

The main downside is that the representation is not as compact, but the CFA Aggregation Files are going to be very small compared to the Aggregated data anyway.
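To illustrate the claim that this encoding lets you index the parent array directly, here is a minimal sketch (the location array is hand-built to match the CDL above, and Python's half-open slice convention is assumed):

    import numpy as np

    # Hand-built location variable with dimensions
    # (f_time, f_level, f_latitude, f_longitude, i, k).
    location = np.array([
        [[[[[0, 6], [0, 1], [0, 73], [0, 144]]]]],
        [[[[[6, 12], [0, 1], [0, 73], [0, 144]]]]],
    ])  # shape (2, 1, 1, 1, 4, 2)

    # The (start, end) pairs for fragment (1, 0, 0, 0) index the parent
    # array directly - no reconstruction pass is needed.
    parent = np.arange(12 * 1 * 73 * 144).reshape(12, 1, 73, 144)
    bounds = location[1, 0, 0, 0]  # [[6, 12], [0, 1], [0, 73], [0, 144]]
    fragment = parent[tuple(slice(start, end) for start, end in bounds)]
    print(fragment.shape)  # (6, 1, 73, 144)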

Let me know what you think @davidhassell, @bnlawrence, @JonathanGregory and @sadielbartholomew.

bnlawrence commented 2 years ago

Well, to be honest, I think this is more immediately comprehensible than the other version!

davidhassell commented 2 years ago

Hi Neil,

The discussion that led to the current encoding starts at https://github.com/NCAS-CMS/cfa-conventions/pull/20#pullrequestreview-699916034. I'm firmly in the "keep what we've got" camp. I'll try to explain ...

Storing the fragment sizes is logically equivalent to storing the start/end indices, has much less redundancy (we don't need to encode 0, 144 twice), and we don't need to introduce yet another non-physical dimension (k). It also uses half as much disk space, and therefore requires less data to be read. As you say, this is small relative to the full dataset, but for a case with 1.5 million fragments (a real use case), this saving will be O(6 MB) assuming we're using 32-bit ints, or O(12 MB) with 64-bit - is that significant?
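As a back-of-envelope check of those figures, assuming the start/end encoding duplicates roughly 1.5 million stored integers:

    # Extra bytes stored by start/end pairs relative to sizes, assuming
    # roughly 1.5 million integer values are duplicated.
    n_values = 1_500_000
    print(n_values * 4 / 1e6)  # 6.0  -> ~6 MB extra with 32-bit ints
    print(n_values * 8 / 1e6)  # 12.0 -> ~12 MB extra with 64-bit ints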

Storing the sizes also removes the need to define whether our indices start from one or zero - whichever we chose would be arbitrary and would be sure to upset someone. In your example, you also use the Python (and other languages') practice of the encoded end index being one larger than the actual end index (i.e. 0, 6 as opposed to 0, 5), which can be convenient but is not ideal for language-agnostic conventions.
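To illustrate that convention with Python's half-open slices, where the encoded end index is exclusive:

>>> a = list(range(12))
>>> a[0:6]   # the encoded end index 6 is exclusive ...
[0, 1, 2, 3, 4, 5]
>>> a[5]     # ... while the last index actually selected is 5
5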

As I can think of no logical benefit to storing the start/end indices as opposed to the sizes, and there are abstract reasons to prefer the latter, I think that we should keep the original formulation of the conventions rather than change it just to make an implementation easier.

Wait, I hear you say that the new suggestion is more readable - but not to me! Coming from the dask world, sizes seem very natural to me, as that is exactly how dask stores its chunksizes:

>>> import numpy as np
>>> import dask.array as da
>>> a = np.arange(12*73*144).reshape(12, 1, 73, 144)
>>> x = da.from_array(a, chunks=(6, 1, 73, 144))       # dask array similar to the above example
>>> x.chunks
((6, 6), (1,), (73,), (144,))
>>> x.numblocks
(2, 1, 1, 1)

I'm not using the dask implementation to support my reasons above for preferring sizes, just to show how I'm used to using them, and so am comfortable with them. If dask used start/end indices, I'm sure I'd tend to click with them better (as I used to). Even in that case, I'd still prefer sizes in the CFA conventions, for the arguments put forward in #20 and here.

In Python, the code to convert from sizes ([6, 1, 73, 144]) to indices ([(6, 12), (0, 1), (0, 73), (0, 144)]) is essentially just a walk-through of the cumulative sums of the sizes along each dimension:

>>> from dask.utils import cached_cumsum
>>> from itertools import product
>>> cached_cumsum([1, 2, 3, 4, 5], initial_zero=True)  # test cumsum ... all good
(0, 1, 3, 6, 10, 15)
>>> x.chunks
((6, 6), (1,), (73,), (144,))
>>> cumdims = [cached_cumsum(bds, initial_zero=True) for bds in x.chunks]
>>> cumdims
[(0, 6, 12), (0, 1), (0, 73), (0, 144)]
>>> indices = [[(s, s + dim) for s, dim in zip(starts, shapes)] for starts, shapes in zip(cumdims, x.chunks)]
>>> indices
[[(0, 6), (6, 12)], [(0, 1)], [(0, 73)], [(0, 144)]]
>>> for index in product(*indices):
...     print(index)
...
((0, 6), (0, 1), (0, 73), (0, 144))
((6, 12), (0, 1), (0, 73), (0, 144))

I know that it's probably a lot easier to write this in Python than in C :) but it seems like a succinct and usable procedure.
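For an implementation without dask to hand, the same walk-through can be written with just the standard library - a minimal sketch (fragment_indices is an illustrative name, not part of any API):

    from itertools import accumulate, product

    def fragment_indices(chunks):
        """Yield the (start, end) pairs of every fragment, given the
        per-dimension fragment sizes (dask-style chunks)."""
        per_dim = []
        for sizes in chunks:
            starts = list(accumulate(sizes, initial=0))  # cumulative sums
            per_dim.append(list(zip(starts, starts[1:])))
        return product(*per_dim)

    for index in fragment_indices([(6, 6), (1,), (73,), (144,)]):
        print(index)
    # ((0, 6), (0, 1), (0, 73), (0, 144))
    # ((6, 12), (0, 1), (0, 73), (0, 144))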

Cheers, David

nmassey001 commented 2 years ago

Thanks for the reply @davidhassell. In the review I did write:

I'm all for minimal and compact representations, as you probably know, so I'd be happy with this change

I'm happy to stick with the current way of doing it, as I've already implemented it and have now had confirmation from you that:

  1. Reconstructing the indices is an iterative procedure
  2. It is expected that the indices will need to be reconstructed before doing anything useful

One question remains, though: how do we specify very sparse arrays?