Starter ideas for using xarray and Awkward Array together

jpivarski commented 1 year ago

I realize that you're collecting use-case ideas right now, but eventually you'll need implementations and here's an idea to start.

Efficiently encoding ragged data in xarray

An Awkward Array can be broken down into a set of different-length one-dimensional arrays, and xarray coordinates can all have different lengths. The data block needs to be a product of those dimensions, but what if the data block is a zero-strided array, so that it can have arbitrary shape but take no memory?

>>> shape = (3, 10000, 999999999)

>>> np.lib.stride_tricks.as_strided(np.array([np.nan]), shape, [0] * len(shape))
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]])

Here's a converter from Awkward list-type arrays (of arbitrary depth, but only lists) to xarray:

def to_xarray(array, coord_names=None, name=None, attrs=None):
    layout = ak.to_packed(array, highlevel=False)

    depth = 0
    coords = {}
    shape = []
    while True:
        if coord_names is None:
            coord_name = layout.parameter("__xarray_name__")
            if coord_name is None:
                coord_name = f"dim_{depth}"
        else:
            coord_name, coord_names = coord_names[0], coord_names[1:]
        depth += 1

        if layout.is_list:
            listoffsetarray = layout.to_ListOffsetArray64(start_at_zero=True)
            coords[coord_name] = listoffsetarray.offsets.data
        elif layout.is_numpy:
            assert len(layout.shape) == 1  # because of ak.to_packed
            coords[coord_name] = layout.data
        elif layout.is_empty:
            coords[coord_name] = np.zeros(0)
        else:
            raise TypeError(
                f"only arrays of lists (of lists) are allowed, not {layout.form.type!s}"
            )

        shape.append(len(coords[coord_name]))
        if layout.is_list:
            layout = listoffsetarray.content
        else:
            break

    fake = np.lib.stride_tricks.as_strided(np.array([np.nan]), shape, [0] * len(shape))

    return xr.DataArray(fake, coords, name=name, attrs=attrs)

The xarrays made this way don't look like normal arrays, and they shouldn't. The wall of nan is a hint that this is not a normal array.

>>> a = ak.Array([[[0.0, 1.1, 2.2]], [[], [3.3, 4.4]], [[5.5], [6.6, 7.7, 8.8, 9.9]]])
>>> a.show(type=True)
type: 3 * var * var * float64
[[[0, 1.1, 2.2]],
 [[], [3.3, 4.4]],
 [[5.5], [6.6, 7.7, 8.8, 9.9]]]

>>> x = to_xarray(a, ["first", "second", "third"])
>>> x
<xarray.DataArray (first: 4, second: 6, third: 10)>
array([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],

       [[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],

       [[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]],

       [[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]])
Coordinates:
  * first    (first) int64 0 1 3 5
  * second   (second) int64 0 3 3 5 6 10
  * third    (third) float64 0.0 1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9

Doing Awkward-style slices (and other methods)

Now here's an accessor that reconstructs the Awkward Array (maybe it should only be allowed to succeed if the data consist of zero-strided NaNs?). It also provides methods like __getitem__ that lift from xarray into Awkward, performs the slice, and then back to xarray.

@xr.register_dataarray_accessor("ak")
class AwkwardAccessor:
    def __init__(self, data):
        self._data = data

    def to_awkward(self, keep_names=False):
        layout = None
        for name, coord in list(self._data.coords.items())[::-1]:
            if layout is None:
                layout = ak.contents.NumpyArray(
                    np.asarray(coord),
                    parameters={"__xarray_name__": name} if keep_names else None,
                )
            else:
                layout = ak.contents.ListOffsetArray(
                    ak.index.Index(np.asarray(coord)),
                    layout,
                    parameters={"__xarray_name__": name} if keep_names else None,
                )
        return ak.Array(layout)

    def __getitem__(self, slicer):
        return to_xarray(self.to_awkward(keep_names=True)[slicer])

    def sum(self, axis=None):
        out = ak.sum(self.to_awkward(keep_names=True), axis=axis)
        if isinstance(out, ak.Array):
            return to_xarray(out)
        else:
            return out

So any slice that could have been performed on the Awkward Array,

>>> a[::2, :, [0, -1]].show()
[[[0, 2.2]],
 [[5.5, 5.5], [6.6, 9.9]]]

can now be performed on the xarray as well, as long as we go through the ak accessor:

>>> x.ak[::2, :, [0, -1]]
<xarray.DataArray (first: 3, dim_1: 4, third: 6)>
array([[[nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan]],

       [[nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan]],

       [[nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan]]])
Coordinates:
  * first    (first) int64 0 1 3
  * dim_1    (dim_1) int64 0 2 4 6
  * third    (third) float64 0.0 2.2 5.5 5.5 6.6 9.9

Since these conversions between xarray and Awkward would be happening frequently, it's important that they are zero-copy.

Similarly,

>>> ak.sum(a, axis=-1).show()
[[3.3],
 [0, 7.7],
 [5.5, 33]]

and

>>> x.ak.sum(axis=-1)
<xarray.DataArray (dim_0: 4, dim_1: 5)>
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 3 5
  * dim_1    (dim_1) float64 3.3 0.0 7.7 5.5 33.0

Extensions

Now I'm getting greedy again: I don't want to be limited to only lists of (lists of...) numbers, but also regular-length dimensions, nested records, missing data, and all that. Since xarrays can hold any number of different-length coordinates, maybe we can unpack arbitrary arrays:

>>> form, length, coords = ak.to_buffers(a)
>>> print(form)
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray",
        "offsets": "i64",
        "content": {
            "class": "NumpyArray",
            "primitive": "float64",
            "form_key": "node2"
        },
        "form_key": "node1"
    },
    "form_key": "node0"
}
>>> length
3
>>> coords
{
    'node0-offsets': array([0, 1, 3, 5]),
    'node1-offsets': array([ 0,  3,  3,  5,  6, 10]),
    'node2-data': array([0. , 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
}

but then the coordinate names would have less meaning. These three can be identified as nested-list dimensions, but if the array has any missing data, there would be additional "coords" for the masks, if it has any regular dimensions, there wouldn't be corresponding "coords" for those dimensions, if it has nested record fields, there would be a lot of "coords", etc.

So it's a question of how closely the "coords" needs to correspond to actual coordinates. In the previous example, xarray x had as many dimensions as the array it represented, but the lengths of those dimensions didn't have a direct relationship with the ragged array. Other than the first dimension, it can't because the array is ragged. (And if you make the first dimension be just the stopping indexes of each list, rather than fence-posts between all the lists, then converting back to an Awkward Array can't be zero-copy.)

Metadata

The example I showed above preserves the names of some of the axes by putting the xarray axis names into Awkward parameters, performing the slice, and then pulling them back out. The name "second" was lost (replaced with "dim_1" by the code written above) because it was rewritten by the [0, -1] part of the slice. Although that's a natural consequence of one layout node being replaced by another, it's probably not what we want here.

https://github.com/scikit-hep/awkward/issues/1391 is a still-open request for Awkward to handle metadata better: to preserve it and propagate it through calculations in a way that is appropriate for xarray. Actually adding that Awkward feature depends on how it will be used in conversions to and from xarray, so that PR is interdependent with this project.

Thoughts?

What's good and what's bad?

Cc: @TomNicholas, @joshmoore, who were also on the email.

milancurcic commented 1 year ago

Thanks a lot for this @jpivarski. Very helpful. I have a few questions for now. Storing each array in coordinates is smart albeit a tad awkward. But if it ends up being an implementation detail and the user gets a nice syntax, I don't see an issue with it.

A few questions for now:

Is it possible to override xarray's default accessor with xr.register_dataarray_accessor? So that instead of x.ak[...], we can just do x[...] and get the awkard-style slicing. Here I'm assuming that a DataArray would either be regular (structured), or ragged, but never both, so it's OK to lose the original accessor.
If I create a large-ish ragged xarray with your method, say a 1000 trajectories of length 2000, the nbytes attribute seems misleading:
```
In [98]: a = ak.Array([np.random.random((2000)) for n in range(1000)])
```

In [99]: x = to_xarray(a)

In [100]: x.nbytes / 1000**3 Out[100]: 16.016


Obviously this ragged DataArray is not taking up 16 gigs. Is this because `nbytes` is determined under the hood as a product of dimensions sizes times the type size.
3. Is there an easy way to "fake away" the 0-length coordinate for the user? In other words, keep it an implementation detail, but hide it in the user-facing representation of the data structure?

I'm impressed that you could even do this much. :)

jpivarski commented 1 year ago

Ideally, we'd want to write a subclass for xr.DataArray (not a problem) and have that subclass used in derived products (a problem). I found the xr.register_dataarray_accessor because I was searching for ways to write subclasses.

I'm not surprised that xarray has such a mechanism: Awkward Array also has an override mechanism (ak.behavior) that has some features in common, which arose from similar needs.

So for (1), it wouldn't be through xr.register_dataarray_accessor because that acts through a nested namespace on all xr.DataArray objects. It's not possible to have it apply to some xr.DataArray objects and not others, so it absolutely can't override something like __getitem__ on the main class. The Awkward Array override mechanism lets you write subclasses (inheritance, rather than composition, which is specifically called out in the xarray documentation as what they were trying to avoid), and then the problem becomes one of saying which arrays it applies to, since operations on arrays produce new arrays all the time. Awkward's behaviors are applied depending on which parameters an array has, which could work here, but I can see why you need this to look like an xarray on the outside, rather than vice-versa.

For (2), that's NumPy:

>>> array = np.array([[1, 2], [3, 4]])
>>> array.nbytes
32
>>> not_actually_more_memory = np.lib.stride_tricks.as_strided(array, (1000000, 1000000), (0, 8))
>>> not_actually_more_memory.nbytes
8000000000000

The nbytes parameter is just prod(array.shape) * array.itemsize, regardless of whether the strides step over the data in a contiguous way or not. In some extreme examples, it's hard to say what the value of nbytes should be:

If you use np.lib.stride_tricks.sliding_window_view, some elements of the array alias other elements of the same array. Should they be counted more than once?
If you slice a small view of a big array, should nbytes report the memory used by the big array? That memory can't be freed while the small view continues to exist, so the small view has the big array as a memory cost.

Using nbytes to ask "how much memory is this array using?" is only meaningful for directly-owned (non-view) arrays with strides that correspond to C order or Fortran order—nothing in between.

(3) which coordinate is zero-length?

The coordinates that correspond to list offsets kinda make sense (though the outermost one would make more sense if it were 1 shorter; then it would have the length of the array it represents).

The last thing that I put in coords, the floating-valued contents, could have gone in the stride-tricked array instead. That would make more sense, as the list offsets are like coordinates and the numerical content is not. A first draft of the message above did that.

For instance, to represent [[1.1, 2.2, 3.3], [], [4.4, 5.5]] with

>>> offsets = np.array([0, 3, 3, 5])
>>> content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])

we could pack them like

>>> x = xr.DataArray(
...     np.lib.stride_tricks.as_strided(content, (3, 5), (0, 8)),
...     {"stops": offsets[1:],
...      "": np.lib.stride_tricks.as_strided(np.int64(0), (5,), (0,))  # need *some* coordinate of length 5
...     }
... )
>>> x
<xarray.DataArray (stops: 3, : 5)>
array([[1.1, 2.2, 3.3, 4.4, 5.5],
       [1.1, 2.2, 3.3, 4.4, 5.5],
       [1.1, 2.2, 3.3, 4.4, 5.5]])
Coordinates:
  * stops    (stops) int64 3 3 5
  *          () int64 0 0 0 0 0

That way, the data part of the xarray isn't full of nan, but the content is repeated an unnatural number of times and you still need some coordinate to represent that inner dimension.

In the above, I also sliced the offsets to be just stops, which has all the information if the offsets are constrained to start from zero, and stops has the same length as the array that it represents. However, if we want to make an Awkward Array from this (to do some slice or other operation, then convert back to xarray), we'd have to make a new buffer to prepend it with a zero or make a starts. (Awkward's two ways of doing this are ListArray, which needs starts and stops, and ListOffsetArray, which needs offsets.)

On the other hand, the original offsets with their initial zero is stored within the stops as a base: we could retrieve it as an optimization and make new buffers if it's not available.

>>> x.coords["stops"].values.base
array([0, 3, 3, 5])

So, there are a lot of different ways to go, but they each have their downsides. Despite the admonitions in the documentation, maybe it would be possible and reasonable to make a subclass of xr.DataArray. All of its methods (including __getitem__) would just have to be careful to cast their return values as the subclass. (We might need to override every public API function to ensure that xarray's private calls to methods like __getitem__ get the parent class. All of that is doable, though a lot of work up-front.) The advantage of actually subclassing is that everything can be overridden, including __repr__ to make hide the sausage-making.

Cloud-Drift / ragged-array-idioms