Open jpivarski opened 1 year ago
Thanks a lot for this @jpivarski. Very helpful. I have a few questions for now. Storing each array in coordinates is smart albeit a tad awkward. But if it ends up being an implementation detail and the user gets a nice syntax, I don't see an issue with it.
A few questions for now:
1. Could the default indexing be overridden through `xr.register_dataarray_accessor`? So that instead of `x.ak[...]`, we can just do `x[...]` and get the Awkward-style slicing. Here I'm assuming that a DataArray would either be regular (structured) or ragged, but never both, so it's OK to lose the original accessor.
2. The `nbytes` attribute seems misleading:
In [98]: a = ak.Array([np.random.random((2000)) for n in range(1000)])
In [99]: x = to_xarray(a)
In [100]: x.nbytes / 1000**3
Out[100]: 16.016
Obviously this ragged DataArray is not taking up 16 gigs. Is this because `nbytes` is determined under the hood as the product of the dimension sizes times the type size?
3. Is there an easy way to "fake away" the 0-length coordinate for the user? In other words, keep it an implementation detail, but hide it in the user-facing representation of the data structure?
I'm impressed that you could even do this much. :)
Ideally, we'd want to write a subclass for `xr.DataArray` (not a problem) and have that subclass used in derived products (a problem). I found the `xr.register_dataarray_accessor` because I was searching for ways to write subclasses.
I'm not surprised that xarray has such a mechanism: Awkward Array also has an override mechanism (ak.behavior) that has some features in common, which arose from similar needs.
So for (1), it wouldn't be through `xr.register_dataarray_accessor`, because that acts through a nested namespace on all `xr.DataArray` objects. It's not possible to have it apply to some `xr.DataArray` objects and not others, so it absolutely can't override something like `__getitem__` on the main class. The Awkward Array override mechanism lets you write subclasses (inheritance rather than composition; inheritance is specifically called out in the xarray documentation as what they were trying to avoid), and then the problem becomes one of saying which arrays it applies to, since operations on arrays produce new arrays all the time. Awkward's behaviors are applied depending on which `parameters` an array has, which could work here, but I can see why you need this to look like an xarray on the outside, rather than vice-versa.
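To make the namespace point concrete, here is a minimal sketch of an accessor. The accessor name `demo` and the class `DemoAccessor` are made up for this illustration and are not the accessor discussed in this thread; only `xr.register_dataarray_accessor` itself is xarray API.

```python
import numpy as np
import xarray as xr


# "demo" and DemoAccessor are hypothetical names used only for this sketch.
@xr.register_dataarray_accessor("demo")
class DemoAccessor:
    def __init__(self, xarray_obj):
        # xarray passes in the DataArray that the accessor was reached from.
        self._obj = xarray_obj

    def __getitem__(self, where):
        # Custom indexing lives under the namespace: x.demo[...], not x[...].
        return ("custom", self._obj.values[where])


regular = xr.DataArray(np.arange(3))
other = xr.DataArray(np.zeros((2, 2)))

# The accessor is attached to every DataArray, not to a chosen subset.
print(hasattr(regular, "demo"), hasattr(other, "demo"))

# Ordinary indexing is untouched and still returns a DataArray.
print(type(regular[1]).__name__, regular.demo[1])
```

The registration adds an attribute to the class, so there is no hook for saying "only ragged arrays get this", and `regular[1]` never passes through `DemoAccessor.__getitem__`.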
For (2), that's NumPy:
>>> array = np.array([[1, 2], [3, 4]])
>>> array.nbytes
32
>>> not_actually_more_memory = np.lib.stride_tricks.as_strided(array, (1000000, 1000000), (0, 8))
>>> not_actually_more_memory.nbytes
8000000000000
The `nbytes` attribute is just `prod(array.shape) * array.itemsize`, regardless of whether the `strides` step over the data in a contiguous way or not. In some extreme examples, it's hard to say what the value of `nbytes` should be: for a small view of a big array, should `nbytes` report the memory used by the big array? That memory can't be freed while the small view continues to exist, so the small view has the big array as a memory cost.
Using `nbytes` to ask "how much memory is this array using?" is only meaningful for directly owned (non-view) arrays with `strides` that correspond to C order or Fortran order, nothing in between.
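The small-view case can be sketched with plain NumPy slicing. The array names and sizes here are arbitrary choices for illustration:

```python
import numpy as np

big = np.zeros(1_000_000)    # float64 array that owns 8 MB of memory
small = big[:10]             # a 10-element view into the same buffer

# nbytes is computed from shape and itemsize only: 10 * 8 bytes.
print(small.nbytes)

# The view holds a reference to the big array, which keeps it alive.
print(small.base is big)

# Memory that cannot be freed while `small` exists.
print(small.base.nbytes)
```

So `small.nbytes` understates the memory pinned by the view, just as the stride-tricked example overstates it.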
For (3), which coordinate is zero-length?
The coordinates that correspond to list offsets kinda make sense (though the outermost one would make more sense if it were 1 shorter; then it would have the length of the array it represents).
The last thing that I put in `coords`, the floating-valued contents, could have gone in the stride-tricked array instead. That would make more sense, as the list offsets are like coordinates and the numerical content is not. A first draft of the message above did that.
For instance, to represent `[[1.1, 2.2, 3.3], [], [4.4, 5.5]]` with
>>> offsets = np.array([0, 3, 3, 5])
>>> content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
we could pack them like
>>> x = xr.DataArray(
...     np.lib.stride_tricks.as_strided(content, (3, 5), (0, 8)),
...     {"stops": offsets[1:],
...      "": np.lib.stride_tricks.as_strided(np.int64(0), (5,), (0,))  # need *some* coordinate of length 5
...     }
... )
>>> x
<xarray.DataArray (stops: 3, : 5)>
array([[1.1, 2.2, 3.3, 4.4, 5.5],
       [1.1, 2.2, 3.3, 4.4, 5.5],
       [1.1, 2.2, 3.3, 4.4, 5.5]])
Coordinates:
  * stops    (stops) int64 3 3 5
  *          () int64 0 0 0 0 0
That way, the data part of the xarray isn't full of `nan`, but the content is repeated an unnatural number of times and you still need some coordinate to represent that inner dimension.
In the above, I also sliced the `offsets` to be just `stops`, which has all the information if the `offsets` are constrained to start from zero, and `stops` has the same length as the array that it represents. However, if we want to make an Awkward Array from this (to do some slice or other operation, then convert back to xarray), we'd have to make a new buffer to prepend it with a zero or make a `starts`. (Awkward's two ways of doing this are ListArray, which needs `starts` and `stops`, and ListOffsetArray, which needs `offsets`.)
On the other hand, the original `offsets`, with their initial zero, are stored within the `stops` as its `base`: we could retrieve them as an optimization and make new buffers only if the `base` is not available.
>>> x.coords["stops"].values.base
array([0, 3, 3, 5])
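The relationship between `stops`, `starts`, and `offsets` described above can be sketched in plain Python. The helper names below are hypothetical, and lists stand in for NumPy buffers; this is not how Awkward Array implements ListArray or ListOffsetArray, only an illustration of why going from `stops` alone requires building a new sequence.

```python
def offsets_from_stops(stops):
    """Prepend the implicit zero; this necessarily builds a new sequence."""
    return [0] + list(stops)


def starts_from_stops(stops):
    """Each list starts where the previous one stopped (the first starts at 0)."""
    return [0] + list(stops[:-1])


def unflatten(content, starts, stops):
    """Cut the flat content into lists using per-list start/stop indexes."""
    return [content[i:j] for i, j in zip(starts, stops)]


content = [1.1, 2.2, 3.3, 4.4, 5.5]
stops = [3, 3, 5]

# ListOffsetArray-style: one fence-post sequence, one element longer than the array.
offsets = offsets_from_stops(stops)
from_offsets = unflatten(content, offsets[:-1], offsets[1:])

# ListArray-style: separate starts and stops, each as long as the array.
starts = starts_from_stops(stops)
from_starts_stops = unflatten(content, starts, stops)

print(offsets, starts)
print(from_offsets)
print(from_starts_stops)
```

Both routes recover `[[1.1, 2.2, 3.3], [], [4.4, 5.5]]`, and both allocate something new unless the original `offsets` can be recovered from the `base`.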
So, there are a lot of different ways to go, but they each have their downsides. Despite the admonitions in the documentation, maybe it would be possible and reasonable to make a subclass of `xr.DataArray`. All of its methods (including `__getitem__`) would just have to be careful to cast their return values as the subclass. (We might need to override every public API function to ensure that xarray's private calls to methods like `__getitem__` get the parent class. All of that is doable, though a lot of work up-front.) The advantage of actually subclassing is that everything can be overridden, including `__repr__` to hide the sausage-making.
I realize that you're collecting use-case ideas right now, but eventually you'll need implementations and here's an idea to start.
Efficiently encoding ragged data in xarray
An Awkward Array can be broken down into a set of different-length one-dimensional arrays, and xarray coordinates can all have different lengths. The data block needs to be a product of those dimensions, but what if the data block is a zero-strided array, so that it can have arbitrary shape but take no memory?
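As a rough illustration of the zero-strided idea (this is not the converter from this post), the sketch below builds a NaN placeholder with `np.broadcast_to`, which is swapped in here for `as_strided` because it produces the same zero strides. The shape is an assumption, chosen to match 1000 lists of 2000 numbers: 1001 offsets by 2,000,000 content values.

```python
import numpy as np

# One NaN scalar presented as a (1001, 2_000_000) array without allocating it.
placeholder = np.broadcast_to(np.float64(np.nan), (1001, 2_000_000))

# Every stride is zero: all elements alias the same 8 bytes.
print(placeholder.strides)

# nbytes is still shape * itemsize, about 16 GB, although almost nothing is allocated.
print(placeholder.nbytes)

# Any element reads back as NaN.
print(np.isnan(placeholder[500, 12345]))
```

The reported `nbytes` is 16,016,000,000, which matches the 16.016 figure asked about elsewhere in this thread.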
Here's a converter from Awkward list-type arrays (of arbitrary depth, but only lists) to xarray:
The xarrays made this way don't look like normal arrays, and they shouldn't. The wall of `nan` is a hint that this is not a normal array.
Doing Awkward-style slices (and other methods)
Now here's an accessor that reconstructs the Awkward Array (maybe it should only be allowed to succeed if the data consist of zero-strided NaNs?). It also provides methods like `__getitem__` that lift from xarray into Awkward, perform the slice, and then convert back to xarray.
So any slice that could have been performed on the Awkward Array can now be performed on the xarray as well, as long as we go through the `ak` accessor:
Since these conversions between xarray and Awkward would be happening frequently, it's important that they are zero-copy.
Similarly,
and
Extensions
Now I'm getting greedy again: I don't want to be limited to only lists of (lists of...) numbers, but also regular-length dimensions, nested records, missing data, and all that. Since xarrays can hold any number of different-length coordinates, maybe we can unpack arbitrary arrays:
but then the coordinate names would have less meaning. These three can be identified as nested-list dimensions, but if the array has any missing data, there would be additional "`coords`" for the masks; if it has any regular dimensions, there wouldn't be corresponding "`coords`" for those dimensions; if it has nested record fields, there would be a lot of "`coords`"; etc.
So it's a question of how closely the "`coords`" need to correspond to actual coordinates. In the previous example, xarray `x` had as many dimensions as the array it represented, but the lengths of those dimensions didn't have a direct relationship with the ragged array. Other than the first dimension, they can't, because the array is ragged. (And if you make the first dimension be just the stopping indexes of each list, rather than fence-posts between all the lists, then converting back to an Awkward Array can't be zero-copy.)
Metadata
The example I showed above preserves the names of some of the axes by putting the xarray `axis` names into Awkward `parameters`, performing the slice, and then pulling them back out. The name `"second"` was lost (replaced with `"dim_1"` by the code written above) because it was rewritten by the `[0, -1]` part of the slice. Although that's a natural consequence of one layout node being replaced by another, it's probably not what we want here.
https://github.com/scikit-hep/awkward/issues/1391 is a still-open request for Awkward to handle metadata better: to preserve it and propagate it through calculations in a way that is appropriate for xarray. Actually adding that Awkward feature depends on how it will be used in conversions to and from xarray, so that PR is interdependent with this project.
Thoughts?
What's good and what's bad?
Cc: @TomNicholas, @joshmoore, who were also on the email.