Open jni opened 3 years ago
Interesting question, thanks @jni. My first thought is that DataArray is not actually a "higher-dimensional dataframe". A key thing for dataframes is that each column can have a separate dtype, and as a result the conceptual model of a dataframe as an ordered set of 1-D columns is quite natural. A DataArray, on the other hand, must have a uniform dtype, and therefore that "ordered set of columns" doesn't seem to fit well.
Unless I'm missing something, a DataArray is much closer to an ndarray/tensor than to a dataframe - in particular for the purposes of data exchange between libraries, where we're very interested in memory layout. The conceptual model for a DataArray I'd use is:
If there were multiple libraries with a DataArray-like data structure, allowing dimension names to be added to __dlpack__ seems like a more natural step, and probably all that would be needed?
I agree that a top-level DataArray analogous to a DataFrame doesn't make sense in the context of the DataFrame protocol.
I interpreted the question as wanting to store n-dimensional data in a column of a DataFrame, where presumably the first dimension is equal to the number of rows in the DataFrame. This sounds very reasonable and a worthwhile extension to support in the future. This could be supported via something similar to the Arrow FixedSizeListArray: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html
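To make the fixed-size-list idea a bit more concrete, here is a minimal NumPy-only sketch of the memory layout it implies: one flat values buffer, viewed so that the first dimension equals the number of rows. It does not use pyarrow, and the helper name `as_fixed_shape_column` and the example shapes are made up for illustration.

```python
import numpy as np


def as_fixed_shape_column(flat_values, n_rows, item_shape):
    """View a flat 1-D buffer as an n-dimensional column (hypothetical helper).

    Each of the n_rows rows holds one item of shape item_shape, so the
    first dimension of the result equals the number of DataFrame rows.
    """
    flat_values = np.asarray(flat_values)
    expected = n_rows * int(np.prod(item_shape))
    if flat_values.ndim != 1 or flat_values.size != expected:
        raise ValueError(f"expected a 1-D buffer of {expected} values")
    # Reshaping a contiguous 1-D array returns a view, not a copy.
    return flat_values.reshape((n_rows, *item_shape))


# Ten rows, each holding a 3x4 block of float64 values.
flat = np.arange(10 * 3 * 4, dtype=np.float64)
column = as_fixed_shape_column(flat, n_rows=10, item_shape=(3, 4))
```

Because the column is only a view of the flat buffer, handing it to an array library would not require a copy in this no-nulls case.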
> This could be supported via something similar to the Arrow FixedSizeListArray
Should the two be considered conceptually similar? For example, would operations on n-dimensional FixedSizeListArrays follow the same semantics (e.g., broadcasting) as arrays as defined by the array API standard?
> I interpreted the question as wanting to store n-dimensional data in a column of a DataFrame, where presumably the first dimension is equal to the number of rows in the DataFrame
Ah, I can see that as being feasible. In that case it hasn't got much to do with xarray.DataArray anymore, right? The one label that's then present is the column label, so the n-dim column just has data with a homogeneous dtype, i.e. it's a regular array?
This seems similar to numpy record arrays where individual items can be multidimensional arrays - e.g.
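>>> import numpy as np
>>> # np.zeros rather than np.empty, so the values shown below are guaranteed
>>> recarray = np.zeros(10, dtype=[('x', np.int64), ('y', np.float64, (3, 4)), ('z', str)])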
>>> recarray[0]
(0, [[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]], '')
>>> recarray[0][0]
0
>>> recarray[0][1]
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
>>> recarray['x'].shape
(10,)
>>> recarray['y'].shape
(10, 3, 4)
IIUC this isn't an often used feature, but it can be very powerful/expressive so I think it would be worthwhile to support if it doesn't add too much complexity.
> the conceptual model of a dataframe as an ordered set of 1-D columns is quite natural
In this world, the DataFrame would instead be an ordered/labelled set of arrays, each with a homogeneous dtype - i.e. you'd just drop the 1D requirement.
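As a rough illustration of "an ordered/labelled set of arrays with the 1D requirement dropped", the sketch below uses a hypothetical `ArrayFrame` class. The class and its method names are invented for this example and are not part of any proposed protocol; the only invariant it enforces is that every column shares the same first dimension.

```python
import numpy as np


class ArrayFrame:
    """Ordered, labelled set of arrays sharing their first dimension (hypothetical)."""

    def __init__(self, columns):
        self._columns = {}
        self._num_rows = None
        for name, values in columns.items():
            arr = np.asarray(values)
            if arr.ndim < 1:
                raise ValueError(f"column {name!r} must have at least one dimension")
            if self._num_rows is None:
                self._num_rows = arr.shape[0]
            elif arr.shape[0] != self._num_rows:
                raise ValueError(
                    f"column {name!r} has {arr.shape[0]} rows, expected {self._num_rows}"
                )
            # Each column keeps its own homogeneous dtype and its own trailing shape.
            self._columns[name] = arr

    def num_rows(self):
        return self._num_rows or 0

    def column_names(self):
        return list(self._columns)

    def get_column(self, name):
        return self._columns[name]


# Same layout as the structured-array example: a 1-D column and a (3, 4) column.
frame = ArrayFrame({
    "x": np.zeros(10, dtype=np.int64),
    "y": np.zeros((10, 3, 4), dtype=np.float64),
})
```

Here `frame.get_column("x")` is an ordinary 1-D column, while `frame.get_column("y")` has shape `(10, 3, 4)`; everything else about the frame is unchanged.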
@rgommers @kkraus14 note that I proposed that DataArrays would be conceptually equivalent to columns. An xarray Dataset would be equivalent to a DataFrame. I.e., as @dhirschfeld notes, we are merely dropping the 1D requirement of a column; everything else remains the same.
However, I'm not familiar enough with xarray indexing semantics to understand further implications, e.g. do indices now have to have as many dimensions as the highest-dimensional DataArray in the Dataset?
> Should the two be considered conceptually similar? For example, would operations on n-dimensional FixedSizeListArrays follow the same semantics (e.g., broadcasting) as arrays as defined by the array API standard?
I would argue no, with the caveat that we should enable going to array libraries zero-copy if possible in these situations. I think the fact that we can have nulls at any level of the array makes them different enough that we shouldn't implement broadcasting. Additionally, primitive-typed columns without nulls are functionally equivalent to a 1-D array, and broadcasting isn't supported on them.
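One way to picture the point about nulls is the sketch below. It assumes a simplified layout (a flat values buffer plus one validity mask per row and one per element) and a hypothetical helper `to_ndarray_zero_copy`; this is not Arrow's actual API, only an illustration of why the zero-copy path is limited to the null-free case.

```python
import numpy as np


def to_ndarray_zero_copy(values, row_valid, element_valid, list_size):
    """Return a (n_rows, list_size) view of values, or None if any null exists.

    Hypothetical helper: row_valid marks whole rows as present, while
    element_valid marks individual values inside the rows as present.
    """
    if not (row_valid.all() and element_valid.all()):
        # A plain ndarray has no slot for nulls, so no zero-copy view is possible.
        return None
    return values.reshape(-1, list_size)


values = np.arange(6, dtype=np.float64)

# No nulls at either level: the column is just a 3x2 array over the same memory.
no_nulls = to_ndarray_zero_copy(values, np.ones(3, bool), np.ones(6, bool), 2)

# The second row is null: there is no faithful ndarray view of this column.
row_valid = np.array([True, False, True])
with_null = to_ndarray_zero_copy(values, row_valid, np.ones(6, bool), 2)
```

In the second case a consumer would have to choose a fill value or carry the masks along, which is where the semantics start to diverge from those of a regular array.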
Sounds like the request is for n-dimensional columns, which sounds reasonable and in scope for the project once we start tackling nested types more generally.
@jni I would also be interested in this. scipp relies heavily on a DataArray type (similar to xarray.DataArray but with some differences). From my point of view it is hard to define Dataset/DataFrame directly based on arrays. I feel there is (at least) one intermediate conceptual level:
var['temperature', 4].
Without items 2.) and 3.) there is a big gap to bridge between 1.) and 4.), which is probably ok for the pandas-style DataFrame, but might limit the usefulness of a standard for non-1-d applications.
Have items 2.) and 3.) been discussed anywhere? I am a bit late to the party and am trying to catch up with some reading...
This is not even half-baked, but I wanted to gauge interest/feasibility for the spec to encapsulate n-dimensional "columns" of data, equivalent to xarray's DataArrays. In that case, the currently-envisioned columns would be the 1D specific case of a higher-D general case. We've found that in some use cases we need these in napari (napari/napari#2592, napari/napari#2917), and it would be awesome to conform to the dataframe API and be compatible with both xarray and pandas.
Of course the other way around this is to ignore the higher-D libraries, and have them conform to the API once it's settled. That might be more reasonable, in which case, I'm perfectly happy for this to be closed. 😊