data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Higher-dimensional "columns" #59

Open jni opened 3 years ago

jni commented 3 years ago

This is not even half-baked, but I wanted to gauge interest/feasibility for the spec to encapsulate n-dimensional "columns" of data, equivalent to xarray's DataArrays. In that case, the currently-envisioned columns would be the 1D specific case of a higher-D general case. We've found that in some use cases we need these in napari (napari/napari#2592, napari/napari#2917), and it would be awesome to conform to the dataframe API and be compatible with both xarray and pandas.

Of course the other way around this is to ignore the higher-D libraries, and have them conform to the API once it's settled. That might be more reasonable, in which case, I'm perfectly happy for this to be closed. 😊

rgommers commented 3 years ago

Interesting question, thanks @jni. My first thought is that DataArray is not actually a "higher-dimensional dataframe". A key thing for dataframes is that each column can have a separate dtype, and as a result the conceptual model of a dataframe as an ordered set of 1-D columns is quite natural. A DataArray on the other hand must have a uniform dtype, and therefore that "ordered set of columns" doesn't seem to fit well.

Unless I'm missing something, a DataArray is much closer to an ndarray/tensor than to a dataframe - in particular for the purposes of data exchange between libraries, where we're very interested in memory layout. The conceptual model for a DataArray I'd use is:

  1. An n-dimensional array
  2. Allow using string names for each dimension
  3. Add some dataframe-like APIs for using those string names for, e.g., indexing

If there were multiple libraries with a DataArray-like data structure, allowing dimension names to be added to __dlpack__ seems like a more natural step, and probably all that would be needed?
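
As a rough sketch of that conceptual model (the dimension names and values below are made up, not part of any API):

import numpy as np
import xarray as xr

# (1) an n-dimensional array, (2) with string names for its dimensions,
# (3) plus dataframe-like indexing using those names
da = xr.DataArray(
    np.zeros((3, 4)),
    dims=("time", "channel"),
    coords={"time": [0.0, 0.5, 1.0]},
)
da.isel(channel=2)  # positional indexing by dimension name
da.sel(time=0.5)    # label-based indexing via the "time" coordinate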

kkraus14 commented 3 years ago

I agree that a top level DataArray analogous to a DataFrame doesn't make sense in the context of the DataFrame protocol.

I interpreted the question as wanting to store n-dimensional data in a column of a DataFrame, where presumably the first dimension is equal to the number of rows in the DataFrame. This sounds very reasonable and a worthwhile extension to support in the future. This could be supported via something similar to the Arrow FixedSizeListArray: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html
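
As a rough sketch of how that could look with pyarrow today (the column name and sizes are made up):

import numpy as np
import pyarrow as pa

# A column of fixed-size 3-vectors: 4 rows, each holding 3 float64 values,
# so the first dimension equals the number of rows.
flat = pa.array(np.arange(12, dtype=np.float64))
vectors = pa.FixedSizeListArray.from_arrays(flat, 3)
table = pa.table({"position": vectors})
# table.column("position")[0] is the first row's 3-vector [0, 1, 2]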

shwina commented 3 years ago

This could be supported via something similar to the Arrow FixedSizeListArray

Should the two be considered conceptually similar? For example, would operations on n-dimensional FixedSizeListArrays follow the same semantics (e.g., broadcasting) as arrays as defined by the array API standard?

rgommers commented 3 years ago

I interpreted the question as wanting to store n-dimensional data in a column of a DataFrame, where presumably the first dimension is equal to the number of rows in the DataFrame

Ah, I can see that being feasible. In that case it doesn't have much to do with xarray.DataArray anymore, right? The one label that's then present is the column label, so the n-dim column just holds data with a homogeneous dtype, i.e. it's a regular array?

dhirschfeld commented 3 years ago

This seems similar to numpy record arrays where individual items can be multidimensional arrays - e.g.

>>> import numpy as np
>>> recarray = np.zeros(10, dtype=[('x', np.int64), ('y', np.float64, (3, 4)), ('z', 'U16')])
>>> recarray[0]
(0, [[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]], '')
>>> recarray[0][0]
0
>>> recarray[0][1]
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
>>> recarray['x'].shape
(10,)
>>> recarray['y'].shape
(10, 3, 4)

IIUC this isn't an often-used feature, but it can be very powerful/expressive, so I think it would be worthwhile to support if it doesn't add too much complexity.

the conceptual model of a dataframe as an ordered set of 1-D columns is quite natural

In this world, the DataFrame would instead be an ordered/labelled set of arrays, each with a homogeneous dtype - i.e. you'd just drop the 1D requirement.
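
Roughly, and purely as an illustration with a plain dict of numpy arrays:

import numpy as np

# A "frame" as an ordered mapping of arrays that only share the row dimension;
# each "column" has a homogeneous dtype but is not restricted to 1-D.
frame = {
    "x": np.zeros(10, dtype=np.int64),            # 1-D column
    "y": np.zeros((10, 3, 4), dtype=np.float64),  # 3-D "column"
}
assert len({col.shape[0] for col in frame.values()}) == 1  # shared row count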

jni commented 3 years ago

@rgommers @kkraus14 note that I proposed that DataArrays would be conceptually equivalent to columns. An xarray Dataset would be equivalent to a DataFrame, i.e., as @dhirschfeld notes, we are merely dropping the 1D requirement of a column; everything else remains the same.
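
For example, roughly (the variable and dimension names are made up):

import numpy as np
import xarray as xr

# A Dataset whose variables share the "row" dimension but differ in
# dimensionality, i.e. "columns" that are not restricted to 1-D.
ds = xr.Dataset(
    {
        "label": ("row", np.arange(5)),
        "image": (("row", "y", "x"), np.zeros((5, 32, 32))),
    },
    coords={"row": np.arange(5)},
)
ds["label"].shape  # (5,)
ds["image"].shape  # (5, 32, 32)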

However, I'm not familiar enough with xarray indexing semantics to understand further implications, e.g. do indices now have to have as many dimensions as the highest-dimensional DataArray in the Dataset?

kkraus14 commented 3 years ago

Should the two be considered conceptually similar? For example, would operations on n-dimensional FixedSizeListArrays follow the same semantics (e.g., broadcasting) as arrays as defined by the array API standard?

I would argue no, with the caveat that we should enable going to array libraries zero-copy if possible in these situations. I think the fact that we can have nulls at any level of the array makes them different enough that we shouldn't implement broadcasting. Additionally, primitive-typed columns without nulls are functionally equivalent to a 1-D array, and broadcasting isn't supported on them either.
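
For instance, roughly what the zero-copy situation looks like with pyarrow today (illustrative only):

import pyarrow as pa

col = pa.array([1.0, 2.0, 3.0])
col.to_numpy(zero_copy_only=True)  # fine: primitive type, no nulls

col_with_nulls = pa.array([1.0, None, 3.0])
# col_with_nulls.to_numpy(zero_copy_only=True)  # raises: nulls force a copy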

It sounds like the request is for n-dimensional columns, which seems reasonable and in scope for the project once we start tackling nested types more generally.

SimonHeybrock commented 3 years ago

@jni I would also be interested in this. scipp relies heavily on a DataArray type (similar to xarray.DataArray but with some differences). From my point of view it is hard to define Dataset/DataFrame directly based on arrays. I feel there is (at least) one intermediate conceptual level:

  1. Plain arrays, as defined in the array API.
  2. Labelled dimensions (mapping dimension name to axis index), which are used for slicing, something like var['temperature', 4] (see the sketch after this list).
  3. Coordinates/indices for dimensions to label the axes and support label-based indexing.
  4. Dataset/DataFrame defined as dict-like with matching dim labels and indices. Could be:
    • 1-d arrays yield a pandas-style DataFrame
    • 2-d arrays yield a dict of, e.g., "images", which can then be sliced to obtain a classical table/dataframe.
    • Columns with mixed dimensionality.
    • ...
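
As a sketch of item 2.), something along these lines (hypothetical code, not the actual scipp or xarray API):

import numpy as np

class LabelledArray:
    """An array with named dimensions, supporting var['temperature', 4]-style slicing."""
    def __init__(self, data, dims):
        self.data = np.asarray(data)
        self.dims = tuple(dims)

    def __getitem__(self, key):
        dim, index = key                        # e.g. ('temperature', 4)
        axis = self.dims.index(dim)             # map dimension name -> axis
        slicer = [slice(None)] * self.data.ndim
        slicer[axis] = index
        return self.data[tuple(slicer)]

var = LabelledArray(np.zeros((10, 5)), dims=("temperature", "position"))
var["temperature", 4].shape  # (5,) -- the 'temperature' axis was sliced away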

Without items 2.) and 3.) there is a big gap to bridge between 1.) and 4.), which is probably ok for the pandas-style DataFrame, but might limit the usefulness of a standard for non-1-d applications.

Have items 2.) and 3.) been discussed anywhere? I am a bit late to the party and am trying to catch up with some reading...