jorisvandenbossche opened this issue 3 years ago
I think asking for a column / DataFrame as a single chunk is something reasonable (whether or not it's part of the standard or interchange protocol). If we had the ability to get the column as a single chunk, then the utility function becomes really straightforward, or just becomes something like a `np.asarray` call.
> If we had the ability to get the column as a single chunk
We already have this ability with the currently documented API, I think (the methods `get_column()`/`get_column_by_name()` and `num_chunks`/`get_chunks()` on the column should be sufficient to get a single-chunk column).
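To make the chunk handling concrete, here is a minimal sketch with a hypothetical stand-in column (the `ChunkedColumn` class and its internals are invented for illustration; only `num_chunks()`/`get_chunks()` mirror the documented API) - the consumer iterates the chunks and combines them itself:

```python
import numpy as np

# Toy stand-in for a protocol Column whose data lives in two chunks
# (hypothetical class; real producers expose the same two methods).
class ChunkedColumn:
    def __init__(self, chunk_arrays):
        self._chunk_arrays = chunk_arrays

    def num_chunks(self):
        return len(self._chunk_arrays)

    def get_chunks(self):
        # each element is itself a single-chunk column
        return (ChunkedColumn([a]) for a in self._chunk_arrays)

col = ChunkedColumn([np.array([1, 2]), np.array([3, 4])])
if col.num_chunks() > 1:
    # consumer-side combination: convert each chunk, then concatenate
    # (this copies - the producer never has to make the data contiguous;
    # accessing the private attribute is a shortcut for this toy only)
    parts = [chunk._chunk_arrays[0] for chunk in col.get_chunks()]
    combined = np.concatenate(parts)
```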
> the utility function becomes really straightforward or just becomes something like a np.asarray call.
A `np.asarray` call only works if we add something like `__array__` or `__array_interface__` to Column, which we currently don't specify (cfr https://github.com/data-apis/dataframe-api/issues/48).
In case you meant calling it on the individual Buffer, that in itself will become trivial once numpy supports dlpack, yes. You still need to handle the different buffers and dtypes etc. A quick attempt at a version with only limited functionality:
```python
def column_to_numpy_array(col):
    # for now only deal with single-chunk columns
    assert col.num_chunks() == 1
    kind, _, format_str, _ = col.dtype
    if kind not in (0, 1, 2, 22):  # int, uint, float, datetime
        raise TypeError("only numeric and datetime dtypes are supported")
    if col.describe_null[0] not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")
    buffer, dtype = col.get_buffers()["data"]
    # this can become `np.asarray` or `np.from_dlpack` in the future
    arr = buffer_to_ndarray(buffer, dtype)
    if kind == 22:  # datetime
        unit = format_str.split(":")[-1]
        arr = arr.view(f"datetime64[{unit}]")
    return arr
```
where `buffer_to_ndarray` is currently something like https://github.com/data-apis/dataframe-api/blob/27b8e1cb676bf10704d1dfc3dca0d0d806e2e802/protocol/pandas_implementation.py#L116, but in the future can become a single numpy call once numpy supports DLPack.
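For reference, the shape of such a helper - a simplified sketch assuming the buffer exposes `.ptr` and `.bufsize` attributes, as the linked prototype's `Buffer` does (the dtype-string argument is a simplification of the prototype's dtype tuple):

```python
import ctypes
import numpy as np

def buffer_to_ndarray(buffer, np_dtype):
    """Zero-copy wrap of the buffer's raw memory in a NumPy array via
    ctypes, similar in spirit to the linked prototype. Assumes the
    buffer exposes `.ptr` (raw pointer) and `.bufsize` (size in bytes);
    the caller must keep the owning data alive."""
    raw = (ctypes.c_byte * buffer.bufsize).from_address(buffer.ptr)
    return np.frombuffer(raw, dtype=np_dtype)

# demo with a stand-in buffer wrapping a numpy array's own memory
class _Buf:
    def __init__(self, arr):
        self._arr = arr
        self.ptr, self.bufsize = arr.ctypes.data, arr.nbytes

src = np.arange(4, dtype="float64")
view = buffer_to_ndarray(_Buf(src), "float64")
```

The wrap is zero-copy: `view` aliases the memory of `src` rather than copying it.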
That's certainly relatively straightforward code, but also dealing with a lot of details of the protocol, and IMO not something many end users should have to implement themselves.
> We already have this ability with the currently documented API, I think (the methods `get_column()`/`get_column_by_name()` and `num_chunks`/`get_chunks()` on the column should be sufficient to get a single-chunk column).
I meant more along the lines of given a column with multiple chunks, requesting the column to combine its chunks into a single chunk so that it has a contiguous buffer under the hood.
This is a nice example, thanks @jorisvandenbossche. I feel like `df.get_column_by_name("x").get_buffers()` is taking a wrong turn though - an end user library should indeed not need to deal with buffers directly.
_xref the `plot()` docs, see under "Plotting labelled data"_
I think this would work:

```python
df_obj = obj.__dataframe__().get_columns_by_name([x, y])
df = pd.from_dataframe(df_obj)
xvals = df[x].values
yvals = df[y].values
```
> Currently they require `obj["x"]` to give the desired data
That's not the actual requirement today I think - it's that `np.asarray(obj[x])` returns the data as a numpy array. Which is a fairly specific requirement - but even so it can be made to work just fine, on the condition that if the user uses the `data=obj` syntax, they have pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that `np.asarray` call succeeds.
> I am not a matplotlib developer, I also don't know if they for example have efforts to add support for generic array-likes (but it's nonetheless a typical example use case, I think)
I'm not sure about Matplotlib, but I do know that Napari would like this and has tried to improve compatibility with PyTorch and other libraries.
> on the condition that if the user uses the `data=obj` syntax, they have pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that `np.asarray` call succeeds.
IMO that's the big downside of your code snippet. As a pandas maintainer I of course don't mind that people need pandas :), but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas? Also, for the "guarantee that np.asarray call succeeds", that's basically something you can do based on the buffers in the interchange protocol (https://github.com/data-apis/dataframe-api/issues/66#issuecomment-918882569), if the original dataframe library doesn't support it directly. But then we get back to the point that ideally library users of the protocol shouldn't get down to the buffer level.
> but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Well, I see it as: the protocol supports turning one kind of dataframe into another kind, so as a downstream library if you support one specific library, you get all the other ones for free.
Really what Matplotlib wants here is: turn a single column into a `numpy.ndarray`. But if we support that, it should either be generic (like a potentially non-zero-copy way to use DLPack and/or the buffer protocol on a column), or we should support other array libraries too. Otherwise it's pretty ad-hoc imho.
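To illustrate the generic DLPack path on the consumer side (a NumPy array stands in here for a protocol column or buffer that grew a `__dlpack__` method; any DLPack-aware array library could sit on either end):

```python
import numpy as np

# np.from_dlpack consumes any object implementing __dlpack__ /
# __dlpack_device__. The import is zero-copy when the memory is
# already in a layout the consumer can use directly.
producer = np.arange(5, dtype="int64")
consumer = np.from_dlpack(producer)
```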
> but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Second thought: that is step two in the Consortium efforts - you need the generic public API, not just the interchange protocol. That's also what's said at https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html#progression-timeline.
We discussed this in a call, and the sentiment was that it would be very nice to have this Matplotlib use case work, and not have it wait for another API that is still to be designed.
For a column one can get from the dataframe interchange protocol, it would be very useful if that could be turned into an array (any kind of array which the consuming library - Matplotlib in this case - wants). Options to achieve that include:

- adding `__dlpack__` to the protocol (an open question is whether it should live on the column or the buffer level, and the same holds for `__array_interface__` et al.)
- a separate utility library; each array library can then use a `from_column` function there to create its own kind of array
- support in the array library itself (e.g. a `Column` -> `numpy.ndarray` path)

The separate utility library likely makes the most sense. Benefits are: this code then only has to be written once, it keeps things outside of the protocol/standard, and it can be made available fairly quickly (no need to wait for multiple array libraries to implement something and then do a release).
To make the code independent of any array or dataframe library, it may have to look something like:
```python
def array_from_column(
    df: DataFrame,
    column_name: str,
    xp: Any,  # object/namespace implementing the array API
) -> "array":
    """
    Produce an array from a column, if possible.

    Will raise a ValueError in case the column contains missing data or
    has a dtype that is not supported by the array API standard.
    """
```
It's likely also practical to have a separate `column_to_numpy` function, given that Matplotlib (a) wants a `numpy.ndarray` rather than the `numpy.array_api` array object, and (b) needs things to work with 2 year old numpy releases. If this is in a separate utility library and in no way directly incorporated in the standard, the objections to incorporating numpy-specific things should not apply here.
I think it's useful to think through concrete use cases on how the interchange protocol could be used, to see if it covers those use cases / the desired APIs are available. One example use case could be matplotlib's `plot("x", "y", data=obj)`, where matplotlib already supports getting the x and y column of any "indexable" object. Currently they require `obj["x"]` to give the desired data, but so in theory this support could be extended to any object that supports the dataframe interchange protocol. But at the same time, matplotlib currently also needs those data (AFAIK) as numpy arrays, because the low-level plotting code is implemented in such a way.

With the current API, matplotlib could do something like:
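Something along these lines (a sketch with hypothetical minimal stand-ins so it runs; `some_utility_func`, the `Col`/`DF`/`Obj` classes, and the buffer layout are all invented for illustration):

```python
import numpy as np

# Hypothetical placeholder for the conversion helper; a real one
# would interpret raw Buffer objects rather than numpy arrays.
def some_utility_func(buffers):
    data, dtype = buffers["data"]
    return np.asarray(data, dtype=dtype)

# minimal stand-ins for an object implementing the protocol
class Col:
    def __init__(self, arr): self._arr = arr
    def get_buffers(self): return {"data": (self._arr, self._arr.dtype.str)}

class DF:
    def __init__(self, d): self._d = d
    def get_column_by_name(self, name): return Col(self._d[name])

class Obj:
    def __init__(self, d): self._d = d
    def __dataframe__(self): return DF(self._d)

obj = Obj({"x": np.array([0.0, 1.0, 2.0]), "y": np.array([0.0, 1.0, 4.0])})

# the call pattern matplotlib could use:
xvals = some_utility_func(obj.__dataframe__().get_column_by_name("x").get_buffers())
yvals = some_utility_func(obj.__dataframe__().get_column_by_name("y").get_buffers())
```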
where `some_utility_func` can convert the dict of `Buffer` objects to a numpy array (once numpy supports DLPack, converting the Buffer objects to numpy will become easy, but the function will then still need to handle potentially multiple buffers returned from `get_buffers()`).

That doesn't seem ideal: 1) writing the `some_utility_func` to do the conversion to numpy is non-trivial to implement for all the different cases, 2) should an end-user library have to go down to the Buffer objects?

This isn't a pure interchange from one dataframe library to another, so we could also say that this use case is out-of-scope at the moment. But on the other hand, it seems a typical use case example, and could in theory already be supported right now (it only needs the "dataframe api" to get a column, which is one of the few things we already provide).
(disclaimer: I am not a matplotlib developer, and I also don't know if they, for example, have efforts to add support for generic array-likes (but it's nonetheless a typical example use case, I think))