I'd say that yes, adding `__dlpack__` is a good idea. It should raise for dtypes that aren't supported (string, datetime, missing values present), and it will work as expected otherwise with numpy (including the common `np.asarray(input_array)` pattern) and any other library that implements DLPack support.
Of course, longer-term it'd also be great if Matplotlib & co. gained support for Column objects directly. In the meantime I think there's a gap between array libraries, which have a library-independent protocol for array values, and dataframe libraries, which don't have a column/series equivalent. Another option there is to lean on the dataframe interchange protocol, which does have such support. So if Matplotlib sees something with `__dataframe__`, it could check if that has a single column and, if so, convert it for example to a pandas series. That'd be pretty pragmatic and could be implemented today.
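For illustration, a minimal sketch of that consumer-side check, assuming only pandas' interchange support (`pd.api.interchange.from_dataframe`, available since pandas 1.5); `plot_ready_series` is a hypothetical helper name:

```python
import pandas as pd

def plot_ready_series(obj):
    # Hypothetical consumer-side helper: accept any object that implements
    # the dataframe interchange protocol, as long as it holds one column.
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support __dataframe__")
    df = pd.api.interchange.from_dataframe(obj)  # pandas >= 1.5
    if df.shape[1] != 1:
        raise ValueError("expected a single-column dataframe")
    return df.iloc[:, 0]  # a pandas Series, ready to plot
```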
> So if Matplotlib sees something with `__dataframe__`, it could check if that has a single column and if so, convert it for example to a pandas series. That'd be pretty pragmatic and could be implemented today.
Right, thanks - I think this is good enough for now
If we wanted to be pragmatic, we would add `__array__` to the Column object (or a method like `to_numpy()` to explicitly convert to a numpy array), cf. https://github.com/data-apis/dataframe-api/issues/66
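A minimal sketch of that alternative, assuming the column's data lives in a hypothetical numpy-backed `_arr` attribute:

```python
import numpy as np

class Column:
    def __array__(self, dtype=None):
        # Called by np.asarray(column); hard-depends on numpy by design.
        return np.asarray(self._arr, dtype=dtype)

    def to_numpy(self):
        # Explicit spelling of the same conversion.
        return np.asarray(self._arr)
```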
@jorisvandenbossche I'll note that `__dlpack__` now gets you the same effect (because `np.asarray` understands it) without enforcing a hard dependency on numpy for every dataframe library. And if you already do depend on numpy, then the implementation of `__dlpack__` can be as simple as:
```python
class Column:
    def __dlpack__(self, *, stream=None):
        if stream is not None:
            raise NotImplementedError('my_df_lib does not support CUDA streams')
        # `_arr` is the numpy array that you'd want to return
        # if you had implemented __array__
        return self._arr.__dlpack__()
```
And then `Column` can document that users can use `xp.asarray` to obtain an array from a column for numpy and any other array library that supports the array API standard, and `from_dlpack` for the ones that don't.

Do you agree that that is pragmatic enough?
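Hedged usage sketch: note that `np.from_dlpack` (NumPy >= 1.23) requires `__dlpack_device__` alongside `__dlpack__`, so a complete implementation would forward that too:

```python
import numpy as np

# Assuming Column also forwards __dlpack_device__, e.g.:
#     def __dlpack_device__(self):
#         return self._arr.__dlpack_device__()  # (kDLCPU, 0) for numpy data
arr = np.from_dlpack(col)  # `col` is a Column as sketched above
assert isinstance(arr, np.ndarray)
```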
That only works for numeric data types, and for dataframe libraries that use DLPack-compatible memory under the hood, though? So for example, I don't think it works for datetimes or strings (either as numpy's fixed-width dtype or as object dtype)? Both are supported by matplotlib when using numpy arrays.
And what are the expectations around implementing `__dlpack__` if that conversion is not zero-copy (different memory layout or bit width, missing values with a bit/byte mask, ...)? For example, in the interchange protocol, `__dlpack__` is only available on the Buffer object (which avoids those questions), and not on the Column object.
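For reference, reaching that buffer-level `__dlpack__` through the interchange protocol looks roughly like this (a sketch following the interchange spec's buffer layout; `df` is any object with `__dataframe__`):

```python
# Column-level access goes through get_buffers(), which separates the data
# from any validity (missing-value) mask, sidestepping the zero-copy questions.
icol = df.__dataframe__().get_column(0)
buffers = icol.get_buffers()        # {"data": ..., "validity": ..., "offsets": ...}
data_buffer, data_dtype = buffers["data"]
capsule = data_buffer.__dlpack__()  # raw memory only, no null handling
```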
Of course for `__array__`/`to_numpy()` you have the same questions regarding zero-copy, but I think 1) historically `__array__` in practice already often does conversions involving copies, and 2) when using a method there could be keywords controlling those aspects of the behaviour.
> So for example, I don't think that works for datetimes and strings (either as numpy's fixed width dtype or as object dtype)? Both those are supported by matplotlib when using numpy arrays.
Ah, good point, that's a gap. There's a standards vs. pragmatism tension there. It'd be nice if that were solved with something that's in principle library-independent (e.g., `__column__`, so you could do `pd.Series(col).to_numpy()`). Couldn't we add that fairly easily? We already have 95% of the code needed for implementations; it'd just need the object that is already backing `__dataframe__().get_column()` & co.
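A hypothetical sketch of what `__column__` could look like; `_interchange_column` stands in for whatever object already backs the interchange protocol:

```python
class Column:
    def __column__(self):
        # Hypothetical protocol method, analogous to __dataframe__ but for a
        # single column: hand out the interchange-level column object that
        # already backs __dataframe__().get_column().
        return self._interchange_column
```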
Perhaps let's talk about this tomorrow
Rethinking about:

> Another option there is to lean on the dataframe interchange protocol

Sure, but in that case we wouldn't need to do any work on the Standard if they're just going to use the interchange protocol directly?
For matplotlib, I think all they need is something they can iterate over. E.g. this can be plotted:
```python
import matplotlib.pyplot as plt
import numpy as np

class MyIter:
    def __init__(self, arr):
        self.arr = arr

    def __getitem__(self, idx):
        return self.arr[idx]

    def __len__(self):
        return len(self.arr)

myiter = MyIter(np.array([1, 2, 3]))
fig, ax = plt.subplots()
ax.plot(myiter)
```
We're explicitly ruling out letting consumers iterate over elements in a `Column`, so passing a `Column` to `matplotlib` wouldn't work.

Perhaps we just need a `to_iterable` method?
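If it were added, `to_iterable` could be as small as this (a hypothetical sketch; `_arr` is the backing array):

```python
class Column:
    def to_iterable(self):
        # Hypothetical method: yield values as Python objects. A generator
        # makes the potentially slow, element-wise nature explicit (think
        # device-to-host copies for GPU-backed columns).
        for i in range(len(self._arr)):
            yield self._arr[i]
```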
> There's a standards vs. pragmatism tension there. It'd be nice if that was solved with something that's in principle library-independent (e.g., `__column__`, so you could do `pd.Series(col).to_numpy()`). Couldn't we add that fairly easily?
That example doesn't make it dataframe-library independent, as matplotlib would then still need to use some specific dataframe library (pandas in your example) to get the actual data, while all it wants is an array. I think a goal should be that libraries like matplotlib can accept any dataframe-like object without having to rely on a specific one being installed.
Tapping into Marco's latest comment, it also doesn't necessarily need to hardcode numpy. We could also have a `to_array()` method that ensures you get back "some" object that has array interfaces (e.g. `__dlpack__`), so that someone could do `np.asarray(col.to_array())` if they know they want a numpy array.
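A hypothetical sketch of that `to_array()` on a numpy-backed implementation; a GPU library would return its own array type instead:

```python
class PandasLikeColumn:
    def to_array(self):
        # Hypothetical: return the backing array object as-is. A numpy array
        # here; cuDF could return a cupy array, etc. The consumer converts
        # with its array library of choice, e.g. np.asarray(col.to_array()).
        return self._arr
```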
> We could also have a `to_array()` method that ensures you get back "some" object that has array interfaces (e.g. `__dlpack__`), so that someone could do `np.asarray(col.to_array())`
I think I quite like this idea. That allows pandas to return a numpy array, cuDF a cupy array, and so on. The main follow-up question I have here is: what guarantees do we give about the returned array object? `__dlpack__` seems like it should be present, but if it's a non-numeric/bool column then that needs to not be there. Calling `asarray()` from an array library on it should do the right thing as much as possible. Is that the only thing that's allowed? Or should it add `__array_namespace__` for even more generality?
I think there are two separate things folks may want here:

1. Return an object guaranteed to be array-like from the perspective of supporting the array interchange protocol.
2. Return an object guaranteed to be array-like from the perspective of supporting the array API.

The two could be the same object or they could be different objects. E.g. you could imagine a distributed library that has (2) return a distributed array implementation, whereas (1) guarantees local memory.
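A hypothetical consumer-side way to tell the two apart, using only `hasattr` probes on the standard dunder names:

```python
def classify_array_like(obj):
    # (1) interchange: the object can hand over its memory via DLPack
    supports_interchange = hasattr(obj, "__dlpack__")
    # (2) array API: the object is a full standard-compliant array
    supports_array_api = hasattr(obj, "__array_namespace__")
    return supports_interchange, supports_array_api
```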
Right, let's try to get this in, as it's a fairly important one. We can always revisit later if what we get into the first version isn't good enough
> We could also have a `to_array()` method that ensures you get back "some" object that has array interfaces (e.g. `__dlpack__`), so that someone could do `np.asarray(col.to_array())` if they know they want a numpy array.
Concretely, what method(s) does the return value need to have? You wrote above that `__dlpack__` would be too limiting because it would only allow for numeric datatypes.
> I think there's two separate things folks may want here:
That's a good point. For (1) I think `xp.asarray(df.to_array())` should work. (2) is a lot more work I guess, and not everyone may want to implement that. But if it is implemented, it should be indicated by the presence of `__array_namespace__`, and then the complete standard should work.
> Concretely, what method(s) does the return value need to have? You wrote above that `__dlpack__` would be too limiting because it would only allow for numeric datatypes.
I think we should focus on (1) - interchange to an actual array object containing the data. For that, I'd say the primary spec should be "`asarray(df.to_array())` has to work whenever possible". Methods are tricky; I'm thinking:

- `__dlpack__` is mandatory for `bool` and numerical data types
- `__array__`, `__cuda_array_interface__`, `__array_interface__`
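A hypothetical consumer-side fallback chain over the methods listed above might look like:

```python
import numpy as np

def column_to_numpy(arr_like):
    # Hypothetical helper: prefer zero-copy DLPack for bool/numeric data,
    # fall back to whatever np.asarray can handle (__array__,
    # __array_interface__, the buffer protocol, or plain sequences).
    if hasattr(arr_like, "__dlpack__") and hasattr(arr_like, "__dlpack_device__"):
        return np.from_dlpack(arr_like)
    return np.asarray(arr_like)
```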
> `__dlpack__` is mandatory for `bool` and numerical data types
does it work for bool?
```python
In [19]: np.array([True, True]).__dlpack__()
---------------------------------------------------------------------------
BufferError                               Traceback (most recent call last)
Cell In [19], line 1
----> 1 np.array([True, True]).__dlpack__()

BufferError: DLPack only supports signed/unsigned integers, float and complex dtypes.
```
argh, `bool` dtype support was only recently added: https://github.com/dmlc/dlpack/issues/75, and NumPy still needs an update. Okay, scratch `bool` then, unfortunately.
The `to_array()` method could also have a `null_value` (or similar) keyword, to indicate which value to use instead of null (since arrays don't support nulls).
And maybe also a target dtype? (Although the question then becomes how to specify this dtype, unless that's something the array API spec has resolved?) For example, if you want integers for datetime64 (to have DLPack support), want to specify object dtype for strings (although that's probably too numpy-specific?), or float for integers with missing values.
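Combining both suggestions, a hypothetical signature (names illustrative only) could be:

```python
class Column:
    def to_array(self, *, null_value=None, dtype=None):
        # Hypothetical signature, not part of any spec.
        # null_value: scalar substituted for missing values
        #             (arrays have no notion of null).
        # dtype:      requested result dtype, e.g. a float dtype for an
        #             integer column that contains missing values.
        ...
```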
Let's take the conversation on `to_array` over to https://github.com/data-apis/dataframe-api/issues/139 and keep this one focused on `to_iterable`.
Regarding `to_iterable` - I'm not really sure it's necessary. If we have `Column.__getitem__` and `Column.__len__`, then people can iterate over the elements manually however they want (if they really need to), and by keeping it out of the standard we don't risk encouraging inefficient patterns.
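I.e. with just those two dunders, a consumer can already do:

```python
# Manual iteration using only __getitem__ and __len__
# (`col` is any Column implementing both):
values = [col[i] for i in range(len(col))]
```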
> Regarding `to_iterable` - I'm not really sure it's necessary. If we have `Column.__getitem__` and `Column.__len__`, then people can iterate over the elements manually however they want (if they really need to), and by keeping it out of the standard we don't risk encouraging inefficient patterns.
I think before we can commit to this we need to have alignment on what type is returned from `Column.__getitem__`. E.g. for cuDF, if you call `__getitem__` on a `cudf.Series` with a scalar index, it returns a `cudf.Scalar`, which allows taking some GPU fast paths if you feed that scalar into other cuDF APIs taking scalars.
> then it returns a `cudf.Scalar`
this sounds fine, we can note that it's implementation-specific
Sounds fine to me too - and that is generically true for any Python objects (scalars, tuples, etc.); they can (almost?) always be replaced by duck-type-compatible objects that the library prefers.
It was brought up in the last call that there's probably a need to be able to access the values inside a `Column` - for example, if passing a `DataFrame` to a plotting library. The question probably is - how to do that?

Should each `Column` have a `__dlpack__` method, so that one can call e.g. `np.from_dlpack(column)` and get a numpy array they can pass to `matplotlib`?