data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

to_iterable #109

Closed MarcoGorelli closed 1 year ago

MarcoGorelli commented 1 year ago

It was brought up in the last call that there's probably a need to be able to access the values inside a Column - for example, if passing a DataFrame to a plotting library, to be able to do:

ax.plot(df.get_column_by_name('x'), df.get_column_by_name('y'))

The question is - how to do that?

Should each Column have a __dlpack__ method, so that one can call e.g. np.from_dlpack(column) and get a numpy array to pass to matplotlib?

rgommers commented 1 year ago

I'd say that yes, adding __dlpack__ is a good idea. It should raise for dtypes that aren't supported (string, datetime) or when missing values are present, and it will work as expected otherwise with numpy (including the common np.asarray(input_array) pattern) and any other library that implements DLPack support.

Of course, longer-term it'd be also great if Matplotlib & co. gained support for Column objects directly. In the meantime I think there's a gap between array libraries which have a library-independent protocol for array values, and dataframe libraries which don't have a column/series equivalent. Another option there is to lean on the dataframe interchange protocol, which does have support. So if Matplotlib sees something with __dataframe__, it could check if that has a single column and if so, convert it for example to a pandas series. That'd be pretty pragmatic and could be implemented today.
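
For illustration, a consumer-side sketch of that idea (the helper name is hypothetical, and it assumes pandas >= 1.5 for pd.api.interchange.from_dataframe):

import pandas as pd

def single_column_as_series(obj):
    # sketch: accept any object implementing the dataframe interchange
    # protocol, check that it has exactly one column, then convert via pandas
    interchange_df = obj.__dataframe__()
    if interchange_df.num_columns() != 1:
        raise TypeError("expected a single-column dataframe-like object")
    # pandas >= 1.5 can build a DataFrame from any interchange object
    return pd.api.interchange.from_dataframe(obj).iloc[:, 0]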

MarcoGorelli commented 1 year ago

So if Matplotlib sees something with __dataframe__, it could check if that has a single column and if so, convert it for example to a pandas series. That'd be pretty pragmatic and could be implemented today.

Right, thanks - I think this is good enough for now

jorisvandenbossche commented 1 year ago

If we wanted to be pragmatic, we would add __array__ to the Column object (or a method like to_numpy() to explicitly convert to a numpy array), cf. https://github.com/data-apis/dataframe-api/issues/66

rgommers commented 1 year ago

@jorisvandenbossche I'll note that __dlpack__ now gets you the same effect (because np.asarray understands it) without enforcing a hard dependency on numpy for every dataframe library. And if you already do depend on numpy, then the implementation of __dlpack__ can be as simple as:

class Column:
    def __dlpack__(self, *, stream=None):
        if stream is not None:
            raise NotImplementedError('my_df_lib does not support CUDA streams')

        # `_arr` is the numpy array that you'd want to return if you had
        # implemented __array__
        return self._arr.__dlpack__()

    def __dlpack_device__(self):
        # required alongside __dlpack__; consumers such as np.from_dlpack
        # call this first to find out where the data lives
        return self._arr.__dlpack_device__()

And then Column can document that users can use xp.asarray to obtain an array from a column with numpy or any other array library that supports the array API standard, and from_dlpack for the libraries that don't.
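
For example (a sketch, assuming col is a Column implementing __dlpack__ and __dlpack_device__ as above; np.from_dlpack is available in numpy >= 1.22):

import numpy as np

# `col` is assumed to be a Column as sketched above
arr = np.from_dlpack(col)  # typically a zero-copy view of the column's data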

Do you agree that that is pragmatic enough?

jorisvandenbossche commented 1 year ago

That only works for numeric data types and for dataframe libraries that use DLPack-compatible memory under the hood, though?

So for example, I don't think that works for datetimes and strings (either as numpy's fixed-width dtype or as object dtype)? Both of those are supported by matplotlib when using numpy arrays.

And what are the expectations around implementing __dlpack__ if that conversion is not zero-copy? (different memory layout or bit width, missing values with a bit/byte mask, ...) For example, in the interchange protocol, __dlpack__ is only available on the Buffer object (which avoids those questions), and not on the Column object. Of course for __array__/to_numpy(), you have the same questions regarding zero-copy, but I think 1) historically __array__ in practice already often does conversions involving copies, and 2) when using a method there could be keywords controlling those aspects of the behaviour.

rgommers commented 1 year ago

So for example, I don't think that works for datetimes and strings (either as numpy's fixed width dtype or as object dtype)? Both those are supported by matplotlib when using numpy arrays.

Ah, good point, that's a gap. There's a standards vs. pragmatism tension there. It'd be nice if that was solved with something that's in principle library-independent (e.g., __column__ so you could do pd.Series(col).to_numpy()). Couldn't we add that fairly easily? We already have 95% of the code needed for implementations; it'd just need to return the object that is already backing __dataframe__().get_column() & co.

MarcoGorelli commented 1 year ago

Perhaps let's talk about this tomorrow

Rethinking this:

Another option there is to lean on the dataframe interchange protocol

Sure, but in that case we wouldn't need to do any work on the Standard, if they're just going to use the interchange protocol directly?

For matplotlib, I think all they need is something they can iterate over. E.g. this can be plotted:

import matplotlib.pyplot as plt
import numpy as np

class MyIter:
    # minimal sequence-like wrapper: matplotlib only needs
    # __getitem__ and __len__ to treat this as plottable data
    def __init__(self, arr):
        self.arr = arr

    def __getitem__(self, idx):
        return self.arr[idx]

    def __len__(self):
        return len(self.arr)

myiter = MyIter(np.array([1, 2, 3]))
fig, ax = plt.subplots()
ax.plot(myiter)

[image: the resulting matplotlib line plot]

We're explicitly ruling out letting consumers iterate over elements in a Column, so passing a Column to matplotlib wouldn't work

Perhaps we just need a to_iterable method?
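
A minimal sketch of what that could look like (hypothetical, assuming the column is backed by some iterable array _arr):

class Column:
    def to_iterable(self):
        # hypothetical sketch: yield the elements backing the column,
        # one Python scalar at a time
        for value in self._arr:
            yield value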

jorisvandenbossche commented 1 year ago

There's a standards vs. pragmatism tension there. It'd be nice if that was solved with something that's in principle library-independent (e.g., __column__ so you could do pd.Series(col).to_numpy()). Couldn't we add that fairly easily?

That example doesn't make it dataframe-library independent, as matplotlib would then still need to use some specific dataframe library (pandas in your example) to get at the actual data, while all it wants is an array. I think a goal should be that libraries like matplotlib can accept any dataframe-like object without having to rely on a specific one being installed.

Picking up on Marco's latest comment: it also doesn't necessarily need to hardcode numpy. We could also have a to_array() method that ensures you get back "some" object that has array interfaces (e.g. __dlpack__), so that someone could do np.asarray(col.to_array()) if they know they want a numpy array.
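
A sketch of that shape (hypothetical; it assumes each implementation simply hands back its native array object):

class Column:
    def to_array(self):
        # hypothetical sketch: return whatever array object backs the
        # column (a numpy array for pandas, a cupy array for cuDF, ...);
        # the only contract is that the result exposes __dlpack__
        return self._arr

# a consumer that specifically wants numpy could then do:
#     arr = np.asarray(col.to_array())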

rgommers commented 1 year ago

We could also have a to_array() method that ensures you get back "some" object that has array interfaces (e.g. __dlpack__), so that someone could do np.asarray(col.to_array())

I think I quite like this idea. That allows pandas to return a numpy array, cuDF a cupy array, and so on. The main follow-up question I have here is: what guarantees do we give about the returned array object? __dlpack__ seems like it should be present, but for a non-numeric/bool column it needs to be absent. Calling asarray() from an array library on it should do the right thing as much as possible. Is that the only thing that's allowed? Or should it add __array_namespace__ for even more generality?

kkraus14 commented 1 year ago

I think there are two separate things folks may want here:

1) Return an object guaranteed to be array-like from the perspective of supporting the array interchange protocol
2) Return an object guaranteed to be array-like from the perspective of supporting the array API

The two could be the same object or they could be different objects. I.e., you could imagine a distributed library where (2) returns a distributed array implementation, whereas (1) guarantees local memory.

MarcoGorelli commented 1 year ago

Right, let's try to get this in, as it's a fairly important one. We can always revisit later if what we get into the first version isn't good enough

We could also have a to_array() method that ensures you get back "some" object that has array interfaces (e.g. __dlpack__), so that someone could do np.asarray(col.to_array()) if they know they want a numpy array.

Concretely, what methods does the return value need to have? You wrote above that __dlpack__ would be too limiting because it would only allow for numeric datatypes. Which method(s) should the return value need to have?

rgommers commented 1 year ago

I think there's two separate things folks may want here:

That's a good point. For (1) I think xp.asarray(df.to_array()) should work. (2) is a lot more work I guess, and not everyone may want to implement that. But if it is implemented, it should be indicated by the presence of __array_namespace__ and then the complete standard should work.

Concretely, what methods does the return value need to have? You wrote above that __dlpack__ would be too limiting because it would only allow for numeric datatypes.

I think we should focus on (1) - interchange to an actual array object containing the data. For that, I'd say the primary spec should be "asarray(df.to_array()) has to work whenever possible." Methods are tricky; I'm thinking:

- __dlpack__ is mandatory for bool and numerical data types

MarcoGorelli commented 1 year ago

__dlpack__ is mandatory for bool and numerical data types

does it work for bool?

In [19]: np.array([True, True]).__dlpack__()
---------------------------------------------------------------------------
BufferError                               Traceback (most recent call last)
Cell In [19], line 1
----> 1 np.array([True, True]).__dlpack__()

BufferError: DLPack only supports signed/unsigned integers, float and complex dtypes.

rgommers commented 1 year ago

Argh, bool dtype support was only recently added: https://github.com/dmlc/dlpack/issues/75, and NumPy still needs an update. Okay, scratch bool then, unfortunately.

jorisvandenbossche commented 1 year ago

The to_array() method could also have a "null_value" (or similar) keyword, to indicate which value to use instead of null (since arrays don't support nulls)

jorisvandenbossche commented 1 year ago

And maybe also a target dtype? (Although the question then becomes how to specify this dtype, unless that's something the array API spec has resolved?) For example, if you want integers for datetime64 (to have dlpack support), want to specify object dtype for strings (although that's probably too numpy-specific?), or float for integers with missing values.
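
Combining the two suggestions, the signature might look something like this (purely a sketch; neither keyword was settled here, and _fill_nulls is a hypothetical helper standing in for library-specific null handling):

class Column:
    def to_array(self, *, null_value=None, dtype=None):
        # hypothetical sketch: substitute `null_value` for missing values,
        # then cast to the requested `dtype` if one was given
        arr = self._arr if null_value is None else self._fill_nulls(null_value)
        return arr if dtype is None else arr.astype(dtype)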

MarcoGorelli commented 1 year ago

let's take the conversation on to_array over to https://github.com/data-apis/dataframe-api/issues/139 and keep this one focused on to_iterable

MarcoGorelli commented 1 year ago

regarding to_iterable - I'm not really sure it's necessary

if we have Column.__getitem__ and Column.__len__, then people can iterate over the elements manually however they want (if they really need to), and by keeping it out of the standard we don't risk encouraging inefficient patterns
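
For illustration, manual iteration under that assumption would just be (a sketch; col is any Column exposing __getitem__ and __len__):

# with only __getitem__ and __len__, a consumer can iterate manually
values = [col[i] for i in range(len(col))]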

kkraus14 commented 1 year ago

regarding to_iterable - I'm not really sure it's necessary

if we have Column.__getitem__ and Column.__len__, then people can iterate over the elements manually however they want (if they really need to), and by keeping it out of the standard we don't risk encouraging inefficient patterns

I think before we can commit to this we need to have alignment on what type is returned from Column.__getitem__. I.e., for cuDF, if you call __getitem__ against a cudf.Series with a scalar, it returns a cudf.Scalar, which allows taking some GPU fast paths if you feed that scalar into other cuDF APIs taking scalars.

MarcoGorelli commented 1 year ago

then it returns a cudf.Scalar

this sounds fine, we can note that it's implementation-specific

rgommers commented 1 year ago

Sounds fine to me too - and that is generically true for any Python objects (scalars, tuples, etc.); they can (almost?) always be replaced by duck-type-compatible objects that the library prefers.