data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Data exchange formats #29

Closed datapythonista closed 3 years ago

datapythonista commented 4 years ago

Based on what is defined in https://github.com/wesm/dataframe-protocol/pull/1, the idea is not to support a single format to exchange data, but to support multiple formats (e.g. Arrow, NumPy).

I'm using a code example here to show what this approach implies.

1. Dataframe implementations should implement __dataframe__, returning an object in the exchange format we are defining

For example, let's assume Vaex is using Arrow, and it wants to offer its data in Arrow format to consumers:

import pyarrow

class VaexExchangeDataFrame:
    """
    The format defined by our spec.

    Besides `to_arrow` and `to_numpy`, it should implement the rest of
    the spec: `num_rows`, `num_columns`, `column_names`...
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def to_arrow(self):
        return self.arrow_data

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')

class VaexDataFrame:
    """
    The public Vaex dataframe class.

    For simplicity of the example, this just wraps an arrow object received in the constructor,
    but this would be the whole `vaex.DataFrame`.
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def __dataframe__(self):
        return VaexExchangeDataFrame(self.arrow_data)

# Creating an instance of the Vaex public dataframe
vaex_df = VaexDataFrame(pyarrow.RecordBatch.from_arrays([pyarrow.array(['pandas', 'vaex', 'modin'],
                                                                       type='string'),
                                                         pyarrow.array([26_300, 4_900, 5_200],
                                                                       type='uint32')],
                                                        ['name', 'github_stars']))

Other implementations could use formats different from Arrow. For example, let's assume Modin wants to offer its data as NumPy arrays:

import numpy

class ModinExchangeDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data

class ModinDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return ModinExchangeDataFrame(self.numpy_data)

modin_df = ModinDataFrame({'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
                           'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')})

2. Direct consumers should be able to understand all formats

For example, pandas could implement a from_dataframe function to create a pandas dataframe from different formats:

import pandas

def from_dataframe(dataframe):
    known_formats = {'numpy': lambda df: pandas.DataFrame(df),
                     'arrow': lambda df: df.to_pandas()}

    exchange_dataframe = dataframe.__dataframe__()
    for format_ in known_formats:
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            pass
        else:
            return known_formats[format_](data)

    raise RuntimeError('Dataframe does not support any known format')

pandas.from_dataframe = from_dataframe

This would allow pandas users to load data from other implementations:

pandas_df_1 = pandas.from_dataframe(vaex_df)
pandas_df_2 = pandas.from_dataframe(modin_df)

Vaex, Modin and any other implementation could implement an equivalent function to load data from other libraries into their formats.

3. Indirect consumers can pick an implementation, and use it to standardize their input

For example, Seaborn may want to accept any dataframe implementation, but write its data-access code against pandas. It could convert any dataframe to pandas using the from_dataframe function from the previous section:

def seaborn_bar_plot(any_dataframe, x, y):
    pandas_df = pandas.from_dataframe(any_dataframe)
    return pandas_df.plot(kind='bar', x=x, y=y)

seaborn_bar_plot(vaex_df, x='name', y='github_stars')

Are people happy with this approach?

CC: @rgommers

kkraus14 commented 4 years ago

I'm a very strong -1 to this approach. Once you have something like to_numpy or to_pandas as part of the API, it ends up being what people standardize around which forces library maintainers to implement it for compatibility even though it's possibly an extremely inefficient path.

I pushed in https://github.com/wesm/dataframe-protocol/pull/1 to move to using more protocols / standards to allow for more efficient interoperability. I.E. instead of to_numpy you could have a to_array_like which guarantees that it returns an object that supports the standardized array compute API and supports the standardized array data protocol that are being defined.

From my perspective, we should define something like a __dataframe_interface__ in a somewhat similar vein to the Apache Arrow C Data Interface (https://arrow.apache.org/docs/format/CDataInterface.html) to be used for exchanging data efficiently.

datapythonista commented 4 years ago

I.E. instead of to_numpy you could have a to_array_like which guarantees that it returns an object that supports the standardized array compute API and supports the standardized array data protocol that are being defined.

Thanks for the feedback @kkraus14.

Just to make sure I understand (what you say here, and in Wes' PR thread): you are happy with the general approach, but instead of columns having a def to_numpy() -> numpy.array: (and to_arrow() and others), you would prefer something like:

class StandardArray:
    """We should create a spec for this I guess."""
    def __array__(self):
        ...

class Column:
    def to_array_like(self) -> StandardArray:
        ...

Am I understanding correctly? Sorry, it's a bit difficult to understand in detail what you propose without seeing code.

jorisvandenbossche commented 4 years ago

To concur with Marc's question for clarification, @kkraus14 it is not fully clear to me what exactly you are objecting to. It seems that you are opposed to the specific to_numpy and to_pandas methods, but I understood @datapythonista's code example to be a dummy snippet illustrating the proposed mechanism (i.e. __dataframe__ returning a "proxy" object with a defined interface), not a concrete proposal for the exact methods the interface will have.

So personally, I am +1 on this mechanism; the exact methods that the returned proxy object will have are of course still to be discussed (there are already other issues open about other aspects of the interface, like the number of rows / columns).

kkraus14 commented 4 years ago

My fear is that defining what's allowed / standard in this mechanism is going to be extremely problematic and lead people towards using Pandas / Numpy as the interop through the exchange, which we'd obviously like to avoid. Once there's to_arrow, to_numpy, to_pandas, etc. defined somewhere they end up becoming what everyone uses regardless of whether or not they're actually dependent on specifics from those libraries.

Instead of this mechanism I'd like to see something akin to https://arrow.apache.org/docs/format/CDataInterface.html or https://numpy.org/doc/stable/reference/arrays.interface.html#python-side.

You could imagine something along the lines of:

class cuDFBuffer:
    """
    The public cuDF buffer class.
    """
    def __init__(self, data):
        self.data = data

    def __buffer__(self):
        """
        Produces a dictionary object following the buffer protocol spec to
        describe the memory of the buffer.

        This could likely piggyback off of an array protocol spec for data
        exchange.
        """
        return {
            "size": self.size, # Number of bytes in the buffer
            "ptr": self.ptr, # Pointer to the buffer as an integer
            "read_only": self.read_only, # Whether the buffer is read only
            "version": 0 # Version number of the protocol
        }

class cuDFColumn:
    """
    The public cuDF column class.
    """
    def __init__(self, buffers):
        self.buffers = buffers

    def __column__(self):
        """
        Produces a dictionary object following the column protocol spec
        to describe the memory layout of the column.
        """
        return {
            "dtype": self.dtype, # Format string of the dtype
            "name": self.name, # Name of the column
            "length": self.size, # Number of elements in the column
            "null_count": self.null_count, # Number of null elements, optional
            "offset": self.offset, # Number of elements to offset the column
            "buffers": self.buffers, # Buffers underneath the column, each
                                     # object in this iterator must expose
                                     # buffer protocol
            "children": self.children, # Children columns underneath the column,
                                       # each object in this iterator must
                                       # expose column protocol
            "version": 0 # Version number of the protocol
        }

class cuDFDataFrame:
    """
    The public cuDF dataframe class.
    """
    def __init__(self, columns):
        self.columns = columns

    def __dataframe__(self):
        """
        Produces a dictionary object following the dataframe protocol spec
        to describe the memory layout of the dataframe.
        """
        return {
            "name": self.name, # Name of the dataframe
            "columns": self.columns, # Columns underneath the dataframe, each
                                     # object in this iterator must expose
                                     # column protocol
            "version": 0 # Version number of the protocol
        }

Note that there are a lot of attributes that would need to be captured in these protocols that I'm not covering here, nor am I sure that Python dictionaries are the right approach, but the idea of expressing a hierarchy of objects that eventually point down to memory is what I have in mind for a data exchange protocol.

Dataframe libraries that want to implement a from_dataframe method would then use the protocols at the various levels to handle the memory as desired. We will need to figure out how to specify whether the memory is on CPU vs GPU vs TPU vs disk, etc., but those are implementation details at a lower level than this discussion.
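
A minimal sketch of what such a consumer could look like, assuming the dictionary layouts sketched above, CPU-only memory, NumPy-parsable dtype strings, and no handling of nulls or child columns (all of these are assumptions for illustration, not part of any agreed spec):

import ctypes
import numpy as np

def from_dataframe(df_obj):
    """Walk the (hypothetical) protocol hierarchy down to raw memory."""
    df_info = df_obj.__dataframe__()
    result = {}
    for col_obj in df_info["columns"]:
        col_info = col_obj.__column__()
        # Assume a single flat data buffer per column and a fixed-width dtype.
        buf_info = col_info["buffers"][0].__buffer__()
        # Wrap the raw pointer without copying (CPU memory only).
        raw = (ctypes.c_char * buf_info["size"]).from_address(buf_info["ptr"])
        values = np.frombuffer(raw, dtype=np.dtype(col_info["dtype"]))
        start = col_info["offset"]
        result[col_info["name"]] = values[start:start + col_info["length"]]
    return result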

datapythonista commented 4 years ago

Thanks @kkraus14, it's very clear now.

If we standardize into a single protocol as you describe, as opposed to multiple representations, do you think it could make sense to use Arrow? My understanding is that Arrow's goal is to solve the problem we're addressing here, and while your proposal makes sense, I'm not sure if we may be reinventing the wheel. Any thoughts on this? Do you have in mind any limitation in Arrow, or any reason why you'd prefer to use a custom implementation, and not rely on Arrow?

For reference, Arrow implements what you've got in your example __buffer__:

import numpy
import pyarrow

numpy_array = numpy.random.rand(100)
buffer = pyarrow.serialize(numpy_array).to_buffer()

print(buffer.address)
print(buffer.size)
print(buffer.is_mutable)

In the example, Arrow is copying the memory from the NumPy array; I'm not sure if avoiding the copy is possible.
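
For what it's worth, a zero-copy path may be possible by wrapping the NumPy memory through the buffer protocol instead of serializing it; a rough sketch, assuming pyarrow.py_buffer and pyarrow.Array.from_buffers behave as documented:

import numpy
import pyarrow

numpy_array = numpy.random.rand(100)

# Wrap the existing memory via the Python buffer protocol (no copy).
buffer = pyarrow.py_buffer(numpy_array)
print(buffer.address)
print(buffer.size)
print(buffer.is_mutable)

# Building an Arrow array on top of that buffer should also avoid a copy
# for primitive dtypes without nulls (validity buffer passed as None).
arrow_array = pyarrow.Array.from_buffers(pyarrow.float64(), len(numpy_array),
                                         [None, buffer])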

kkraus14 commented 4 years ago

If we standardize into a single protocol as you describe, as opposed to multiple representations, do you think it could make sense to use Arrow?

From the cuDF perspective it would be good, but we are similarly a columnar dataframe and were built to be compatible with Arrow. For others it may not be as nice of a fit, especially if they don't follow the Arrow data model.

Do you have in mind any limitation in Arrow, or any reason why you'd prefer to use a custom implementation, and not rely on Arrow?

Being tied to Arrow as opposed to being an independent specification makes us tied to Arrow's implementation details. I.E. Arrow uses a single bit per value for booleans versus a byte per value, limited units for timestamps / durations, no support for bfloat types, no support for Python objects, etc. I don't know the appetite of the Arrow community to expand the specification to things that Arrow doesn't and potentially will not support.

Additionally, Arrow only has a specification at the Array level; it doesn't currently have a specification for the Table level or down at the Buffer level. The existing specifications for the Buffer level (Python buffer protocol, __array_interface__, __cuda_array_interface__) don't suffice, as the protocol differs depending on whether the memory is on the CPU or the device.

jorisvandenbossche commented 4 years ago

Before considering doing something "alike Arrow C Data Interface but not exactly", I think we should have a better idea of use cases of such a protocol and where Arrow would not be sufficient.

If there are important use cases for dataframe-like applications that are currently hindered by limitations in the Arrow data types or C interface, then I think the Arrow community will be very interested to hear them and to discuss this. (of course, "dataframe-like" might be the critical word here, as I see there has been discussion (and not yet consensus) in https://github.com/data-apis/dataframe-api/issues/2 on what "kind" of dataframe is in scope for the standard)

Being tied to Arrow as opposed to being an independent specification makes us tied to Arrow's implementation details.

If we go with a low-level exchange protocol as you are outlining (and not relying on lazy conversion to a set of different possible formats such as numpy, pandas, arrow, object following array protocol, ..), then we need to choose some set of implementation details. Given the complexity of defining this (and given the goal of the Arrow project), I would personally say that re-using Arrow's implementation details is actually a plus and will avoid lots of discussion.

I.E. Arrow uses a single bit per value for booleans versus a byte per value, limited units for timestamps / durations, no support for bfloat types, no support for Python objects, etc. I don't know the appetite of the Arrow community to expand the specification to things that Arrow doesn't and potentially will not support.

These are indeed all types not directly supported in Arrow at the moment. There are open issues about 8-bit booleans (ARROW-1674) and support for Python objects in Arrow IPC (ARROW-5931, which is not the same as the C interface, to be clear). Re other units for timestamps: has there been any demand for this? (I am personally not aware of this being brought up in Arrow.)

Additionally, Arrow only has a specification at the Array level, they don't currently have a specification for the Table level or down at the Buffer level.

The C Data Interface is actually already used for Tables as well (not directly, but RecordBatch is supported using a StructArray, and Tables can be seen as one or more RecordBatches): pyarrow.RecordBatch has _import_from_c / _export_to_c methods, and this is already used in the R package (using reticulate) to pass in-memory Tables from Python to R and back.


Besides the "to Arrow or not to Arrow" arguments above, I personally still think there is value in the original proposal (based on https://github.com/wesm/dataframe-protocol/pull/1). At the end of the day, people often need the data in a specific format (because their library is built on top of a certain library, because they have a cython algo that requires exactly a numpy array / memoryview buffer, ...), and the proposed mechanism allows this flexibility to let the user of the protocol choose in which format they need to the data, without doing any unneeded conversions beforehand.

TomAugspurger commented 4 years ago

At the end of the day, people often need the data in a specific format

Can we focus on this point before going into the weeds on anything else? Do we think that it's valuable to include something like a to_numpy() in an API standard, i.e. a standard way to take any dataframe object and convert it to a NumPy array? Keith's concern is that it

ends up being what people standardize around [...] even if it's inefficient.

IMO, the target audience for this API standard won't fall into that trap. The entire point of the standard is to make things agnostic to the particular dataframe implementation (at the API level). The motivation for including something like a to_numpy() is to satisfy users who want to consume arbitrary dataframes but need to produce a NumPy array because of some external constraint (their Cython algorithm only supports ndarrays for example). Do we consider that a valid use case that should be supported by the spec?

kkraus14 commented 4 years ago

Do we think that it's valuable to include something like a to_numpy() in an API standard, i.e. a standard way to take any dataframe object and convert it to a NumPy array?

I'm very against this. Why should a protocol include specific implementations for certain containers? I.E. you could imagine a to_array() instead of to_numpy() which is required to return an object that exposes some array protocol which guarantees that calling np.asarray(...) against it gives you a numpy array zero copy. This gives an efficient path to get to a numpy array for libraries which require it, but also doesn't force using numpy.
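
A rough sketch of that idea (the class and names are hypothetical; the zero-copy behavior relies on NumPy's handling of __array_interface__):

import numpy as np

class ArrayLike:
    """Hypothetical return type of to_array(): it doesn't commit to a
    container, it only exposes __array_interface__ so that np.asarray(...)
    can wrap the same memory without copying."""
    def __init__(self, numpy_data):
        self._data = numpy_data

    @property
    def __array_interface__(self):
        return self._data.__array_interface__

col_data = np.arange(5, dtype="uint32")
arr = np.asarray(ArrayLike(col_data))
print(np.shares_memory(arr, col_data))  # True: no copy was made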

Say I'm an array library that can work on CPU/GPU/TPU/etc. and I want to handle being able to ingest a DataFrame from some random library, but I want to ensure that I keep the memory where it currently lives, how would I do that with this proposal?

maartenbreddels commented 4 years ago

Thanks @datapythonista and @kkraus14, you obviously put a lot of thought into this.

I.E. you could imagine a to_array() instead of to_numpy() which is required to return an object that exposes some array protocol which guarantees that calling np.asarray(...) against it gives you a numpy array zero copy.

What is the added value of doing df.to_array() first? E.g. what is the difference between:

array2d = np.asarray(df.to_array())
array2d = np.asarray(df)

In the end, if either of these is possible, we have to specify in the spec that this is possible, right? So in that case, what would favor np.asarray(df.to_array()) over df.to_numpy()? Do we really need the .to_array()?

Is your point (@kkraus14) that you'd like to see only __array_interface__ and __cuda_array_interface__ (and possibly __arrow_array__) instead of to_{lib}, such that we don't explicitly talk about array containers, but about array (memory?) interfaces?

In this case a default implementation of to_array() would be (not that I'm in favor of adding it, but just to make this explicit):

def to_numpy(self):
    return np.asarray(self)

And do we really need to have to_{lib} at the dataframe level, or could we put them on the column level only?

maartenbreddels commented 4 years ago

A point against using the array interface would be columns that have no memory representation.

E.g. a virtual column in Vaex is not materialized, there is no memory anywhere that holds the data. A virtual column in vaex can be materialized to a numpy or arrow array, and a real column can be backed by a numpy or arrow array, so I'm not sure how this would fit into the array protocols (but they can even be something else...).

From Vaex' point of view, implementing __array__ and __arrow_array__ (which I see as equivalent to to_numpy() and to_arrow(), am I right?) makes total sense (at least for columns).

The __array_interface__ and __cuda_array_interface__ protocols can make sense in some (but not all) cases. Vaex could always 'lie'/'cheat': create a numpy or cupy array on the fly and return that instead, but I'm not sure that would be good behavior.

So what I am basically asking is: do array protocols make sense for dataframes that have no materialized/real columns?

datapythonista commented 4 years ago

Thanks @maartenbreddels, that's a very good point, thanks for bringing it up.

I'm not quite understanding how both approaches are different regarding virtual columns. I guess the first thing to decide is what to do with them:

  1. Support them by materializing them first
  2. Support them as virtual columns
  3. Not support them

For 1, I don't see how the multi-format approach is different from the approach of an actual protocol. I guess you'll have to materialize them first when __dataframe__ is called in the producer, and then they'll become just regular columns that you can get a pointer to or share as numpy arrays.

For 2, I think it's a bit trickier. I guess for it to work we need all dataframe libraries to support them, or the consumer to know how to materialize them given their formula. Assuming that is something we want to do, I guess the exchange will have to be implemented independently in both cases, since we'll be exchanging a formula, not a numpy array or a pointer.

So, while it's a very good point to take into consideration, I don't fully understand why you think the numpy+arrow approach is better than exchanging a pointer and metadata (with arrow or a new protocol). Seems like it's independent of the approach we use. I guess I'm missing something.

TomAugspurger commented 4 years ago

From Vaex' point of view, implementing __array__ and __arrow_array__ (which I see as equivalent to to_numpy() and to_arrow(), am I right?)

In pandas, we've found that the __array__ protocol isn't flexible enough, so we defined .to_numpy which takes a few additional arguments. In particular, na_value has proved necessary.

A NumPy ndarray representing the values in this Series or Index.

Parameters
----------
dtype : str or numpy.dtype, optional
    The dtype to pass to :meth:`numpy.asarray`.
copy : bool, default False
    Whether to ensure that the returned value is not a view on
    another array. Note that ``copy=False`` does not *ensure* that
    ``to_numpy()`` is no-copy. Rather, ``copy=True`` ensure that
    a copy is made, even if not strictly necessary.
na_value : Any, optional
    The value to use for missing values. The default value depends
    on `dtype` and the type of the array.

maartenbreddels commented 4 years ago

@TomAugspurger Good point

@datapythonista: I don't like forcing 1 (materializing); I prefer that this stays an implementation detail.

The problem is not limited to Vaex's virtual columns. The data could also live remotely, in a database, or be backed by dask. The point is that a column is not necessarily backed by an array, which makes exposing the array interface for a column unnatural. Of course, an implementation could do this lazily, by implementing __array_interface__ / __cuda_array_interface__ with a property/getter.

Maybe I am starting to answer the question I posed to @kkraus14 about why we should have an explicit .to_array() method. The .to_array() method would force an implementation to materialize a column (or dataframe?) such that the array interface would make sense. Similarly a .to_cuda_array() could expose an object with a __cuda_array_interface__ and would allow libraries to place their data on the GPU in the most efficient way (or by passing the pointer when already materialized on the GPU).
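
A toy sketch of that idea for a lazily-evaluated column (the names and the expression mechanism are made up for illustration, not how Vaex actually implements virtual columns):

import numpy as np

class VirtualColumn:
    """A column defined by an expression: no memory holds its values
    until materialization is explicitly requested."""
    def __init__(self, expression, inputs):
        self._expression = expression  # callable combining the inputs
        self._inputs = inputs

    def to_array(self):
        # Materialize on demand, then hand back something exposing the
        # plain array protocol (here simply a NumPy array).
        return np.asarray(self._expression(*self._inputs))

x = np.arange(5)
y = np.ones(5)
col = VirtualColumn(lambda a, b: 2 * a + b, (x, y))
arr = np.asarray(col.to_array())  # the materialization/copy is explicit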

I don't fully understand why you think the numpy+arrow approach is better than exchanging a pointer and metadata (with arrow or a new protocol).

I think we need both. AFAIK there is no __arrow_array_interface__, so I don't think we can expose the arrow arrays without depending on pyarrow (and returning a real pyarrow array). So that leaves us with __arrow_array__ and/or .to_arrow(). On the other hand, I see value in exposing the data via __array_interface__ and/or __cuda_array_interface__.

amueller commented 4 years ago

Re the discussion on the call, @TomAugspurger said:

IMO, the target audience for this API standard won't fall into that trap.

I think people will fall into this trap, but the only way to avoid it is documentation and communication; I don't think there's a technical solution.

The entire point of the standard is to make things agnostic to the particular dataframe implementation (at the API level). The motivation for including something like a to_numpy() is to satisfy users who want to consume arbitrary dataframes but need to produce a NumPy array because of some external constraint (their Cython algorithm only supports ndarrays for example).

I totally agree with that. What is the use-case of the proposed memory exchange protocol?

@kkraus14

but I want to ensure that I keep the memory where it currently lives, how would I do that with this proposal?

I'm pretty sure this is impossible to enforce by technical means, unless there is no way at all to convert your structure to a numpy array. You can only ask downstream libraries not to do the expensive copy. Whether they do that or not is actually a really hard question, but I don't see how it relates to the syntax the downstream library uses to force a numpy array.

@kkraus14 I'm really not sure what use-case you have in mind for your general non-copy protocol.

[after more discussion, I understand the main use-case would be going from one cuda array to another cuda array and/or from one TPU library to another TPU array. I didn't think this was the problem we're trying to solve here, if it is, I don't see how it relates to the .to_numpy discussion].

kkraus14 commented 4 years ago

Say I have a dataframe library, dflib, that can run on CPU, GPU, TPU, and XPU, and additionally there are array libraries for CPU, GPU, TPU, and XPU that I could export to, all of which support the array compute API standard similar to NumPy. Instead of calling to_numpy(), I'd want to be able to call something more general like to_array_compute_like(), which allows computations to continue to be controlled by the library that is accelerator aware. This doesn't make any guarantees about memory layout, just that the container produced allows using the same high-level Python API.

Now, assume I have a Python library backed by a C/C++ library which ultimately needs to work with pointers. If it's backed by CPU memory I can potentially use the Python buffer protocol or NumPy's __array_interface__; if it's backed by CUDA memory I can potentially use __cuda_array_interface__; otherwise I need to tightly couple to the object's container type. To add onto that, say I have a library that supports both CPUs and GPUs and has a container that exposes all 3 protocols: which one should I use? We'd really need a memory protocol that allows specifying where the memory lives, so that downstream libraries can make informed decisions. You could imagine an API of something like to_array_interface_object(), which returns an object that exposes some new array protocol encapsulating this information.

But say you have a library that uses the numpy C-API so it needs to guarantee a numpy array and it's being given an input from an XPU library. You could imagine adding an argument to the API of something like to_array_interface_object(device="CPU") which asks the producing library to give you a CPU backed object or presumably throw if it can't. Then you could call np.asarray(...) on the object to get a numpy array zero copy from it.
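
A sketch of how a producer might honor such a device argument (the method name follows the example above; copy_to_host() is just a stand-in for whatever explicit device-to-host path the library provides):

class XPUColumn:
    """Hypothetical column of an accelerator dataframe library."""
    def __init__(self, device_data):
        self._device_data = device_data

    def to_array_interface_object(self, device="XPU"):
        if device == "XPU":
            # Memory stays where it lives; expose the device-side protocol.
            return self._device_data
        if device == "CPU":
            # Explicitly requested device --> host copy.
            return self._device_data.copy_to_host()
        raise TypeError(f"cannot export to device {device!r}")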

Given numpy is the standard, there's a lot of code out there that just calls np.asarray(obj) where obj could be any arbitrary object. Library maintainers of obj should have the ability to nicely prevent those potentially non-performant implicit paths from occurring, while still giving an explicit path for other library maintainers to use without enforcing a direct dependency on their project.

It's still a trap, but it's a much more explicit opt-in trap as opposed to a relatively implicit trap.

amueller commented 4 years ago

So you want a bridge from the dataframe-like API standard we are defining to the array-like API standard we are defining? Or to a different array standard that contains more meta-data?

devin-petersohn commented 4 years ago

I don't want to put words in @kkraus14's mouth, but my interpretation of the problem is that with cuDF they do not ever want to copy to numpy (or any other object that must live on host memory), they would prefer to not be forced into doing that with this spec. (please correct me if I am wrong here)

The solution proposed, np.asarray (or equivalent), would put the burden on the consumers of the dataframe to handle the conversion and error checking rather than on the maintainers of the dataframe to implement it.

Either way, if hypothetically some libraries will choose not to implement something because it is inefficient for them what is the best way forward @rgommers?

kkraus14 commented 4 years ago

I don't want to put words in @kkraus14's mouth, but my interpretation of the problem is that with cuDF they do not ever want to copy to numpy (or any other object that must live on host memory), they would prefer to not be forced into doing that with this spec. (please correct me if I am wrong here)

We want any copy to numpy to be explicitly requested by the user or library instead of implicit. np.asarray is so widely used that just making it work would cause more problems than it solves for many accelerator libraries. For this reason, libraries like cuDF and CuPy poison the method to throw an exception instead of implementing the device --> host copy path for it to succeed.
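
For reference, the "poisoning" pattern looks roughly like this (a sketch; the exact exception type and message vary per library):

class DeviceArray:
    """Accelerator-resident array that refuses implicit host conversion."""
    def __array__(self, dtype=None):
        # np.asarray(DeviceArray(...)) calls this and fails loudly instead
        # of silently copying device memory to the host.
        raise TypeError(
            "Implicit conversion to a host NumPy array is not allowed; "
            "use an explicit conversion method instead."
        )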

The solution proposed, np.asarray (or equivalent), would put the burden on the consumers of the dataframe to handle the conversion and error checking rather than on the maintainers of the dataframe to implement it.

Dataframe libraries could still choose to implement the numpy array protocol if they want that to work. Having a specific memory exchange protocol and an associated argument to specify a device where you want the memory gives dataframe libraries the option to not support the numpy array protocol for example while still giving other library maintainers a more explicit path to copy from device --> host without forcing them to depend on the library directly.

Either way, if hypothetically some libraries will choose not to implement something because it is inefficient for them what is the best way forward @rgommers?

This came up in Wes's PR as well, and the problem is that you end up with landmine optional features. Something like to_numpy(), to_pandas(), or to_arrow() is a landmine: libraries like Pandas, NumPy, and Arrow will support them and people will then use them generally, where they would actually be better served by something like to_array_compute_like() or to_dataframe_compute_like().

kkraus14 commented 4 years ago

So you want a bridge from the dataframe-like API standard we are defining to the array-like API standard we are defining? Or to a different array standard that contains more meta-data?

I'm using to_numpy purely as an example. We could replace that with to_arrow where returning an Arrow container object is likely to be similarly troublesome.

amueller commented 4 years ago

@kkraus14 sorry, then I'm still lost on the use-case; clearly I'm missing something. You're saying we shouldn't have to_arrow so we are not limited to a particular container, and instead expose a more generic API. I thought the dataframe API was this more generic API.

I totally agree about the need to be explicit when a copy is forced and to allow downstream libraries to reason about capabilities of the different dataframe libraries. I thought what we were discussing was the interface to force a copy / cast to a known format. How/when/whether libraries use that seems like a social issue to me.

I don't understand what the return type of to_array_compute_like or to_dataframe_compute_like would be, other than a thing that satisfies the API we're defining.

kkraus14 commented 4 years ago

I don't understand what the return type of to_array_compute_like or to_dataframe_compute_like would be, other than a thing that satisfies the API we're defining.

There's two separate proposals here.

Let's say proposal number 1 is to_array_compute_like or to_dataframe_compute_like, which guarantees that you receive an object, say df, that you can immediately perform computations with using a standard dataframe API (to be defined), i.e. df.column('a').sum(). It makes no guarantees about memory layout or even offering a path to the memory. This dataframe could be single-threaded on the CPU, multithreaded on the CPU, running on a GPU, running on a cluster of 10k nodes, etc. The main guarantee here is that I can write the same high-level array/dataframe code against the object returned from it and things work. Ideally any object that is returned via these APIs also implements proposal number 2 (see below). This would be ideal for a library like seaborn, which only uses a high-level dataframe API to do things like groupbys / aggregations / etc. before handing off to a lower-level library like matplotlib. I have no issues with this proposal whatsoever and think it would be great for the ecosystem.

Proposal number 2 is to_array_interface_like or to_dataframe_interface_like, which guarantees you receive an object, say df, that implements an array / dataframe exchange protocol (to be defined) which specifies a memory layout of the object. This protocol would presumably be able to indicate that the memory is on the CPU, GPU, TPU, XPU, etc. so that a library can leverage that information to choose a codepath accordingly. Now, a library like matplotlib always wants CPU memory and knows it won't be receiving large amounts of data, so an implicit copy is okay, but this poses a problem if it receives an object with memory on a GPU or TPU or XPU. What was brought up in the discussion today was adding an argument to the function along the lines of to_dataframe_interface_like(CPU=True), which asks the library to explicitly give them an object in CPU memory. This object could then be read zero copy via something like np.asarray or pd.asdataframe, which then allows a library like Matplotlib to continue to use the numpy C-API or something similar as it currently does today.

For library maintainers like Matplotlib, sklearn, etc. I think the ask is that instead of just doing np.asarray(obj) that they'd do np.asarray(obj.to_array_interface_like(CPU=True)).

amueller commented 4 years ago

I'm struggling with understanding proposal 1. If we have a dataframe object that implements the API we define, wouldn't to_dataframe_compute_like be a no-op?

datapythonista commented 4 years ago

Lets say proposal number 1 is to_array_compute_like or to_dataframe_compute_like

What we are planning is to work on that, after finishing with your proposal 2.

adding an argument to the function along the lines of to_dataframe_interface_like(CPU=True)

Assuming we don't allow having one dataframe column on CPU and another on GPU, or other per-column configuration, would it make sense to use something like __dataframe__(device='cpu', allow_copy=False)? And then expect, let's say, an output like:

{'col1': obj_implementing_an_array_protocol_with_the_requested_specs,
 'col2': ...
}

I'm not proposing any details (returning a dict, the params...), just checking whether I'm understanding your idea correctly.

Unrelated to this, I checked __arrow_array__ and, if I'm not missing anything, it seems like it's not a protocol. my_object.__arrow_array__() is expected to return an Arrow array (see the code here). So, it's just a convenient way to be able to write pyarrow.array(my_object) instead of my_object.to_array(), which I guess means that supporting __arrow_array__ would require producers to depend on Arrow and perform the transformation themselves, instead of implementing a protocol. Let me know if anyone is more familiar with this and I'm missing something.
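
To illustrate: roughly all that __arrow_array__ gives us is the dispatch below, where the producer itself has to build and return a real pyarrow array:

import pyarrow

class ProducerColumn:
    """A producer supporting __arrow_array__ must construct a pyarrow
    array itself, i.e. it takes a hard dependency on pyarrow."""
    def __init__(self, values):
        self._values = values

    def __arrow_array__(self, type=None):
        return pyarrow.array(self._values, type=type)

# pyarrow.array() dispatches to __arrow_array__ when it is present.
arrow_array = pyarrow.array(ProducerColumn([1, 2, 3]))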

kkraus14 commented 4 years ago

I'm struggling with understanding proposal 1. If we have a dataframe object that implements the API we define, wouldn't to_dataframe_compute_like be a no-op?

Yes, but the original proposal at the top of this issue made it seem like part of the exchange protocol would be supporting numpy as well.

What we are planning is to work on that, after finishing with your proposal 2.

Sounds good to me.

Assuming we don't allow having one dataframe column in CPU and another in GPU, or other per-column configuration, would make sense to use something like __dataframe__(device='cpu', allow_copy=False)?

Yes, the idea would be to do something like that, though I think we'd need to iterate on the args a bit more. allow_copy feels a bit overloaded: I'm unclear whether it means I'm just getting a different copy of the memory so I can't possibly mutate it, or that I'm possibly paying a performance penalty to receive the memory. Say I have an XPU library: would __dataframe__(device='xpu', allow_copy=True) return me the memory in place or return a copy? If it would return a copy, it would have much higher performance than __dataframe__(device='cpu', allow_copy=True), which feels a bit overloaded. I think the CPU device is going to be the common additional option among accelerator libraries, where it may make sense to fuse the allow_copy and device='cpu' options somehow to more clearly indicate: "An implicit slow accelerator --> host copy is okay here to give me CPU memory".

jorisvandenbossche commented 4 years ago

Regarding part 2 (the exchange protocol with actual specified memory layout):

Having a specific memory exchange protocol and an associated argument to specify a device where you want the memory

@kkraus14 if the requirement for specifying the device is a blocking issue for being able to use the Arrow C Data Interface, I think it would be very useful if you (or someone else with the knowledge / background about this use case) could bring this up on the arrow dev mailing list.

(even regardless of whether we would want to use this C interface in the consortium proposal or not)

kkraus14 commented 4 years ago

@kkraus14 if the requirement for specifying the device is a blocking issue for being able to use the Arrow C Data Interface, I think it would be very useful if you (or someone else with the knowledge / background about this use case) could bring this up on the arrow dev mailing list.

Sure, someone from my team or I will bring this up on the Arrow dev mailing list. For the folks here though, it would probably look something like how DLPack handles different devices: https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h#L38-L74

rgommers commented 4 years ago

I opened gh-30 to summarize, as concisely as I could, this issue, previous issues, and the discussions in the weekly calls we've had on this topic. If there's any requirement or important question I missed, please point it out. I hope that gets us all on the same page, after which talking about proposed designs should be a lot easier.

maartenbreddels commented 4 years ago

Thanks Ralf and Marc, for #30 and #31 .

Something that is not clear to me is whether we want Keith's interface proposal, where __dataframe__ would only return 'data' (a dict, similar to __array_interface__), or whether it should be an object, the same object that would later on implement the full dataframe standard API.

jorisvandenbossche commented 4 years ago

Based on Ralf's comment at https://github.com/data-apis/dataframe-api/pull/30#issuecomment-693698425, I would assume we need some object that also has other methods (like to get the number of rows / columns, or to only get a specific column). But indeed, that's currently not very clear, so it's something we need to clarify.

maartenbreddels commented 4 years ago

Also, something that @datapythonista addressed is chunking. If data sources (the data behind the dataframe) are too large to fit into memory, we need some way to get chunks out in an efficient way, so that we can dump them to disk or a database.

One concern I have is how this chunking would play with the protocols we discussed, e.g. an API like this:

for chunk in df['col'].chunked(max_length=1_000_000):
  ar = np.array(chunk)

does not know in advance that the materialized array should end up in a numpy container, while its default implementation may decide to materialize to an Arrow array (one could argue that the copy from Arrow to NumPy is cheap, but that's not the case for null values / masked arrays).

A very ugly API that would give as much information in advance would be:

for chunk in df['col'].chunked(max_length=1_000_000, protocol='__array_interface__'):
  ar = np.array(chunk)

A second concern I have is how that would deal with 'materializing' multiple columns at once, as this can be more efficient if there are inter-column dependencies (CPU cache, disk cache, file layout efficiency, reusing calculations).

for chunks in df[['col1', 'col2']].chunked(max_length=1_000_000, protocol='__array_interface__'):
  arrays = [np.array(chunk) for chunk in chunks]

(And of course the mandatory async generators)

rgommers commented 4 years ago

something that is not clear to me is if we want Keith's interface proposal, where the __dataframe__ would only return 'data' (a dict, similar to __array_interface__) or that it should be an object, the same object that would later on implement the full dataframe standard API.

Based on Ralf's comment at #30 (comment), I would assume we need some object that also has other methods (like to get the number of rows / columns, or to only get a specific column). But indeed, that's currently not very clear, so something we need to clarify.

Agreed, this is unclear right now. I suspect my last comment there is off-base, and we should clearly separate the data interchange from the object with a unified API. So I'd suggest we want __dataframe__ to be a method that returns a dict, with the essential metadata (which could include n_rows, n_columns) inside that dict.
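
Something along these lines, where all names are purely illustrative and not a concrete proposal:

class SomeDataFrame:
    """Illustrative only: __dataframe__ returns a plain metadata dict,
    kept separate from any future standardized-API dataframe object."""
    def __init__(self, columns, n_rows):
        self._columns = columns  # e.g. {name: column_protocol_object}
        self._n_rows = n_rows

    def __dataframe__(self):
        return {
            "n_rows": self._n_rows,
            "n_columns": len(self._columns),
            "column_names": list(self._columns),
            "columns": self._columns,  # each following the column-level protocol
            "version": 0,
        }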

jorisvandenbossche commented 4 years ago

So I'd suggest we want __dataframe__ to be a method that returns a dict

But if that dict already includes the columns' data as one of its keys, then it's not necessarily possible to only convert a subset of columns? (Of course this depends on the specific layout of the dict, and on whether the columns stored in the dict are themselves objects with a further interface (delaying the conversion), or the actual data in the exchange format.)

rgommers commented 4 years ago

Yes, it depends. Let's turn that around: say it must be possible to convert only a subset of columns. That then has implications for the details of the implementation.

jorisvandenbossche commented 3 years ago

@kkraus14 if the requirement for specifying the device is a blocking issue for being able to use the Arrow C Data Interface, I think it would be very useful if you (or someone else with the knowledge / background about this use case) could bring this up on the arrow dev mailing list.

Sure, someone from my team or I will bring this up on the Arrow dev mailing list. For the folks here though, it would probably look something like how DLPack handles different devices: https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h#L38-L74

@kkraus14 gentle reminder for this (I could also start a thread, but I basically know nothing about the details, requirements, what pointers mean if there is a device keyword, etc)

kkraus14 commented 3 years ago

Thanks for the nudge @jorisvandenbossche. Talking internally now to get someone to act as the point person in engaging the mailing list to detail out wants vs needs for the information relevant to handling device data as opposed to just CPU data.

jorisvandenbossche commented 3 years ago

@kkraus14 another ping for starting a discussion about device support / C Data Interface issue.

kkraus14 commented 3 years ago

Thanks for the ping, this slipped through the cracks on my end. Will do my best to push things forward.

rgommers commented 3 years ago

@kkraus14 just checking in on this device support in the Arrow protocol - was the discussion started?

kkraus14 commented 3 years ago

@kkraus14 just checking in on this device support in the Arrow protocol - was the discussion started?

It has not. Given the amount of discussion in the DLPack issue for now, I think it makes sense to iron out the details there and have something to point to before proposing anything to the Arrow protocol.

rgommers commented 3 years ago

That makes sense to me, thanks Keith.