How to consume a single buffer & connection to array interchange

data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard

https://data-apis.org/dataframe-api/draft/index.html

MIT License

102 stars 20 forks source link

How to consume a single buffer & connection to array interchange #39

Open rgommers opened 3 years ago

rgommers commented 3 years ago

For dataframe interchange, the smallest building block is a "buffer" (see gh-35, gh-38) - a block of memory. Interpreting that is nontrivial, especially if the goal is to build an interchange protocol in Python. That's why DLPack, buffer protocol, __array_interface__, __cuda_array_interface__, __array__ and __arrow_array__ all exist, and are still complicated.

For what a buffer is, currently it's only a data pointer (ptr) and a size (bufsize) which together describe a contiguous block of memory, plus a device attribute (__dlpack_device__) and optionally DLPack support (__dlpack__). One open question is:

Should a buffer support strides? This helps describe actual memory layout, e.g. Pandas can have strided columns (see https://github.com/data-apis/dataframe-api/pull/38#discussion_r586029288)

The other, larger question is how to make buffers nice to deal with for implementers of the protocol. The current Pandas prototype shows the issue:

def convert_column_to_ndarray(col : ColumnObject) -> np.ndarray:
    """
    """
    if col.offset != 0:
        raise NotImplementedError("column.offset > 0 not handled yet")

    if col.describe_null not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")

    # Handle the dtype
    _dtype = col.dtype
    kind = _dtype[0]
    bitwidth = _dtype[1]
    if _dtype[0] not in (0, 1, 2, 20):
        raise RuntimeError("Not a boolean, integer or floating-point dtype")

    _ints = {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64}
    _uints = {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64}
    _floats = {32: np.float32, 64: np.float64}
    _np_dtypes = {0: _ints, 1: _uints, 2: _floats, 20: {8: bool}}
    column_dtype = _np_dtypes[kind][bitwidth]

    # No DLPack yet, so need to construct a new ndarray from the data pointer
    # and size in the buffer plus the dtype on the column
    _buffer = col.get_data_buffer()
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))

    # NOTE: `x` does not own its memory, so the caller of this function must
    #       either make a copy or hold on to a reference of the column or
    #       buffer! (not done yet, this is pretty awful ...)
    x = np.ctypeslib.as_array(data_pointer,
                              shape=(_buffer.bufsize // (bitwidth//8),))

    return x

From https://github.com/data-apis/dataframe-api/pull/38#pullrequestreview-602372092 (@kkraus14 & @rgommers):

_In __cuda_array_interface__ we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well._

Yes that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, not just store a reference to the whole exchange dataframe object (buffers can end up elsewhere outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution.

___cuda_array_interface__ is directly attached to the object you need to hold on to, which is not the case for this Buffer._

I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.

Yep, for numerical data types the solution can simply be: hurry up with implementing __dlpack__, and the problem goes away. The dtypes that DLPack does not support are more of an issue.

From https://github.com/data-apis/dataframe-api/pull/38#discussion_r574712313 (@jorisvandenbossche):

_I personally think it would be useful to keep those existing interface methods (or array, or __arrow_array_). For people that are using those interface, that will be easier to interface with the interchange protocol than manually converting the buffers.

Alternative/extension to the current design

We could change the plain memory description + __dlpack__ to:

Implementations MUST support a memory description with ptr, bufsize, and device
Implementations MAY support buffers in their native format (e.g. add a native enum attribute, and if both producer and consumer happen to use that native format, they can call the corresponding protocol - __arrow_array__ or __array__)
Implementations MAY support any exchange protocol (DLPack, __cuda_array_interface__, buffer protocol, __array_interface__).

(1) is required for any implementation to be able to talk to any other implementation, but also the most clunky to support because it needs to solve the "who owns this memory and how do you prevent it from being freed" all over again. What is needed there is

The advantage of (2) and (3) are that they have the most hairy issue already solved, and will likely be faster.

And the MUST/MAY should address @kkraus14's concern that people will just standardize on the lowest common denominator (numpy).

What is missing for dealing with memory buffers

A summary of why this is hard is:

Underlying implementations are not compatible. E.g., NumPy doesn't support variable length strings or bit masks, Arrow does not support strided arrays or byte masks.
DLPack is the only protocol with device support, but it does not support all dtypes that are needed.

So what we are aiming for (ambitiously) is:

Something flexible enough to be a superset of NumPy and Arrow, with full device support.
In pure Python

The "holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well" seems necessary for supporting the raw memory description. This probably means that (a) the Buffer object should include the right Python object to keep a reference to (for Pandas that would typically be a 1-D numpy array), and (b) there must be some machinery to keep this reference alive (TBD what that looks like, likely not pure Python) in the implementation.

kkraus14 commented 3 years ago

Yes that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, not just store a reference to the whole exchange dataframe object (buffers can end up elsewhere outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution.

The way we handle this in cudf is we have a cudf.Buffer class which is non-owning. It has a ._owner attribute which holds a reference to an object that is responsible for managing the lifetime of the actual memory. For our owning memory, we have our rmm (RAPIDS Memory Manager) library, and it's associated class of rmm.DeviceBuffer which uses a C++ smart pointer to a C++ class under the hood to control memory lifetime. Regardless, we haven't had any issues with the cudf.Buffer class and using ._owner to hold a reference to an object that guarantees the lifetime of the memory referenced by the Buffer instance.

__cuda_array_interface__ is directly attached to the object you need to hold on to, which is not the case for this Buffer.

Actually in our case it directly is. We often hold a reference to a cupy array or numba device array or pytorch tensor as the underlying owner of memory underneath a Buffer. We also directly implement __cuda_array_interface__ in rmm.DeviceBuffer (https://github.com/rapidsai/rmm/blob/branch-0.19/python/rmm/_lib/device_buffer.pyx#L115-L124) and cudf.Buffer (https://github.com/rapidsai/cudf/blob/branch-0.19/python/cudf/cudf/core/buffer.py#L75-L84) so that buffers can be directly consumed by the other array libraries and then viewed as a different type.

I.E. we do this in a few places with string columns, where we use __cuda_array_interface__ to share the offsets buffer into something like CuPy and then view it as int32 to do some logical operations on it.

Yep, for numerical data types the solution can simply be: hurry up with implementing __dlpack__, and the problem goes away. The dtypes that DLPack does not support are more of an issue.

Yes, for numerical and other columns that can be represented by a single buffer it would be nice to support __dlpack__ directly, but what I'm thinking for a more general column interchange protocol is that it would be composed of a dlpack usage per buffer. I.E. for a string column with a nullmask buffer, a character buffer, and an offsets buffer, we could imagine exchanging the 3 buffers using 3 dlpack instances. The main problem I can envision in using this approach is that certain libraries may have columns be the memory owning entity as opposed to direct buffers, which could then make it difficult for them to get out to separate reference counted buffer objects for controlling the memory lifetime separately.

rgommers commented 3 years ago

Regardless, we haven't had any issues with the cudf.Buffer class and using ._owner to hold a reference to an object that guarantees the lifetime of the memory referenced by the Buffer instance.

That makes sense. This attribute:

        owner : object, optional
            Python object to which the lifetime of the memory
            allocation is tied. If provided, a reference to this
            object is kept in this Buffer.

is also needed in this protocol - except it can't be optional.

We often hold a reference to a cupy array or numba device array or pytorch tensor as the underlying owner of memory underneath a Buffer.

That looks similar (the principle, not the code) to how numpy deals with __array_interface__. In particular (stripping out all null checks etc, see multiarray/ctors.c):

PyArray_FromInterface(PyObject *origin)

    /* Get data buffer from interface specification */
    attr = _PyDict_GetItemStringWithError(iface, "data");

    /* Case for data access through pointer */
    if (attr && PyTuple_Check(attr)) {
        PyObject *dataptr;

        dataptr = PyTuple_GET_ITEM(attr, 0);
        if (PyLong_Check(dataptr)) {
            data = PyLong_AsVoidPtr(dataptr);

        base = origin;

    ret = (PyArrayObject *)PyArray_NewFromDescrAndBase(
            &PyArray_Type, dtype,
            n, dims, NULL, data,
            dataflags, NULL, base);

Here origin is equivalent to the owner in cuDF, and a reference to it is stored in the base attribute of a numpy array.

jorisvandenbossche commented 3 years ago

Alternative/extension to the current design

Those 3 bullet points of (possibly) supported APIs, is this for the Buffer level? Because it's probably also useful to have that on the Column level as well.

E.g., NumPy doesn't support variable length strings or bit masks, Arrow does not support strided arrays or byte masks.

Might not be that relevant, but: Arrow does support byte masks in some way, I.e. as boolean arrays (it just can't use it as nulls to compute anything with it, but you can store it). You may not be able to use byte masks in a plain Arrow array, but at the same time I don't think you can use byte masks in numpy's __array_interface__ either? So also for numpy you need to treat this as a separate array/buffer, so the same can be done for Arrow (eg in Arrow it could be a StructArray with a values field and a mask field)

kkraus14 commented 3 years ago

You may not be able to use byte masks in a plain Arrow array, but at the same time I don't think you can use byte masks in numpy's __array_interface__ either? So also for numpy you need to treat this as a separate array/buffer, so that same can be done for Arrow (eg in Arrow it could be a StructArray with a values field and a mask field)

In __array_interface__ and __cuda_array_interface__ there's a mask attribute that is another object that exposes __array_interface__ or __cuda_array_interface__ respectively that allows for a byte mask.

jorisvandenbossche commented 3 years ago

Ah, cool, I didn't know that. Now, in practice it seems that numpy doesn't use that? (a numpy masked array doesn't return it, and if you create a numpy array from an object returning a dict with a "mask" key, nothing happens with it)

rgommers commented 3 years ago

Now, in practice it seems that numpy doesn't use that? (a numpy masked array doesn't return it, and if you create a numpy array from an object returning a dict with a "mask" key, nothing happens with it)

Indeed, the mask attribute has been in the docs since they were imported into the numpy repo in 2008, but it was never used AFAICT. There was once a plan for a proper masked array implementation in numpy core, rather than the bolted on bit in numpy.ma I believe.

rgommers commented 3 years ago

We could change the plain memory description + __dlpack__ to:

1. Implementations MUST support a memory description with `ptr`, `bufsize`, and device

2. Implementations MAY support buffers in their native format (e.g. add a `native` enum attribute, and if both producer and consumer happen to use that native format, they can call the corresponding protocol - `__arrow_array__` or `__array__`)

3. Implementations MAY support any exchange protocol (DLPack, `__cuda_array_interface__`, buffer protocol, `__array_interface__`).

Summarizing some of the main comments:

It may make sense to have these protocols on the column level too. that would make life easier for the consumer. That would have minor differences in execution semantics (memory may get released at different times), but it would help the majority of use cases.
Using only DLPack at the buffer level seems attractive. But, it doesn't support all dtypes, so may complicate things.
A reference implementation will be useful; at least at the buffer level most libraries will want/need to do the same thing, and it's error-prone code.

Action for me: create two implementations, a minimal and a maximal one, so it's easier to compare.

kkraus14 commented 3 years ago

Using only DLPack at the buffer level seems attractive. But, it doesn't support all dtypes, so may complicate things.

What dtypes does it not support that it needs to at the buffer level? I thought we had previously decided that buffers were untyped since they could be interpreted in different ways? Even if they're typed, I would assume they're int or uint types which dlpack supports.

jorisvandenbossche commented 3 years ago

Using only DLPack at the buffer level seems attractive. But, it doesn't support all dtypes, so may complicate things.

But buffers don't have a dtype?

rgommers commented 3 years ago

Oh right, resolution of that was "due to variable length strings, buffers must be untyped (dtype info lives on the column not the buffer)". There's still an inconsistency in that __dlpack__ does contain the dtype, so what we then should do is create some convention (e.g. always use intxx dtypes at the buffer level, with the bit width corresponding to that of the dtype at the column level).

kkraus14 commented 3 years ago

And this goes back to the strides discussion, that if buffers are untyped, does it really make sense to have strides on them versus controlling that on the Column level? I.E. take something like a theoretical uint128 array, where it would have a data buffer. If I knew the max value of the array was 10, then I could have a typecast operation to basically any uint or int type be a zero-memory op and instead just set a new stride. I would argue this makes a lot more sense to handle at the Column level as opposed to the Buffer level.

jorisvandenbossche commented 3 years ago

Now we are considering going all in on DLPack at the buffer level, a potential alternative to consider would be using the actual Arrow C Data Interface instead? (at the column level)

When describing/discussing the requirements in https://github.com/data-apis/dataframe-api/pull/35, an argument against the Arrow C Data Interface is that we were looking for a Python __dataframe__ interface rather than a C-level interface. But __dataframe__ itself would still be the same python interface as we have been discussing, but the actual interchange of column (chunks) could still use the Arrow C interface (since we consider DLPack at this level, it seems we are fine with a non-pure Python exchange interface).

In the end, for primitive arrays, DLPack and Arrow C Data interface are basically the same (it's almost the same C struct, in arrow there are only some additional null pointers for child/dictionary arrays that are not needed for a primitive array; eg implementing conversion to/from numpy for primitive arrays wouldn't be harder compared to DLPack). But, Arrow C Data interface natively supports the more complex nested/multi-buffer cases we also want to support (eg variable length strings, categoricals).

What's missing for using the Arrow C Data interface is:

an official Python interface to access the pointers (there are currently semi-private _export_to_c/_import_from_c, but those could be made official in a method like __arrow_c_data__, similar as __dlpack__)
support for devices (as discussed before)

Both those points have been discussed / worked out for DLPack in context of the Array API (I think? didn't fully follow this), so transferring what has been learned from that to add the same capabilities to the Arrow C data interface could be a nice addition to Arrow and would IMO be a great outcome of the consortium efforts.

kkraus14 commented 3 years ago

I think it was decided that strided data would be supported which the Arrow C data interface doesn't support as well which we'd need to propose adding.

Another key point is with the Arrow C data interface is that there isn't a way to control lifetime of individual buffers versus the column. I.E. for a string column say I use the __arrow_c_data__ protocol to go between libraries and then go down to the offsets buffer with both the original string column and the interchanged string column both going out of scope. I need a way to guarantee the memory for the offsets buffer stays alive, and the only way to do that is to make sure the release callback isn't called in this situation.

jorisvandenbossche commented 3 years ago

I think it was decided that strided data would be supported which the Arrow C data interface doesn't support as well which we'd need to propose adding.

I looked back at the last meeting notes, and there are some notes about how to implement it (should the strides live at buffer or column level? etc), and not much about reasons to support it (but maybe we didn't capture everything in the notes). At https://github.com/data-apis/dataframe-api/pull/38#discussion_r587686519 there is a bit of discussion about it, and a reason is the current pandas internals that would not be zero-copy in certain cases (for DataFrames constructed from a 2D numpy array, and that has not yet been copied / operated on after construction). Personally I don't find that a blocking issue for not supporting strides.

Another key point is with the Arrow C data interface is that there isn't a way to control lifetime of individual buffers versus the column ... and the only way to do that is to make sure the release callback isn't called in this situation.

And is this problematic to ensure? (honest question, this is out of my comfort zone)

kkraus14 commented 3 years ago

And is this problematic to ensure? (honest question, this is out of my comfort zone)

I would say it adds some complexity but is manageable, but moreso on devices like GPUs with much more limited memory available, every extra byte counts where we'd want to as aggressively release buffers / free memory as possible.

rgommers commented 3 years ago

Now we are considering going all in on DLPack at the buffer level, a potential alternative to consider would be using the actual Arrow C Data Interface instead? (at the column level)

Yes, we should definitely reconsider, because it's better to reuse than reinvent. If it's not a good idea, we should document very clearly why.

A few thoughts:

I think the outcome of the last discussion was (or at least my take away): if buffers are untyped, which they must be because of having to support variable-length strings, then they cannot be strided. So we should specify a copy
There is a long-standing action for @kkraus14 et al. to bring up the lack of device support on the Arrow mailing list. Didn't happen yet, but still of interest (I think?)
No support for numpy style masks in Arrow was another issue IIRC. The conclusion we came to (I need to look for the reference) was we need to support the set of in-memory data representations that includes both the Arrow and the NumPy native representations.

(2) and (3)

and a reason is the current pandas internals that would not be zero-copy in certain cases (for DataFrames constructed from a 2D numpy array, and that has not yet been copied / operated on after construction). Personally I don't find that a blocking issue for not supporting strides.

It's a bit unfortunate, but it may be worth giving up on striding indeed if it brings us other benefits.

What's missing for using the Arrow C Data interface is:

an official Python interface to access the pointers (there are currently semi-private _export_to_c/_import_from_c, but those could be made official in a method like __arrow_c_data__, similar as __dlpack__)

Yes, that isn't too hard.

support for devices (as discussed before)

Both those points have been discussed / worked out for DLPack in context of the Array API (I think? didn't fully follow this),

The DLPack device support model already existed, it was the CUDA/ROCm stream support that we figured out how to do. Arrow can learn from DLPack there, but all it would use is CUDA/ROCm I believe. The vector lanes and bit-width stuff (and striding) which DLPack has because of support for FPGAs and other devices probably is too foreign to mix in with Arrow.

so transferring what has been learned from that to add the same capabilities to the Arrow C data interface could be a nice addition to Arrow and would IMO be a great outcome of the consortium efforts.

Yes, I do agree with that.

rgommers commented 3 years ago

I was preparing some more code changes, but summarizing these questions well for tomorrow's call will be more useful I think.

kkraus14 commented 3 years ago

2. There is a long-standing action for @kkraus14 et al. to bring up the lack of device support on the Arrow mailing list. Didn't happen yet, but still of interest (I think?)

Yea there's still more work for me to do here 😅. We've ironed out the semantics on the Python side for dlpack, but we need to do the same for a C interface now. Once we have the learnings from that my plan was to point to that for making a proposal to the Arrow C data interface.

jorisvandenbossche commented 3 years ago

No support for numpy style masks in Arrow was another issue IIRC. The conclusion we came to (I need to look for the reference) was we need to support the set of in-memory data representations that includes both the Arrow and the NumPy native representations.

Since we would only use this for the column-level interchange (and not the full dataframe), and so we still have a python level API layer to communicate about how missing values are stored, I think we could agree on some "standard" way to use a StructArray with 2 arrays (values, mask) to represent a boolean-masked array. Of course, that means that there is no direct mapping between the column dtype and the physical array type to expect, but that you in addition need to check how nulls are stored. But also when using the Buffer python classes, you would need to check this information to know how to interpret the buffers (so that's not necessarily much different when using a StructArray vs plain Array with bitmask).

rgommers commented 3 years ago

Yes, I agree the boolean mask support can be added via a convention.

rgommers commented 3 years ago

Here is a summary of the "use Arrow C Data Interface" option:

Boolean masks

Arrow does not support them natively. However, it does support a boolean dtype, so it is possible to represent a column with a boolean mask in the Arrow C Data Interface through a naming convention. That will break the following specification in the ArrowArray.buffers spec:

_Mandatory ... The pointer to the null bitmap buffer, if the data type specifies one, MAY be NULL only if ArrowArray.null_count is 0._

This should be okay, given that we still have a Python-only API so we can define the convention that a __dataframe__ consumer must use to interpret the data (but, it will lead to implementation issues, see last section of this summary).

Memory management at the buffer level

Quoting Keith: Another key point is with the Arrow C data interface is that there isn't a way to control lifetime of individual buffers versus the column ... and the only way to do that is to make sure the release callback isn't called in this situation. ... it adds some complexity but is manageable, but moreso on devices like GPUs with much more limited memory available, every extra byte counts where we'd want to as aggressively release buffers / free memory as possible.

Whether or not to manage memory at the column or the buffer level is a choice that must be made in the standard; it cannot be left up to the implementing libraries. Reason: managing at the buffer level only helps if both libraries do that, otherwise all memory stays alive anyway - and there isn't even a buffer-level deleter to call unless we add one.

I think this would be a significant change to the Arrow C Data Interface.

Device support

Assumption: we want to only support CPU and GPU (CUDA, ROCm) for now. But make sure it is extensible to other device types later.

Current design, which uses a __dlpack_device__ at the buffer level, is okay as the Python-level interface.

Next steps for device support for Arrow:

Update the C interface of DLPack for stream exchange, mirroring what we did for the Python API. This is https://github.com/dmlc/dlpack/issues/65
Make a proposal to Arrow

Strided buffers

Controlling strides at the buffer level isn't ideal, because it will interfere with variable-length strings.
We are probably okay (?) with making the assumption that a column is contiguous in memory (less efficient support for numpy-based dataframes and for row-based dataframes, they need an extra copy).
That does mean individual buffers could be strided (e.g. nested dtypes), but then the striding may be controlled at the column level.

The less efficient support for row-based dataframes is a larger downside than the numpy-based ones, because row-based dataframes will always have columns that are strided in memory. It'd be much nicer if Arrow (or our protocol, whether based on Arrow or not) supporting striding.

How would column exchange now actually work

What's missing for using the Arrow C Data interface is:

an official Python interface to access the pointers (there are currently semi-private _export_to_c/_import_from_c, but those could be made official in a method like __arrow_c_data__, similar as __dlpack__)

I'm not sure this is necessary - in __dlpack__ we do not have this. There, __dlpack__ is the whole Python API, you cannot dig into it and get out raw pointers. Everything happens through the C API.

There's still a problem to exactly mirror this for dataframes. We still want a superset of what the Arrow C Data Interface offers: boolean masks, device support, perhaps deleters at the buffer level. So we're talking about what is basically either a fork or a v2 of the Arrow C Data Interface.

It may be helpful to sketch the calling code:

# User calls
df = consumer_lib.from_dataframe(df_other, columns=['A', 'B'])

# What happens inside `from_dataframe`:
dfobj = df_other.__dataframe__().get_columns(columns)

cols = dict()
for name in dfobj.column_names():
    cols[name] = convert_column_to_my_native_format(dfobj.get_column_by_name(name))

# Instantiate our own dataframe:
df_out = mylib.DataFrame(cols)

# That native conversion function can use Arrow (maybe)
def convert_column_to_my_native_format(column):
    # Check if null representation is supported by Arrow natively
    if column.describe_null() == 4:  # byte mask (in future, use enum)
        # handle convention in custom implementation, cannot rely directly on Arrow here
        ...

    # This function is all compiled code
    return mylib._use_arrow_c_interface(column)

jorisvandenbossche commented 3 years ago

The less efficient support for row-based dataframes is a larger downside

Do we know of row-based dataframe libraries in Python that can give access to columns as a strided array? (apart from numpy recarrays)

an official Python interface to access the pointers

I'm not sure this is necessary - in __dlpack__ we do not have this.

Isn't that basically what __dlpack__ is? (access to the pointer for C/C++ code, but only wrapped in a PyCapsule?)

It may be helpful to sketch the calling code:

Thanks, concrete code snippets are always helpful! ;) To be clear: it would basically look exactly the same with dlpack, I think? (meaning: the question about how this would be used in practice applies to dlpack as well)

rgommers commented 3 years ago

Do we know of row-based dataframe libraries in Python that can give access to columns as a strided array? (apart from numpy recarrays)

I'd expect Koalas to be able to do this (disclaimer, I don't know about its internals). Or maybe one of the Ibis backends.

rgommers commented 3 years ago

Isn't that basically what __dlpack__ is? (access to the pointer for C/C++ code, but only wrapped in a PyCapsule?)

Well, there's an opaque object that's not meant to be unpacked or even seen by the user. All a user would be is call x2 = from_dlpack(x) where x supports DLPack.

To be clear: it would basically look exactly the same with dlpack, I think? (meaning: the question about how this would be used in practice applies to dlpack as well)

Yes. My point was that there's conventions (like for boolean mask) and extras that are TBD (device support, buffer-level deleters), so we can't say "this uses the Arrow C Data Interface, so you can use your existing implementation to parse it". There's no working C code to reuse, there's only the structs in the Arrow spec that we'd take over.

jorisvandenbossche commented 3 years ago

I'd expect Koalas to be able to do this

Since Koalas is Spark under the hood, I suspect they would use Arrow for efficient Spark->Python data tranfer (but don't know about its internals neither).

Isn't that basically what __dlpack__ is? (access to the pointer for C/C++ code, but only wrapped in a PyCapsule?)

Well, there's an opaque object that's not meant to be unpacked or even seen by the user. All a user would be is call x2 = from_dlpack(x) where x supports DLPack.

But meant to be unpacked by library authors? So we still need the same for a potential Arrow C Data interface?

There's no working C code to reuse, there's only the structs in the Arrow spec that we'd take over.

To better understand what you are referring to / looking for: what would be the equivalent for DLPack for this? Does it have a standalone C implementation that can be reused / exposed as a python library?

jorisvandenbossche commented 3 years ago

Regarding the boolean masks, I would personally not reuse the bitmask buffer in the vector of buffers and make this a boolean array. As that would make the struct no longer ingestible by Arrow. But rather, we can use an existing / valid construct from the Arrow type system to represent a masked array (eg a StructArray with 2 non-nullable fields (values array and mask array) can zero-copy represent a numpy-type masked array).

rgommers commented 3 years ago

But meant to be unpacked by library authors?

Only in C or C++. You vendor dlpack.h, and instantiate your array/tensor from the contents of the C structs.

So we still need the same for a potential Arrow C Data interface?

It kind of is "let's not have a Python API for this". We'd either have to write that reusable library, or force all dataframe library authors to write C. E.g. Modin is now pure Python; with __dlpack__ at the buffer level it can remain that way (it relies on e.g. numpy to deal with memory mgmt via DLPack) while with Arrow at the column level, I don't think it can.

rgommers commented 3 years ago

The very short summary of the discussion on this was:

there are advantages to using the Arrow C Data Interface, with extensions - it's understood that C implementations are needed (e.g. Pandas would be helped by having NumPy understand the Arrow C Data Interface)
for now we'll do two things: (1) make the current Python prototype more feature-complete (in particular, add categorical and string dtype support), and (2) sketch out the alternative based on the Arrow C Data Interface

rgommers commented 3 years ago

Re memory management: also @aregm had very clear use cases for having memory managed at the buffer level. This looks like a must-have.