apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] support for complex64 and complex128 as primitive types for zero-copy interop with numpy #39753

Open maupardh1 opened 9 months ago

maupardh1 commented 9 months ago

Describe the enhancement requested

Hello Arrow team,

First, I love Arrow - thank you so much for making this great project.

I am manipulating multi-dimensional, time-series-like array data, produced by sensors as numpy arrays of type complex64. I would like to manipulate it in Arrow for recording (Feather and/or Parquet formats) and, in the future, for distributed computing (cuDF, Dask, Spark - most likely frameworks on top of Arrow - but also CuPy/SciPy). This would also let me attach column names and other schema metadata. I think it could be superior to (and faster than) manipulating raw numpy arrays.

It would be great to just call pa.array(np.array([1 + 2 * 1j, 3 + 4 * 1j], dtype=np.complex64), type=pa.complex64()) but that type doesn't exist in Arrow. I haven't found a way to zero-copy a complex64 numpy array into a pyarrow array (my understanding is that only primitive types support zero-copy between arrow and numpy, and pa.binary(8) or struct attempts on top of numpy views so far have resulted in copies). I would also need to read it back from a feather/parquet format and potentially convert it to a numpy array if needed, and land back on np.complex64.

I think this has come up once or twice already: I found this thread: https://www.mail-archive.com/dev@arrow.apache.org/msg23352.html and this PR: https://github.com/apache/arrow/pull/10452, and thought I would +1 the request, just in case.

If first-class support isn't on the table, do you see an alternative way to get zero-copy behavior?

Thanks!

Component(s)

Python

zeroshade commented 9 months ago

My initial thought here is that we could represent these via an extension type. Are complex64 and complex128 internally represented using 64-bit and 128-bit values respectively, or are they larger than 8 and 16 bytes?

@lidavidm @pitrou @bkietz thoughts?

jorisvandenbossche commented 9 months ago

There is an existing general issue about complex number support, but the mailing list thread mentioned above has more of the discussion.

pitrou commented 9 months ago

This indeed could be a canonical extension type.

maupardh1 commented 9 months ago

Thank you for the replies! complex64 is 8 bytes in numpy; the dtype description is <c8 (8 bytes, little-endian). Would a canonical extension type provide zero-copy interop?
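The sizes can be double-checked directly in numpy (a quick sanity check; the dtype string assumes a little-endian platform):

```python
import numpy as np

# complex64 = two float32 (8 bytes); complex128 = two float64 (16 bytes)
assert np.dtype(np.complex64).itemsize == 8
assert np.dtype(np.complex128).itemsize == 16
assert np.dtype(np.complex64).str == "<c8"  # '<' marks little-endian here
```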

zeroshade commented 9 months ago

Yes, a canonical extension type would provide zero-copy interop. The underlying representation could still be identical (a contiguous buffer of 8-byte values, for example), with an underlying physical type of uint64 or similar. Anything that recognizes the metadata and identifies the extension type would then be able to interpret the values as complex64. I highly recommend putting together a proposal for these as canonical extension types.

maupardh1 commented 9 months ago

got it - could the underlying type be binary(8), a struct of {real: float32, imag: float32}, or a fixed-size list of 2 floats instead of uint64, and still provide zero-copy behavior? It feels a bit weird to represent a complex64 (two 32-bit floats) as a 64-bit int, even if the underlying type representation doesn't end up getting used for now.

Also, reinterpreting the numpy buffer as another type of the same width doesn't throw (and so, address manipulation only). Is my thinking correct here?

thank you!
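The "address manipulation only" intuition can be verified in numpy: a dtype view reinterprets the same bytes without copying them:

```python
import numpy as np

cplx = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)
floats = cplx.view(np.float32)  # reinterpret, don't convert

# The view aliases the same memory; no bytes are copied.
assert np.shares_memory(cplx, floats)
assert floats.ctypes.data == cplx.ctypes.data
assert floats.tolist() == [1.0, 2.0, 3.0, 4.0]  # interleaved real/imag pairs
```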

zeroshade commented 9 months ago

The only way for it to be zero-copy is if the underlying physical type matches the numpy representation. A struct of {real: float32, imag: float32} wouldn't be represented identically (it would be two separate buffers of float32s). A fixed-size list of floats, though, makes sense and would allow zero-copy. So that could work.

Personally, I think that representing as an extension type with fixed-size-list would be easier and better than creating entirely new primitive types. But that's just my thought, anyone else wanna chime in here?

pitrou commented 9 months ago

Yes, I think fixed_size_list<float, 2> is a good start.

maupardh1 commented 9 months ago

got it. Is there an example of a good extension type PR/proposal that I could take inspiration from, and that covers the numpy zero-copy aspect?

bkietz commented 9 months ago

FixedShapeTensor is a canonical extension type which also uses fixed_size_list as its storage type. It was added in #8510, and the mailing list discussion and vote are archived.

As for numpy zero copy: we don't currently have a canonical extension type which maps to a numpy type like this one would. In order to support this, it'd be necessary to modify functions like NumPyDtypeToArrow which maps from a numpy type to an arrow DataType. That might be sufficient for NumPyConverter::ConvertData to recognize zero-copy compatibility