maupardh1 opened 9 months ago
My initial thought here is that we could represent it via an extension type. Are complex32 and complex64 internally represented using 32-bit and 64-bit values respectively, or are they larger than 4- and 8-byte values?
@lidavidm @pitrou @bkietz thoughts?
There's a general issue about complex support, but the mailing list thread mentioned in the issue description has more of the discussion.
This indeed could be a canonical extension type.
Thank you for the replies! complex64 is 8 bytes in numpy - the dtype descriptor is <c8 (8 bytes, little-endian). Would a canonical extension type provide zero-copy interop?
Yes, a canonical extension type would provide zero-copy interop. The underlying representation could still be identical - for example, a contiguous buffer of 8-byte values whose physical type is uint64 or similar. Anything that recognizes the metadata and chooses to honor the extension type can then interpret the values as complex64. I highly recommend putting together a proposal for these as canonical extension types.
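For illustration, here is a minimal sketch of what registering such a type could look like in PyArrow; the class, the extension name "example.complex64", and the uint64 storage choice are assumptions for this sketch, not an existing API:

```python
import pyarrow as pa

# Hypothetical complex64 extension type, stored as one 8-byte slot per value.
class Complex64Type(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.uint64(), "example.complex64")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to persist

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(Complex64Type())
```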
Got it - could the underlying type be binary(8), a struct of {real: float32, imag: float32}, or a fixed-size list of 2 floats instead of uint64, and still provide zero-copy behavior? It feels a bit weird to represent a complex64 (two 32-bit floats) as a 64-bit int, even if the underlying physical representation doesn't end up getting used for now. Concretely, I'd expect the following (with a hypothetical pa.complex64()) to work:
```python
import numpy as np
import pyarrow as pa

x = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)
# pa.complex64() is hypothetical - it is the type being requested here.
y = pa.array(x, type=pa.complex64()).to_numpy(zero_copy_only=True)
x[0] = 2 + 3j
assert y[0] == 2 + 3j  # mutation visible through both views if memory is shared
```
This shouldn't throw (i.e., address manipulation only, no copies). Is my thinking correct here?
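For reference, the same aliasing check already passes today with an existing primitive type, assuming the input array is contiguous and null-free so both conversions stay zero-copy:

```python
import numpy as np
import pyarrow as pa

x = np.array([1.0, 2.0], dtype=np.float32)
y = pa.array(x).to_numpy(zero_copy_only=True)
x[0] = 5.0
assert y[0] == 5.0  # both views share the same underlying buffer
```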
How about making two primitive types - complex32 and complex64 - is that not an option?
Are there any behavioral differences between extension types and primitive types that could impact performance? (My use case is in the 1 GB/s to 10 GB/s I/O throughput range on a single machine, so I'm looking for ways to write to disk or a network socket at that order of speed - currently thinking of using Feather.)
Last question: I'm ramping up on the docs for extension types - is there a proposal for one of the canonical extension types that you think was done well, and maybe the PR where the work happened? Just so I get a sense of the size of the work.
Thank you!
The only way for it to be zero-copy is if the underlying physical layout matches the numpy representation. A struct of {real: float32, imag: float32} wouldn't be laid out identically (it would be two separate buffers of float32s). A fixed-size list of floats does match, though, and would allow zero-copy - so that could work.
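For what it's worth, that path can already be approximated today without a new type, by viewing the complex buffer as interleaved float32 pairs; a sketch, assuming contiguous data with no nulls:

```python
import numpy as np
import pyarrow as pa

x = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)

# Reinterpret the complex64 buffer as interleaved float32 pairs (no copy).
flat = x.view(np.float32)

# Wrap as fixed_size_list<float32, 2>; the values buffer is shared with x.
values = pa.array(flat)
arr = pa.FixedSizeListArray.from_arrays(values, 2)

# And back: the flat float32 values reinterpreted as complex64, still no copy.
roundtrip = arr.values.to_numpy(zero_copy_only=True).view(np.complex64)
assert roundtrip[0] == x[0]
```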
Personally, I think representing this as an extension type with fixed-size-list storage would be easier and better than creating entirely new primitive types. But that's just my thought, anyone else wanna chime in here?
Yes, I think fixed_size_list<float, 2> is a good start.
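In pyarrow, that storage type is spelled as a list type with a fixed size:

```python
import pyarrow as pa

storage = pa.list_(pa.float32(), 2)
print(storage)  # fixed_size_list<item: float>[2]
```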
Got it. Is there an example of a good extension-type PR/proposal that I could draw inspiration from, ideally one that also covers the numpy zero-copy aspect?
FixedShapeTensor is a canonical extension type which also uses fixed_size_list as its storage type. It was added in #8510, and the mailing list discussion and vote are archived.
As for numpy zero-copy: we don't currently have a canonical extension type that maps to a numpy dtype the way this one would. To support that, it would be necessary to modify functions like NumPyDtypeToArrow, which maps a numpy type to an Arrow DataType. That might be sufficient for NumPyConverter::ConvertData to recognize zero-copy compatibility.
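To see that precedent concretely (assuming a pyarrow version that ships the canonical FixedShapeTensor type), its storage type is indeed a fixed-size list:

```python
import pyarrow as pa

tensor_type = pa.fixed_shape_tensor(pa.float32(), [2])
print(tensor_type.storage_type)  # fixed_size_list<item: float>[2]
```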
Describe the enhancement requested
Hello Arrow team,
First, I love Arrow - thank you so much for making this great project.
I am manipulating multi-dimensional array data (time-series-like) produced by sensors as numpy arrays of type complex64. I would like to manipulate it in Arrow for recording (Feather and/or Parquet formats) and, in the future, for distributed computing (cuDF, Dask, Spark - most likely frameworks on top of Arrow, but also CuPy/SciPy). This would also let me attach column names and other schema metadata. I think it could be superior to (and faster than) manipulating raw numpy arrays.
It would be great to just call:

```python
pa.array(np.array([1 + 2j, 3 + 4j], dtype=np.complex64), type=pa.complex64())
```
but that type doesn't exist in Arrow. I haven't found a way to zero-copy a complex64 numpy array into a pyarrow array: my understanding is that only primitive types support zero-copy between Arrow and numpy, and my attempts with pa.binary(8) or structs on top of numpy views have so far resulted in copies. I would also need to read the data back from a Feather/Parquet file and potentially convert it to a numpy array again, landing back on np.complex64.

I think this has come up once or twice already: I found this thread: https://www.mail-archive.com/dev@arrow.apache.org/msg23352.html and this PR: https://github.com/apache/arrow/pull/10452 and thought I would also +1 this request, just in case.
If there's no first-class support, do you see an alternative way to get zero-copy behavior?
Thanks!
Component(s)
Python