As an example, reinterpreting the data in this way can be useful, particularly in a distributed setting where the data goes through serialization/deserialization steps in which metadata is extracted, sent along, and then reapplied to the data.
This is different from numpy.ndarray.view, right? In the latter case, there is already an array instance which must have a well-defined dtype; it's not just a block of memory. This particular example sounds closest to frombuffer. I'm left wondering a little why the deserialization doesn't use the correct metadata immediately though - can you point to a concrete example?
One could do np.asarray(memoryview(buf)).view(fmt) for example. Though yes, there are similarities to np.frombuffer.
Because the memory is allocated to receive the message before any of that information arrives (it needs to be written somewhere in memory). Only after the metadata and data are stored can they go through the deserialization process.
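A minimal sketch of that flow, with the network receive simulated by a plain byte copy (in practice something like socket.recv_into would fill the preallocated buffer):

```python
import numpy as np

# Buffer is allocated before the message (or its metadata) arrives.
buf = bytearray(12)

# Stand-in for e.g. sock.recv_into(buf): the raw payload lands in the buffer.
buf[:] = np.arange(3, dtype=np.float32).tobytes()

# Only now does the metadata tell us how to interpret those bytes.
fmt = np.float32
arr = np.asarray(memoryview(buf)).view(fmt)
print(arr)  # [0. 1. 2.]
```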
One could do np.asarray(memoryview(buf)).view(fmt) for example

Equivalent to np.asarray(memoryview(buf), dtype=fmt)?
I think I understand the use case, but there's no way to get an array that's untyped in the API, so the "reinterpret memory" use case seems quite niche. And I expect that there will be libraries that don't allow this kind of thing, because memory layout is an implementation detail not exposed to the user. So I'm leaning towards "out of scope" here.
It seems like serialization falls under I/O, which is out of scope completely.
One could do np.asarray(memoryview(buf)).view(fmt) for example

Equivalent to np.asarray(memoryview(buf), dtype=fmt)?
Not if dtype=... means .astype(...). I think this gets back into our discussion earlier.
Maybe a short example helps? Imagine b is received over the wire along with relevant metadata. The data is three float32 numbers (IOW Out[3] is what we want).
In [1]: import numpy as np
In [2]: b = b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"
In [3]: np.asarray(memoryview(b)).view(np.float32)
Out[3]: array([0., 1., 2.], dtype=float32)
In [4]: np.asarray(memoryview(b), dtype=np.float32)
Out[4]:
array([ 0., 0., 0., 0., 0., 0., 128., 63., 0., 0., 0., 64.], dtype=float32)
I think I understand the use case, but there's no way to get an array that's untyped in the API, so the "reinterpret memory" use case seems quite niche.
In our usual case it is not so much that the data is untyped, but the type doesn't necessarily match what it should. Taking the example above, we have...
In [6]: np.asarray(memoryview(b)).dtype
Out[6]: dtype('uint8')
IOW we often have something that is uint8 or int8.
And I expect that there will be libraries that don't allow this kind of thing, because memory layout is an implementation detail not exposed to the user. So I'm leaning towards "out of scope" here.
For clarity, am not looking to manipulate the underlying memory in any way and don't really care how it is represented. Am just trying to patch on the correct formatting. Another way to think of this would be altering the dtype DLPack might use. Suppose one could hack around with the DLPack representation before it goes through the protocol, but that feels a bit clumsy.
It seems like serialization falls under I/O, which is out of scope completely.
It is certainly useful in I/O contexts (communication, file I/O, etc.). Though am not really looking for the protocol to handle the I/O portion or even serialization. Just the ability to perform this cast.
Thanks, that is helpful. The "it has the wrong dtype" issue has come up in at least one other place I think, using DLPack to transfer bool arrays - those weren't supported, so it was done as uint8.
I think the next step here is to figure out how other array libraries do this (if they allow it).
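A rough sketch of that bool workaround, using NumPy on both ends purely for illustration (in the real case the exporter and consumer are different libraries; np.from_dlpack requires a recent NumPy):

```python
import numpy as np

# Exporter side: bool isn't supported for the transfer, so reinterpret the
# 1-byte bool buffer as uint8 before handing it off.
mask = np.array([True, False, True])
wire = mask.view(np.uint8)

# Consumer side: receive the uint8 data via DLPack, then reinterpret back.
received = np.from_dlpack(wire)
restored = received.view(np.bool_)
print(restored)  # [ True False  True]
```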
Another use case for reinterpretation is the ability to convert to and from the underlying byte representation of floating-point numbers.
This is common in the implementation of transcendental functions where you want to manipulate the underlying bits of an IEEE 754 floating-point number directly. Go, e.g., provides dedicated APIs for such reinterpretation (Float64bits and Float64frombits, albeit only operating on a single number). JavaScript exposes an ArrayBuffer from which typed array views can be instantiated, allowing floating-point <=> bits reinterpretation.
The ability to reinterpret the underlying memory (i.e., have a data "view") can certainly be useful in certain classes of numerical algorithms and when you want to vectorize operations. The ability to reinterpret without needing to perform a copy would afford performance benefits.
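For a concrete picture, here is what that bit-level reinterpretation looks like with NumPy's view (shown only as an illustration; the standard currently has no equivalent):

```python
import numpy as np

x = np.array([1.0, -2.5, np.inf])

# Reinterpret the same buffer as the raw IEEE 754 bit patterns (no copy).
bits = x.view(np.uint64)
print([hex(b) for b in bits])  # 1.0 -> 0x3ff0000000000000, etc.

# Manipulate the bits directly (flip the sign bit), then reinterpret back.
flipped = (bits ^ np.uint64(0x8000000000000000)).view(np.float64)
print(flipped)  # [-1.   2.5 -inf]
```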
Currently, the only way to achieve reinterpretation according to the specification is via either (1) manual iteration and data copy or (2) a combination of __dlpack__ and from_dlpack (see interchange), which may or may not involve data copy.
cc @seberg (in case you have thoughts on this one :)
For the use-case of reading blobs from the buffer protocol, I prefer the frombuffer API. OTOH, I guess Dask cannot export buffers, and it doesn't match well for a "reinterpret cast" of an existing array. So there may be a need for view as well (which is a bit more generic I guess?), although it seems less important.
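To make the two routes concrete in NumPy terms (just a sketch of the distinction being drawn):

```python
import numpy as np

buf = b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"

# frombuffer: natural for a raw blob exposing the buffer protocol.
a = np.frombuffer(buf, dtype=np.float32)   # array([0., 1., 2.], dtype=float32)

# view: a "reinterpret cast" of an array object that already exists.
b_ = np.asarray(memoryview(buf))           # uint8 array over the same bytes
b2 = b_.view(np.float32)                   # array([0., 1., 2.]), no copy
```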
Think the main value of view is it allows reinterpreting an existing array and knowing the end array type will be the same (the dtype is of course changed). Whereas with frombuffer, asarray, etc., one needs to know the type of the array to call the right function. With a method, this confusion can be avoided.
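A hypothetical sketch of that point, assuming arrays exposed a NumPy-style view method (the standard does not currently require one):

```python
import numpy as np

def reinterpret(x, dtype):
    # Works for any array object exposing a view method, without knowing
    # which library created x or picking a library-specific frombuffer.
    return x.view(dtype)

x = np.arange(3, dtype=np.float32)
print(reinterpret(x, np.uint32))  # IEEE 754 bit patterns of 0., 1., 2.
```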
As this proposal is currently without a champion, I'll go ahead and close.
In NumPy (and some other libraries) arrays have a method to view the data as another dtype. This is different from astype, as this is taking data that may not be typed (like bytes or bytearray) and applying different dtype metadata on top of it. As an example, reinterpreting the data in this way can be useful, particularly in a distributed setting where the data goes through serialization/deserialization steps in which metadata is extracted, sent along, and then reapplied to the data. Though this can come up in other situations as well.

cc @rgommers @kgryte (since we discussed this briefly earlier)
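As a quick NumPy illustration of the view vs astype distinction described above:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0], dtype=np.float32)

x.view(np.int32)    # reinterpret bytes: [0, 1065353216, 1073741824]
x.astype(np.int32)  # convert values:    [0, 1, 2]

# view also lets untyped bytes pick up dtype metadata without a copy:
raw = bytearray(x.tobytes())
np.asarray(memoryview(raw)).view(np.float32)  # array([0., 1., 2.], dtype=float32)
```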