Open asfimport opened 6 years ago
Wes McKinney / @wesm: I would suggest defining optional metadata to indicate that a field's null values use the R sentinel value conventions. That way an R consumer, if it sees the custom metadata, does not have to examine the validity bits and can simply memcpy the values buffer for numbers. R, for its part, could round-trip data to Arrow format with less serialization work.
I don't think that using a specific value for null value slots is a good idea, since it would introduce brittleness into implementations, as there are many ways that a value could end up null. If you had to make a pass over the memory to "sanitize" the null slots to use a particular value, then that would require extra computing work in many cases.
I would suggest defining optional metadata to indicate that a field's null values use the R sentinel value conventions.
This could be ok, see below.
That way an R consumer, if it sees the custom metadata, does not have to examine the validity bits and can simply memcpy the values buffer for numbers. R, for its part, could round-trip data to Arrow format with less serialization work.
Small clarification here: with ALTREP, R would be able to operate in a read-only manner on Arrow data with zero copies, not with a single copy. That is what we want, I think.
I don't think that using a specific value for null value slots is a good idea, since it would introduce brittleness into implementations, as there are many ways that a value could end up null. If you had to make a pass over the memory to "sanitize" the null slots to use a particular value, then that would require extra computing work in many cases.
Well if it is optional, the question then becomes twofold in my mind:
Will the core machinery either automate or offer tools to do this sanitizing pass, or will people be forced to write their own?
If the answer to 2. is that this is left to application owners, then in practice the vast majority of Arrow data would not be R compatible, which I suspect would dramatically curtail R users' interest in and ability to use the Arrow ecosystem.
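For illustration, the sanitizing pass under discussion could look roughly like the sketch below. It is plain Python rather than Arrow library code: `is_valid` and `sanitize_int32` are hypothetical helpers, and the only things assumed are Arrow's validity-bitmap convention (least-significant-bit ordering, 1 = valid, 0 = null) and R's use of INT_MIN as the integer NA.

```python
# Hypothetical helpers for illustration; not part of any Arrow library.
R_NA_INTEGER = -2**31  # INT_MIN, R's sentinel for a missing 32-bit integer


def is_valid(bitmap: bytes, i: int) -> bool:
    """Read slot i of an Arrow-style validity bitmap (LSB order, 1 = valid)."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def sanitize_int32(values: list, bitmap: bytes) -> list:
    """Return a copy of values with the R sentinel written into every null slot."""
    return [v if is_valid(bitmap, i) else R_NA_INTEGER for i, v in enumerate(values)]


# Slot 1 is null: the bits for slots 0..3 are 1, 0, 1, 1 (least significant first).
values = [10, 999, 30, 40]    # 999 stands in for an undefined value in a null slot
bitmap = bytes([0b00001101])
sanitized = sanitize_int32(values, bitmap)
```

The pass has to visit every slot, which is the extra computing work mentioned in the quoted comment; a real implementation would presumably operate on the buffers in bulk rather than element by element.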
Wes McKinney / @wesm:
Will the core machinery either automate or offer tools to do this sanitizing pass or will people be forced to write their own.
I think it'd be reasonable to have code to prepare data for consumption with R in the common libraries, so a user of the common libraries (e.g. Java/Python/C++/Ruby) could emit the R metadata in IPC payloads so that the R receiver could do less work.
AFAICT this would only apply to numeric and integer/factor vectors, and possibly also boolean. Strings would have to be put into / looked up in R's global string hash table.
cc @romainfrancois
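A rough sketch of how the optional-metadata idea might fit together is shown below. Everything here is an assumption made for illustration: the key `r_na_sentinels` is invented (Arrow defines no such key), `prepare_for_r` and `consume_in_r_style` are hypothetical names, and plain Python lists stand in for Arrow buffers.

```python
# Hypothetical metadata key for illustration; Arrow defines no such key.
R_NA_KEY = b"r_na_sentinels"
R_NA_INTEGER = -2**31  # INT_MIN, R's sentinel for a missing 32-bit integer


def prepare_for_r(values, bitmap):
    """Producer side: write sentinels into null slots and tag the field."""
    out = [
        v if (bitmap[i // 8] >> (i % 8)) & 1 else R_NA_INTEGER
        for i, v in enumerate(values)
    ]
    return out, {R_NA_KEY: b"true"}


def consume_in_r_style(values, bitmap, metadata):
    """Consumer side: skip the validity bitmap when the tag is present."""
    if metadata.get(R_NA_KEY) == b"true":
        return values  # values already follow R conventions; use them as-is
    sanitized, _ = prepare_for_r(values, bitmap)
    return sanitized


bitmap = bytes([0b00000101])  # slots 0 and 2 valid, slot 1 null
tagged_values, metadata = prepare_for_r([1, 7, 3], bitmap)
fast = consume_in_r_style(tagged_values, bitmap, metadata)  # no pass over the bits
slow = consume_in_r_style([1, 7, 3], bitmap, {})            # has to sanitize first
```

With the tag present the consumer hands back the same object it received, which is the "less work" case; without it, the consumer falls back to a full pass over the validity bits.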
Wes McKinney / @wesm: Circling back on this discussion from a year ago.
Now that we have arrow::ExtensionType in C++, technically we could introduce containers for R data that has not been serialized to one of the built-in Arrow types. I'm not sure what you would do with such data, but technically it is now possible to faithfully transport unmodified R data end to end through Arrow's IPC / RPC machinery.
Wes McKinney / @wesm: In Python, note that we don't require any memory copying when converting between null-as-sentinels from pandas to Arrow format. Only a validity/null bitmap has to be allocated.
Here's an example
In [1]: import numpy as np
In [2]: import pyarrow as pa
In [3]: arr = np.random.randn(1000000)
In [4]: arr[::2] = np.nan
In [5]: arrow_arr = pa.array(arr, from_pandas=True)
In [6]: arrow_arr.null_count
Out[6]: 500000
In [7]: pa.total_allocated_bytes()
Out[7]: 125056
Here arrow_arr has two buffers for its memory layout: the validity bitmap and the values buffer.
The values buffer is a zero-copy reference to the from-pandas NumPy array. The validity bitmap must be allocated and populated according to the Arrow format, hence only ~125 KB of memory has to be allocated rather than the ~8 MB needed to create a new double array with 1e6 values.
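The sizes quoted above can be checked with a little arithmetic. The sketch below assumes one validity bit per value and that the allocation is rounded up to a multiple of 64 bytes; the padding figure is an assumption that happens to reproduce the reported number, and `bitmap_bytes` is a hypothetical helper rather than a pyarrow function.

```python
def bitmap_bytes(n_values: int, alignment: int = 64) -> int:
    """Bytes for one validity bit per value, rounded up to the alignment."""
    raw = (n_values + 7) // 8             # whole bytes needed to hold the bits
    return -(-raw // alignment) * alignment  # ceiling division, then scale back up


n = 1_000_000
bitmap_size = bitmap_bytes(n)  # 125000 bytes of bits, padded to 125056
values_size = n * 8            # 8000000 bytes for 1e6 float64 values
```

Under those assumptions the bitmap is roughly 1/64 the size of the values buffer, which is why allocating only the bitmap is so much cheaper than copying the doubles.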
Wes McKinney / @wesm: @romainfrancois do you know if it's possible to achieve this in R?
R uses sentinel values to indicate missingness within atomic vectors (read: arrays in Arrow parlance, AFAIK).
According to @wesm, the value stored in an array slot is currently undefined when the validity bitmap marks that slot as null (bit set to 0).
This forces R to copy and modify the data whenever it adopts, as a native vector, Arrow data that contains missing values.
If the null slots were also written with the relevant sentinel values (INT_MIN for 32-bit integers, and NaN with payload 1954 for double-precision floats), in addition to the bit mask being set, then R would be able to use Arrow as intended without breaking any other systems.
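To make the two sentinels concrete, here is a small sketch that builds them from their bit patterns using only the standard library. The high word `0x7FF00000` is an assumption about one bit pattern that yields a NaN with low word 1954; `low_word` and `is_r_na_real` are hypothetical helper names.

```python
import math
import struct

# R's integer NA is INT_MIN for a 32-bit int.
R_NA_INTEGER = -2**31

# R's double NA is a NaN whose low 32-bit word is 1954. The high word
# 0x7FF00000 used here is an assumed pattern that produces such a NaN.
R_NA_REAL_BITS = (0x7FF00000 << 32) | 1954
R_NA_REAL = struct.unpack("<d", struct.pack("<Q", R_NA_REAL_BITS))[0]


def low_word(x: float) -> int:
    """Return the low 32 bits of the IEEE 754 representation of x."""
    return struct.unpack("<Q", struct.pack("<d", x))[0] & 0xFFFFFFFF


def is_r_na_real(x: float) -> bool:
    """True when x is a NaN carrying the 1954 payload in its low word."""
    return math.isnan(x) and low_word(x) == 1954
```

An ordinary NaN such as `float("nan")` fails the `is_r_na_real` check because its low word is zero, so the NA sentinel stays distinguishable from a computed NaN while other systems still see a plain NaN.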
Reporter: Gabriel Becker
Note: This issue was originally created as ARROW-3263. Please see the migration documentation for further details.