apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Use R sentinel values for missingness in addition to bitmask #19603

Open asfimport opened 6 years ago

asfimport commented 6 years ago

R uses sentinel values to indicate missingness within atomic vectors (arrays, in Arrow parlance, AFAIK).

According to @wesm, the value stored in memory for an array slot is currently undefined when the bitmap marks that slot as missing.

This forces R to copy and modify the data whenever it adopts Arrow data with missing values as a native vector.

If the relevant sentinel values (INT_MIN for 32-bit integers, and NaN with payload 1954 for double-precision floats) were written to those slots in addition to setting the bitmask, then R would be able to use Arrow as intended without breaking any other systems.
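
For reference, the two sentinels can be written out as bit patterns. This is a minimal sketch using only Python's standard library. The constant and function names are illustrative, and the double layout (high word 0x7FF00000, low word 1954) is assumed here to match how R builds NA_real_ rather than taken from a specification.

```python
import math
import struct

# R's NA_integer_ is the smallest 32-bit signed integer (INT_MIN).
R_NA_INTEGER = -2**31

# Assumed layout of R's NA_real_: a NaN whose high 32 bits are 0x7FF00000
# and whose low 32 bits hold the payload 1954.
R_NA_REAL_BITS = (0x7FF00000 << 32) | 1954


def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_r_na_real_bits(bits: int) -> bool:
    """True if the bit pattern is a NaN whose low 32 bits equal 1954."""
    exponent_all_ones = ((bits >> 52) & 0x7FF) == 0x7FF
    mantissa_nonzero = (bits & ((1 << 52) - 1)) != 0
    low_word_is_1954 = (bits & 0xFFFFFFFF) == 1954
    return exponent_all_ones and mantissa_nonzero and low_word_is_1954


na_real = bits_to_double(R_NA_REAL_BITS)
print(math.isnan(na_real))                    # other systems just see a NaN
print(hex(R_NA_REAL_BITS))                    # the assumed NA_real_ bit pattern
print(struct.pack("<i", R_NA_INTEGER).hex())  # little-endian bytes of INT_MIN
```

The point of the check on the low word is that a consumer that does not know about R still sees a perfectly ordinary NaN, while R can tell NA apart from other NaNs.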

Reporter: Gabriel Becker

Note: This issue was originally created as ARROW-3263. Please see the migration documentation for further details.

asfimport commented 6 years ago

Wes McKinney / @wesm: I would suggest defining optional metadata to indicate that a field's null values use the R sentinel value conventions. That way an R consumer that sees the custom metadata does not have to examine the validity bits and can simply memcpy the values buffer for numbers. R, for its part, could round-trip data to Arrow format with less serialization work.

I don't think that using a specific value for null value slots is a good idea, since it would introduce brittleness into implementations, as there are many ways that a value could end up null. If you had to make a pass over the memory to "sanitize" the null slots to use a particular value, then that would require extra computing work in many cases.
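
To make that extra work concrete, here is a rough sketch of such a sanitizing pass in plain Python. It assumes Arrow's validity bitmap convention (least-significant-bit ordering, where a set bit means the slot is valid). The function names and sample data are hypothetical, and a real implementation would operate on raw buffers in the native libraries.

```python
from array import array

R_NA_INTEGER = -2**31  # R's sentinel for a missing 32-bit integer (INT_MIN)


def is_valid(bitmap: bytes, i: int) -> bool:
    """Arrow-style validity check: bit i is set when slot i holds a value."""
    return ((bitmap[i // 8] >> (i % 8)) & 1) == 1


def sanitize_nulls(values: array, bitmap: bytes, sentinel: int) -> int:
    """Overwrite every null slot in place with `sentinel`; return the number rewritten."""
    rewritten = 0
    for i in range(len(values)):
        if not is_valid(bitmap, i):
            values[i] = sentinel
            rewritten += 1
    return rewritten


# Five int32 slots; slots 1 and 3 are null and hold arbitrary leftover data.
values = array("i", [10, 999, 30, -7, 50])
bitmap = bytes([0b00010101])  # bits 0, 2 and 4 set, so slots 0, 2 and 4 are valid

count = sanitize_nulls(values, bitmap, R_NA_INTEGER)
print(count, values.tolist())
```

The loop touches every slot once, which is the extra pass over memory described above.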

asfimport commented 6 years ago

Gabriel Becker:

> I would suggest defining optional metadata to indicate that a field's null values use the R sentinel value conventions.

This could be ok, see below.

> That way an R consumer that sees the custom metadata does not have to examine the validity bits and can simply memcpy the values buffer for numbers. R, for its part, could round-trip data to Arrow format with less serialization work.

Small clarification here: with ALTREP, R would be able to operate on Arrow data in a read-only manner with zero copies, not with a single copy. That is what we want, I think.

> I don't think that using a specific value for null value slots is a good idea, since it would introduce brittleness into implementations, as there are many ways that a value could end up null. If you had to make a pass over the memory to "sanitize" the null slots to use a particular value, then that would require extra computing work in many cases.

Well, if it is optional, the question then becomes twofold in my mind:

  1. What is the default? Will Arrow produce R-compatible data unless an option is turned off (for cases where people don't care about R and want the extra speed), or will it be incompatible by default?
  2. Will the core machinery automate this sanitizing pass or offer tools to do it, or will people be forced to write their own?

    If the answer to 2 is that this is left to application owners, then in practice the vast majority of Arrow data would not be R-compatible, which I suspect would dramatically curtail R users' interest in, and ability to use, the Arrow ecosystem.

asfimport commented 6 years ago

Wes McKinney / @wesm:

> Will the core machinery either automate or offer tools to do this sanitizing pass or will people be forced to write their own.

I think it'd be reasonable to have code to prepare data for consumption with R in the common libraries, so a user of the common libraries (e.g. Java/Python/C++/Ruby) could emit the R metadata in IPC payloads so that the R receiver could do less work.

AFAICT this would only apply to numeric and integer/factor vectors, and possibly also boolean. Strings would have to be put into, or looked up in, the global string hash table.

cc @romainfrancois

asfimport commented 5 years ago

Wes McKinney / @wesm: Circling back on this discussion from a year ago.

Now that we have arrow::ExtensionType in C++, we could technically introduce containers for R data that has not been serialized to one of the built-in Arrow types. I'm not sure what you would do with such data, but it is now possible to faithfully transport unmodified R data end to end through Arrow's IPC / RPC machinery.

asfimport commented 5 years ago

Wes McKinney / @wesm: In Python, note that we don't require any memory copying when converting pandas data that uses sentinels for nulls to Arrow format. Only a validity/null bitmap has to be allocated.

Here's an example:

In [1]: import numpy as np

In [2]: import pyarrow as pa

In [3]: arr = np.random.randn(1000000)

In [4]: arr[::2] = np.nan

In [5]: arrow_arr = pa.array(arr, from_pandas=True)

In [6]: arrow_arr.null_count
Out[6]: 500000

In [7]: pa.total_allocated_bytes()
Out[7]: 125056

Here arrow_arr has two buffers for its memory layout:

asfimport commented 5 years ago

Wes McKinney / @wesm: @romainfrancois do you know if it's possible to achieve this in R?