apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Feature request: R-style NA values #1984

Closed st-pasha closed 6 years ago

st-pasha commented 6 years ago

In R, logical and integer NA values are stored as the most negative value representable by the underlying C int type (i.e. INT_MIN). Likewise, numeric NA values are represented as a NaN with a specific payload value.
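For readers unfamiliar with these encodings, here is a minimal sketch of the two sentinels using only the Python standard library. It assumes R's usual convention that the numeric NA is a NaN whose low 32-bit word equals 1954; the helper names are made up for illustration.

```python
import math
import struct

# Integer/logical NA: the most negative 32-bit signed integer (INT_MIN).
R_NA_INTEGER = -2**31

# Numeric NA: a NaN bit pattern whose low 32-bit word is 1954 (0x7A2).
R_NA_REAL_BITS = 0x7FF00000000007A2


def r_na_real() -> float:
    """Build a double carrying the NA payload (illustrative helper)."""
    return struct.unpack("<d", struct.pack("<Q", R_NA_REAL_BITS))[0]


def is_r_na_real(x: float) -> bool:
    """True if x is a NaN whose low word carries the NA payload."""
    if not math.isnan(x):
        return False
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits & 0xFFFFFFFF) == 1954
```

An ordinary NaN such as `float("nan")` fails this check, which is how NA and NaN stay distinguishable even though both are NaNs at the IEEE 754 level.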

I understand that Arrow already has a way to represent NAs generically, via a numpy-style bitmask marking the null values. When passing data between R and numpy, it is therefore necessary to translate between these two representations, so the "zero-copy data transfer" promise cannot be fulfilled in this case. On the other hand, it is possible to zero-copy transfer data between different numpy instances (e.g. across processes, or by saving the data to a file), because the Arrow format implements numpy NA model natively.

What is currently not possible is zero-copy transfer of data between different R instances with Arrow: one instance has to serialize NAs into a bitmask, while the other has to re-apply the mask to the data.
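The round trip described above can be sketched in plain Python. This is an illustration only, not Arrow library code: it assumes 32-bit integer data with INT_MIN as the sentinel, and it builds the bitmap in Arrow's layout (one bit per value, least-significant bit first, 1 meaning valid). The function names are hypothetical.

```python
INT_MIN = -2**31  # sentinel used for integer NA


def sentinel_to_validity(values):
    """Sender side: derive a validity bitmap and null count from sentinels."""
    bitmap = bytearray((len(values) + 7) // 8)
    null_count = 0
    for i, v in enumerate(values):
        if v == INT_MIN:
            null_count += 1
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # set bit i: value is valid
    return bytes(bitmap), null_count


def apply_validity(values, bitmap):
    """Receiver side: write the sentinel back wherever the bit is 0."""
    return [
        v if (bitmap[i // 8] >> (i % 8)) & 1 else INT_MIN
        for i, v in enumerate(values)
    ]
```

Both directions touch every element, which is the extra pass that makes the transfer between two R processes not zero-copy.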

A simple fix to address this issue is to introduce an optional boolean flag r_nas (or embedded_nas) on Int/Bool/Time/etc. types, which would signal that NA values are stored in the data itself rather than in a separate mask. It would be a simple task for a consumer implementation to automatically recode the data into its preferred representation. Such a change would be strictly beneficial for R users, while not imposing any cost on Python users.

pitrou commented 6 years ago

I think this needs further discussion (perhaps on the arrow-dev mailing-list?) since it has far-reaching implications -- the new format would have to be understood by any Arrow-consuming library.

wesm commented 6 years ago

@st-pasha could you create a JIRA about this issue and possibly start a mailing list discussion? You are not the first person to ask me (even recently) about handling of null sentinels.

In short, the probability of null sentinels formally becoming part of the Arrow columnar format is very low. Having two different ways to represent nulls doesn't seem practical to me -- in my opinion, it is not the responsibility of the Arrow format to adopt the memory representations used by other systems.

However, I believe we could define field-level metadata to signal to consumers that data appearing to have no nulls actually has nulls encoded as null sentinels. Thus, R processes could transmit raw numeric data with zero copy without creating validity bitmaps. If you are sending data from R to another system that uses Arrow and is not aware of R's null sentinel conventions (and whatever metadata semaphore we might define), then you'll want to produce bitmaps and indicate the null count.
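As a rough illustration of this idea, the sketch below models a field with free-form key/value metadata and a consumer that only interprets sentinels when it recognizes the marker. The metadata key `null_sentinel`, its value `int_min`, and the `FieldInfo` class are all hypothetical; no such convention is defined by Arrow.

```python
from dataclasses import dataclass, field

INT_MIN = -2**31
SENTINEL_KEY = "null_sentinel"  # hypothetical metadata key, not part of Arrow


@dataclass
class FieldInfo:
    """Stand-in for a schema field carrying key/value metadata."""
    name: str
    metadata: dict = field(default_factory=dict)


def count_nulls(info, values):
    """Count nulls only if the field declares the sentinel convention."""
    if info.metadata.get(SENTINEL_KEY) == "int_min":
        return sum(1 for v in values if v == INT_MIN)
    # A consumer unaware of the convention sees plain data with no nulls.
    return 0
```

A consumer that does not know the key would read INT_MIN as an ordinary value, which is why a producer sending to such a system would still need to build the validity bitmap and report the null count.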

A couple of comments to other points in your issue:

> because the Arrow format implements numpy NA model natively.

NumPy doesn't have an NA model.

> numpy-style bitmask marking

NumPy has a notion of masked arrays, using byte-width boolean arrays. It's not the same as Arrow's null bitmaps, though. The use of masked arrays is not normalized at all in the Python world -- we don't use them in pandas, for example.
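To make the difference concrete, here is a small standard-library sketch that converts a byte-per-element mask into a bit-packed validity bitmap. It assumes the numpy.ma convention that a true mask entry marks an invalid element, whereas in an Arrow bitmap a set bit marks a valid one; the function name is made up for illustration.

```python
def byte_mask_to_validity(mask):
    """Pack a byte-width mask (True = masked) into a validity bitmap.

    The output uses one bit per element, least-significant bit first,
    with 1 meaning valid, so the polarity is inverted relative to the mask.
    """
    bitmap = bytearray((len(mask) + 7) // 8)
    for i, masked in enumerate(mask):
        if not masked:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)
```

Four mask entries occupy four bytes in the byte-width form but fit in a single byte of the bitmap, so the two layouts cannot share a buffer.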

wesm commented 6 years ago

Closing in favor of further discussion on the mailing list and/or JIRA.